Train NER model in NLTK with custom corpus


Asked 9/3, 2017 at 21:55 Answered 26/9, 2017 at 18:32

python nlp nltk named-entity-recognition

I have an annotated corpus in the CoNLL-2002 format: a tab-separated file in which each line holds a token, its POS tag, and an IOB tag joined to the entity type. Example:

John NNP B-PERSON
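For reference, a file in this layout can be read into per-sentence lists of (token, POS, IOB) triples with plain Python. This is only a sketch: the function name `read_conll` and the sample text are hypothetical, and it assumes blank lines separate sentences, which is the usual CoNLL convention but is not stated above.

```python
def read_conll(text):
    """Split CoNLL-style text into sentences of (token, pos, iob) triples.

    Assumes three tab-separated columns per line and a blank line
    between sentences (an assumption, not stated in the question).
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, iob = line.split("\t")  # raises ValueError on malformed lines
        current.append((token, pos, iob))
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the layout described above.
sample = "John\tNNP\tB-PERSON\nlives\tVBZ\tO\n\nMary\tNNP\tB-PERSON\n"
sentences = read_conll(sample)
```

For a real corpus the text would come from the file itself, for example `open(path, encoding="utf-8").read()`.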

I want to train a Portuguese NER model in NLTK, preferably a MaxEnt model. I do not want to use the "built-in" Stanford NER wrapper in NLTK, since I have already used the stand-alone Stanford NER. I want the MaxEnt model as a point of comparison with the Stanford NER.

I found NLTK-trainer but I wasn't able to use it.

How can I achieve this?

Raiment answered 9/3, 2017 at 21:55 Comment(0)

Chapters 6 and 7 of the NLTK book explain how to train a "chunker" on an IOB-encoded corpus. The example in chapter 7 does NP chunking, but that's incidental: your chunker will chunk whatever you train it on. You'll need to decide what features are useful for named entity recognition; chapter 6 covers the basics of choosing features for a classifier. Finally, look at the source for the features used by NLTK's own named entity chunker. They'll probably do a pretty good job in Portuguese too; then you can try adding stemming or other Portuguese-specific features.
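As a rough illustration of that approach, here is a minimal sketch of a MaxEnt token classifier over IOB tags, loosely following the classifier-based chunker pattern from chapter 7. The helper names (`ner_features`, `train_ner`, `tag_sentence`), the feature set, the `gis` algorithm choice, the iteration limit, and the toy sentences are all assumptions made for illustration. They are not NLTK's own named entity features, and two sentences are far too few to train anything useful.

```python
from nltk.classify import MaxentClassifier

# Invented toy data for illustration only: sentences of (token, POS, IOB) triples.
TRAIN_SENTS = [
    [("João", "NNP", "B-PERSON"), ("mora", "VBZ", "O"),
     ("em", "IN", "O"), ("Lisboa", "NNP", "B-LOCATION")],
    [("Maria", "NNP", "B-PERSON"), ("Silva", "NNP", "I-PERSON"),
     ("visitou", "VBD", "O"), ("o", "DT", "O"), ("Porto", "NNP", "B-LOCATION")],
]


def ner_features(sentence, i, history):
    """Features for token i.

    `sentence` is a list of (token, POS) pairs; `history` holds the IOB
    tags already assigned to tokens 0..i-1.
    """
    word, pos = sentence[i]
    if i == 0:
        prev_word, prev_pos, prev_tag = "<START>", "<START>", "<START>"
    else:
        prev_word, prev_pos = sentence[i - 1]
        prev_tag = history[i - 1]
    return {
        "word": word.lower(),
        "pos": pos,
        "is_capitalized": word[:1].isupper(),
        "suffix3": word[-3:].lower(),
        "prev_word": prev_word.lower(),
        "prev_pos": prev_pos,
        "prev_tag": prev_tag,
    }


def train_ner(train_sents, max_iter=10):
    """Build (featureset, IOB tag) pairs and fit a MaxEnt classifier."""
    train_set = []
    for sent in train_sents:
        untagged = [(tok, pos) for tok, pos, _ in sent]
        history = []
        for i, (_, _, iob) in enumerate(sent):
            train_set.append((ner_features(untagged, i, history), iob))
            history.append(iob)  # gold tags serve as history during training
    return MaxentClassifier.train(train_set, algorithm="gis",
                                  trace=0, max_iter=max_iter)


def tag_sentence(classifier, sentence):
    """Tag a list of (token, POS) pairs left to right; return triples."""
    history = []
    for i in range(len(sentence)):
        history.append(classifier.classify(ner_features(sentence, i, history)))
    return [(tok, pos, iob) for (tok, pos), iob in zip(sentence, history)]


classifier = train_ner(TRAIN_SENTS)
tagged = tag_sentence(
    classifier,
    [("Ana", "NNP"), ("mora", "VBZ"), ("em", "IN"), ("Lisboa", "NNP")],
)
```

The resulting (token, POS, IOB) triples can be turned into a chunk tree with `nltk.chunk.conlltags2tree`, which is also a convenient form for evaluation against the gold annotations.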

Foreclosure answered 26/9, 2017 at 18:32 Comment(3)

Thank you, I managed to figure it out eventually. Check my GitHub repository for more info on this. – Raiment 2/10, 2017 at 15:48

Glad to hear that. If my answer solved your problem, please "accept" it by clicking on the check mark. – Foreclosure 21/3, 2018 at 15:40

PS. Took a look at your page. That's pretty abysmal performance you got so far... – Foreclosure 21/3, 2018 at 15:42
