Train NER model in NLTK with custom corpus

I have an annotated corpus in the CoNLL-2002 format, namely a tab-separated file with one token per line, followed by its POS tag and an IOB tag that carries the entity type. Example:

John NNP B-PERSON
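
For illustration, a file in this layout could be read into sentences roughly as follows. This is a minimal sketch, assuming blank lines separate sentences and every non-blank line has exactly three tab-separated fields; `read_conll` is a hypothetical helper name, and every sample row except the `John` line is made up for the example.

```python
from typing import Iterable, List, Tuple

TaggedToken = Tuple[str, str, str]  # (token, POS tag, IOB entity tag)


def read_conll(lines: Iterable[str]) -> List[List[TaggedToken]]:
    """Group tab-separated 'token, POS, IOB tag' lines into sentences."""
    sentences: List[List[TaggedToken]] = []
    current: List[TaggedToken] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Assumption: a blank line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, iob = line.split("\t")
        current.append((token, pos, iob))
    if current:
        # Keep the last sentence when the input has no trailing blank line.
        sentences.append(current)
    return sentences


# Illustrative rows only; a real corpus would be passed as an open file object.
sample = [
    "John\tNNP\tB-PERSON\n",
    "sleeps\tVBZ\tO\n",
    "\n",
    "Mary\tNNP\tB-PERSON\n",
]
sentences = read_conll(sample)
```

Since an open text file is itself an iterable of lines, the same function should work on a file handle opened with `encoding="utf-8"`.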

I want to train a Portuguese NER model in NLTK, preferably a MaxEnt model. I do not want to use the "built-in" Stanford NER in NLTK, since I was already able to use the stand-alone Stanford NER; I want the MaxEnt model as a point of comparison against it.

I found NLTK-trainer but I wasn't able to use it.

How can I achieve this?

Raiment answered 9/3, 2017 at 21:55 Comment(0)

Chapters 6 and 7 of the NLTK book explain how to train a "chunker" on an IOB-encoded corpus. The example in chapter 7 does NP chunking, but that's incidental: your chunker will chunk whatever you train it on. You'll need to decide which features are useful for named entity recognition; chapter 6 covers the basics of choosing features for a classifier. Finally, look at the source for the features used by NLTK's own named entity chunker. They'll probably do a pretty good job in Portuguese too; from there you can try adding stemming or other Portuguese-specific features.
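
To make the approach concrete, here is a rough sketch of the classifier-based pattern from chapter 7, adapted to entity tags. It is not the feature set of NLTK's own named entity chunker: the feature names, the helper functions (`token_features`, `build_training_set`, `train_maxent`) and the demo sentence are all illustrative assumptions. The GIS algorithm is picked only because it needs no external binary such as megam; it does rely on NumPy being installed.

```python
def token_features(sentence, i, history):
    """Features for token i.

    `sentence` is a list of (token, pos) pairs; `history` holds the IOB
    tags already assigned to tokens 0..i-1.
    """
    token, pos = sentence[i]
    prev_pos = sentence[i - 1][1] if i > 0 else "<START>"
    prev_iob = history[i - 1] if i > 0 else "<START>"
    return {
        "word.lower": token.lower(),
        "pos": pos,
        "is_title": token.istitle(),  # capitalisation is a common NER cue
        "suffix3": token[-3:],
        "prev_pos": prev_pos,
        "prev_iob": prev_iob,
    }


def build_training_set(tagged_sentences):
    """Turn sentences of (token, pos, iob) triples into (features, iob) pairs."""
    train_set = []
    for tagged in tagged_sentences:
        untagged = [(tok, pos) for tok, pos, _ in tagged]
        history = []
        for i, (_, _, iob) in enumerate(tagged):
            train_set.append((token_features(untagged, i, history), iob))
            history.append(iob)  # gold tags serve as history during training
    return train_set


def train_maxent(tagged_sentences, max_iter=10):
    """Train a MaxEnt classifier over the per-token feature dicts."""
    # Imported here so the feature code above stays usable without NLTK.
    import nltk

    return nltk.MaxentClassifier.train(
        build_training_set(tagged_sentences),
        algorithm="gis",
        trace=0,
        max_iter=max_iter,
    )


# Tiny made-up sentence, just to show the shape of the training data.
demo = [[("John", "NNP", "B-PERSON"), ("sleeps", "VBZ", "O")]]
train_set = build_training_set(demo)
```

At prediction time you would tag each sentence left to right, calling the classifier's `classify` method on `token_features(...)` and appending each predicted tag to `history` before moving to the next token.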

Foreclosure answered 26/9, 2017 at 18:32 Comment(3)
Thank you, I managed to figure it out eventually. Check my GitHub repository for more info on this. - Raiment
Glad to hear that. If my answer solved your problem, please "accept" it by clicking on the check mark. - Foreclosure
PS. Took a look at your page. That's pretty abysmal performance you've got so far... - Foreclosure
