How to create a good NER training model in OpenNLP?

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group . <START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .

The answer to your first question is that the algorithm works on surrounding context(tokens) within a sentence; it's not just a simple lookup mechanism. OpenNLP uses maximum entropy, which is a form of multinomial logistic regression to build its model. The reason for this is to reduce "word sense ambiguity," and find entities in context. For instance, if my name is April, I can easily get confused with the month of April, and if my name is May, then I would get confused with the month of May as well as the verb may. For your second part of the first question, you could make a list of names that are known, and use those names in a program that looks at your sentences and automatically annotates them to help you create a training set, however making a list of names alone without context will not train the model sufficiently or at all. In fact, there is an OpenNLP addon called the "modelbuilder addon" designed for this: you give it a file of names, and it uses the names and some of your data (sentences) to train a model. If you are looking for particular names of generally non ambiguous entities, you may be better off just using a list and something like regex to discover names rather than NER.

As for your second question there are a few options, but in general, I don't think NER is a great tool for delineating something like gender, however with enough training sentences you may get decent results. Since NER uses a model based on surrounding tokens in your sentence training set to establish the existence of a named entity, it can't do much in terms of identifying gender. You may be better off finding all person names, then referencing an index of names that you know are male or female to get a match. Also, some names, like Pat, are both male and female, and in most textual data there will be no indication of which it is to neither human nor machine. That being said, you could create a male and female model separately, or you could create different entity types within the same model. You could use an annotation like this (using different entity type names of male.person and female.person). I've never tried this but it might do ok, you'd have to test it on your data.

<START:male.person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mrs . <START:female.person> Maria <END> is chairman of Elsevier N.V. , the Dutch publishing group

NER= Named Entity Recognition

HTH

Recommended topics

Hot tags