How to create a good NER training model in OpenNLP?

I have just started with OpenNLP. I need to create a simple training model to recognize named entities.

Reading the docs at https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind, I see this simple sample text for training the model:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC ,
    was named a director of this British industrial conglomerate .

I have two questions:

  • Why do I have to put the persons' names in the context of a sentence? Why not write one person's name per line, like this:

    <START:person> Robert <END>
    
    <START:person> Maria <END>
    
    <START:person> John <END>
    
  • How can I also attach extra information to a name? For example, I would like to store Male/Female for each name.

(I know there are systems that try to guess it from the last letter, such as "a" for female, but I would like to add it myself.)

Thanks.

Chilblain answered 14/8, 2015 at 13:43 Comment(0)

To answer your first question: the algorithm works on the surrounding context (tokens) within a sentence; it is not a simple lookup mechanism. OpenNLP builds its model with maximum entropy, which is a form of multinomial logistic regression. The point is to reduce "word sense ambiguity" and find entities in context. For instance, if my name is April, I can easily be confused with the month of April, and if my name is May, I could be confused with the month of May as well as the verb "may".

As for the second part of your first question: you could make a list of known names and use it in a program that scans your sentences and annotates them automatically, which helps you create a training set. However, a list of names alone, without context, will not train the model sufficiently, or at all. In fact, there is an OpenNLP addon called the "modelbuilder addon" designed for this: you give it a file of names, and it uses the names and some of your data (sentences) to train a model. If you are looking for particular names of generally unambiguous entities, you may be better off using a list and something like regex to find names rather than NER.
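The "list of known names plus your own sentences" idea can be sketched roughly as follows. This is a minimal illustration, not the modelbuilder addon and not part of OpenNLP: the class `NameAnnotator` and its methods are hypothetical names, and the sketch assumes sentences are already whitespace-tokenized, as in the training sample from the docs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not an OpenNLP class): wraps known names in name-finder training tags. */
final class NameAnnotator {
    private final Set<String> names = new HashSet<>();
    private int maxTokens = 1;

    NameAnnotator(List<String> knownNames) {
        for (String name : knownNames) {
            String[] parts = name.trim().split("\\s+");
            names.add(String.join(" ", parts));
            maxTokens = Math.max(maxTokens, parts.length);
        }
    }

    /** Expects a whitespace-tokenized sentence; returns it with every known name tagged. */
    String annotate(String sentence, String type) {
        String[] tokens = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest candidate first so "Pierre Vinken" wins over "Pierre".
            for (int len = Math.min(maxTokens, tokens.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                if (names.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                out.add("<START:" + type + ">");
                out.addAll(Arrays.asList(tokens).subList(i, i + matched));
                out.add("<END>");
                i += matched;
            } else {
                out.add(tokens[i]);
                i++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        NameAnnotator annotator = new NameAnnotator(List.of("Pierre Vinken", "Vinken"));
        // Prints: Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. .
        System.out.println(annotator.annotate("Mr . Vinken is chairman of Elsevier N.V. .", "person"));
    }
}
```

Note that a plain lookup like this will also tag "April" the month if April is in the name list, so the generated lines still need review before they are used as training data.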

As for your second question, there are a few options, but in general I don't think NER is a great tool for determining something like gender, although with enough training sentences you may get decent results. Because NER uses a model based on the surrounding tokens in your training sentences to establish that a named entity exists, it can't do much to identify gender. You may be better off finding all person names and then looking them up in an index of names that you know are male or female. Also, some names, like Pat, are both male and female, and in most textual data there will be no indication of which it is, to either human or machine. That being said, you could create separate male and female models, or you could create different entity types within the same model. I've never tried this, but it might do OK; you'd have to test it on your data. The annotation could look like this, using the entity type names male.person and female.person:

<START:male.person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mrs . <START:female.person> Maria <END> is chairman of Elsevier N.V. , the Dutch publishing group .
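The alternative of finding all person names first and then consulting an index of known male and female names could look something like the sketch below. Everything here is an assumption for illustration: `GenderIndex`, its `Gender` enum, and the first-token lookup rule are hypothetical and are not part of OpenNLP. The name finder would supply the `fullName` strings.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical post-processing step: map names found by NER to a gender via a lookup table. */
final class GenderIndex {
    enum Gender { MALE, FEMALE, AMBIGUOUS, UNKNOWN }

    private final Map<String, Gender> byFirstName = new HashMap<>();

    /** Registers a first name; a name registered under both genders becomes AMBIGUOUS. */
    void add(String firstName, Gender gender) {
        byFirstName.merge(key(firstName), gender,
                (old, fresh) -> old == fresh ? old : Gender.AMBIGUOUS);
    }

    /** Looks up the first token of a detected name; returns UNKNOWN if it is not in the index. */
    Gender lookup(String fullName) {
        String first = fullName.trim().split("\\s+")[0];
        return byFirstName.getOrDefault(key(first), Gender.UNKNOWN);
    }

    private static String key(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        GenderIndex index = new GenderIndex();
        index.add("Pierre", Gender.MALE);
        index.add("Maria", Gender.FEMALE);
        index.add("Pat", Gender.MALE);
        index.add("Pat", Gender.FEMALE);
        System.out.println(index.lookup("Pierre Vinken"));
        System.out.println(index.lookup("Pat Smith"));
    }
}
```

With this split, a single `person` model only has to find names, and names like Pat are reported as ambiguous rather than guessed.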

NER = Named Entity Recognition

HTH

Centenary answered 17/8, 2015 at 18:38 Comment(11)
Thanks! Yes, I should follow your example and use female.person and male.person. The problem is that I have many names and surnames (around 200k), so should I write the same sentences again and again with a different name each time? For example <START:male.person> Pierre Vinken <END> , 61 years old ... then <START:male.person> John Travolta <END> , 61 years old ... and so on? - Chilblain
Do we need to do it this way? - Chilblain
Are you able to find names with spaces in between? - Trophic
@RajkumarPalani Yes, it will find names with spaces if you train with them. - Centenary
@markg But to find names with spaces, we have to build our own tokenizer model. Otherwise the names will be split and our model cannot recognize them. I found this when I tried my own .bin file and en-token.bin; they gave different results. Am I correct? - Bavardage
@Nuwanda No, that is not actually correct. The model is based on your annotated sentences, and the tokenizer is just the way OpenNLP looks at your sentence. However, it is not uncommon for the NER to find only part of a name even if you trained it on multipart names with spaces... NLP is hard :-) - Centenary
@markg So what can I do to make it work better, other than giving it loads of training data? Any good practices? - Bavardage
Well, I once combined it with output from the chunker. If, say, you're looking for people's multipart names, you could detect a name, then check whether it is inside a noun phrase, and if it is, take the whole noun phrase, which may get you the whole name including spaces. But even with this I had to do some cleanup on the noun phrases, and the chunker can be wrong as well, especially on sentences unlike those it was trained on. Otherwise, more training data is about all you can do, apart from heuristics like the one I just described. - Centenary
@markg Hey Mark, could you please answer this question? #37384009 - Bavardage
@markg Thanks for the informative answer. I'm curious, though: how important is the specific name within the training sentence? If there are a billion different names, I might as well put XYZ instead of Pierre Vinken, no? - Pend
@FredrikL Good question. I'm actually not sure how much the actual names factor in; I'd have to test that. If it does matter, you may be able to do some substitution with random names (the US Census is a great source of first and last names). - Centenary
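The substitution idea from the last comment could be sketched as below, assuming you keep a handful of template sentences with a placeholder where the name goes. The class `NameSubstitution`, the `{NAME}` placeholder, and the fixed seed are all hypothetical choices for illustration, and whether varying the names actually improves the model is untested, as noted in the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical generator: fills template sentences with random names to vary the training set. */
final class NameSubstitution {
    static final String SLOT = "{NAME}";

    /** Produces perTemplate annotated lines for each template, drawing names at random. */
    static List<String> expand(List<String> templates, List<String> names,
                               String type, int perTemplate, long seed) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("names must not be empty");
        }
        Random random = new Random(seed); // fixed seed keeps the output reproducible
        List<String> lines = new ArrayList<>();
        for (String template : templates) {
            for (int i = 0; i < perTemplate; i++) {
                String name = names.get(random.nextInt(names.size()));
                lines.add(template.replace(SLOT, "<START:" + type + "> " + name + " <END>"));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> templates = List.of(
                "{NAME} , 61 years old , will join the board as a nonexecutive director Nov. 29 .");
        List<String> names = List.of("Pierre Vinken", "Rudolph Agnew", "John Travolta");
        expand(templates, names, "person", 3, 42L).forEach(System.out::println);
    }
}
```

The more varied the template sentences are, the less the model sees the same context repeated with only the name changed.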
