How do I train a Named Entity Recognizer in OpenNLP?

OK, I have the following code to train the named entity recognizer from OpenNLP:

// Each line of train.txt is parsed into one NameSample.
FileReader fileReader = new FileReader("train.txt");
ObjectStream<String> fileStream = new PlainTextByLineStream(fileReader);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(fileStream);
// Train with no additional resources (empty map).
TokenNameFinderModel model = NameFinderME.train("pt-br", "train", sampleStream, Collections.<String, Object>emptyMap());
nfm = new NameFinderME(model);

I don't know if I'm doing something wrong or if something is missing, but the classification is not working. I suspect that train.txt is wrong.

The problem is that all tokens are classified as the same single type.

My train.txt data looks something like the following example, but with many more entries and more variation. Another thing is that I'm classifying the text one word at a time, rather than passing all the tokens at once.

<START:distance> 8000m <END>
<START:temperature> 100ºC <END>
<START:weight> 50kg <END>
<START:name> Renato <END>

Can somebody show me what I'm doing wrong?

Vantassel answered 5/8, 2011 at 6:51 Comment(3)
Could you please tell me what version of OpenNLP you are using? I am using OpenNLP 1.5.1 and there is no model file for temperature, distance and weight. - Hinojosa
@raj.singh I'm not using OpenNLP. I'm coding my own classifier for my purposes now. - Vantassel
Hi @Renato Dinhani, I have the same problem. Can you tell me how you solved this in your application? I get this exception when I try to use my train.txt: java.security.NoSuchAlgorithmException - Conias

Your training data is not OK.

You should put all entities in a context inside a sentence:

At an altitude of <START:distance> 8000m <END> the temperature of boiling water is less than <START:temperature> 100ºC <END> .
The climber <START:name> Renato <END> is carrying <START:weight> 50kg <END> of equipment.
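
To make the difference between the two training files concrete, here is a minimal sketch in plain Java that reads one annotated line and reports which tokens fall inside an entity and which are surrounding context. It is not OpenNLP's own parser: the class `AnnotatedLineParser` and its nested types `Entity` and `ParsedLine` are hypothetical names introduced only for illustration, and the sketch assumes tags and tokens are separated by whitespace, as in the examples above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of OpenNLP: it only illustrates how one
// whitespace-separated training line splits into tokens and entity spans.
class AnnotatedLineParser {

    /** One annotated entity: its type and the token range [start, end) it covers. */
    static final class Entity {
        final String type;
        final int start;
        final int end;
        final String text;

        Entity(String type, int start, int end, String text) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** The plain tokens of a line (tags removed) plus the entities found in it. */
    static final class ParsedLine {
        final List<String> tokens = new ArrayList<>();
        final List<Entity> entities = new ArrayList<>();
    }

    static ParsedLine parse(String line) {
        ParsedLine result = new ParsedLine();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String openType = null; // type of the entity currently open, if any
        int openStart = -1;     // token index where the open entity begins
        for (String part : trimmed.split("\\s+")) {
            if (part.startsWith("<START:") && part.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("Nested start tag: " + part);
                }
                openType = part.substring("<START:".length(), part.length() - 1);
                openStart = result.tokens.size();
            } else if (part.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without a start tag");
                }
                int end = result.tokens.size();
                String text = String.join(" ", result.tokens.subList(openStart, end));
                result.entities.add(new Entity(openType, openStart, end, text));
                openType = null;
            } else {
                result.tokens.add(part); // ordinary token (entity content or context)
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("Missing <END> for type " + openType);
        }
        return result;
    }

    public static void main(String[] args) {
        ParsedLine p = parse("The climber <START:name> Renato <END> is carrying "
                + "<START:weight> 50kg <END> of equipment.");
        for (Entity e : p.entities) {
            System.out.println(e.type + " [" + e.start + "," + e.end + ") " + e.text);
        }
        System.out.println(p.tokens.size() + " tokens in total");
    }
}
```

Run on a line from the question, such as `<START:distance> 8000m <END>`, this yields a single token that is entirely covered by the entity, so there are no context tokens left over. Run on the sentences above, most tokens lie outside the entities, and those are the tokens that supply context.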

You will get better results if your training data is derived from real-world sentences and has the same style as the sentences you are classifying. For example, you should train on a newspaper corpus if you are going to process news.

You will also need thousands of sentences to build your model! You could start with about a hundred to bootstrap, use the resulting poor model to help improve your corpus, and then train your model again.

And of course you should classify all tokens of a sentence together; otherwise there is no context to decide the type of an entity.

Priedieu answered 5/8, 2011 at 9:45 Comment(2)
Hi wcolen, I have the same problem. Can you give me a link or an example of sentence-based training data? - Conias
@Riddhish.Chaudhari, see the example here: svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/… . You should have one sentence per line, and a blank line for a new paragraph. - Priedieu
