Training n-gram NER with Stanford NLP

Asked 25/3, 2013 at 6:59 Answered 4/10, 2016 at 4:40

Solved nlp stanford-nlp opennlp named-entity-recognition named-entity-extraction

Recently I have been trying to train an NER model on n-gram (multi-word) entities with Stanford CoreNLP. I followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I can specify only unigram tokens and the class each one belongs to. Can anyone guide me on how to extend it to n-grams? I am trying to extract known entities, such as movie names, from a chat data set.

Please correct me if I have misinterpreted the Stanford tutorial and the same format can be used for n-gram training.

What I am stuck on is the following property:

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

Here the first column is the word (unigram) and the second column is the entity, for example

CHAPTER O
I   O
Emma    PERS
Woodhouse   PERS

Training known single-word entities (say the movie names Hulk or Titanic) as movies is easy with this approach. But what is the best approach when I need to train multi-word names such as I know what you did last summer or Baby's day out?

Distress answered 25/3, 2013 at 6:59 Comment(2)

Dear @Arun, did you succeed in training NER for n-grams? I want to train education entities such as Master in Science : EDUCATION, PhD in Electronics : EDUCATION. Can you guide me? Thanks – Lupercalia 19/1, 2017 at 13:43

@KhalidUsman, thanks for reaching out. I used LingPipe, as in the answer below, to achieve this. It worked very well with a reasonably large training dataset. Any model will only be as good as the data set you give it to learn from. – Distress 19/1, 2017 at 16:48

I waited a long time for an answer here and could not figure out how to get this done with Stanford CoreNLP. However, mission accomplished: I used the LingPipe NLP libraries instead. I am posting the answer here because I think someone else could benefit from it.

Please check the LingPipe licensing before diving into an implementation, whether you are a developer, a researcher, or anything else.

Lingpipe provides various NER methods.

1) Dictionary Based NER

2) Statistical NER (HMM Based)

3) Rule Based NER etc.

I have used both the dictionary-based and the statistical approaches.

The first is a direct lookup method; the second is training-based.

An example for the dictionary based NER can be found here

The statistical approach requires a training file. I used a file in the following format:

<root>
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX>  to be trained</s>
...
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX>  annotated </s>
</root>
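The annotated format above is tedious to write by hand when the entities are already known. A minimal sketch of generating it from raw sentences plus a phrase list follows. The class name `EnamexAnnotator` and its methods are hypothetical helpers, not part of LingPipe. It assumes plain character matching, so a phrase is tagged wherever its characters appear, with the longest phrase winning at each position.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of LingPipe): wraps known phrases in ENAMEX tags.
class EnamexAnnotator {

    // Escapes the characters that would otherwise break the XML training file.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Scans the sentence once; at each position the longest matching phrase is tagged.
    static String annotate(String sentence, Map<String, String> phraseToType) {
        StringBuilder out = new StringBuilder("<s>");
        int i = 0;
        while (i < sentence.length()) {
            String best = null;
            for (String phrase : phraseToType.keySet()) {
                if (!phrase.isEmpty() && sentence.startsWith(phrase, i)
                        && (best == null || phrase.length() > best.length())) {
                    best = phrase;
                }
            }
            if (best != null) {
                out.append("<ENAMEX TYPE=\"").append(phraseToType.get(best)).append("\">")
                   .append(escapeXml(best)).append("</ENAMEX>");
                i += best.length();
            } else {
                out.append(escapeXml(String.valueOf(sentence.charAt(i))));
                i++;
            }
        }
        return out.append("</s>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> movies = new LinkedHashMap<>();
        movies.put("I know what you did last summer", "MOVIE");
        movies.put("Titanic", "MOVIE");

        System.out.println("<root>");
        System.out.println(annotate("We watched I know what you did last summer again.", movies));
        System.out.println(annotate("Titanic is still my favourite.", movies));
        System.out.println("</root>");
    }
}
```

Because matching is purely character-based, a short phrase can also fire inside a longer word; checking word boundaries would be a sensible refinement for real data.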

I then used the following code to train the entities.

import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.Muc6ChunkParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

@SuppressWarnings("deprecation")
public class TrainEntities {

    static final int MAX_N_GRAM = 50;
    static final int NUM_CHARS = 300;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt");// my annotated file
        File modelFile = new File("outputmodelfile.model"); 

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory,hmmEstimator);

        System.out.println("Setting up Data Parser");
        Muc6ChunkParser parser = new Muc6ChunkParser();  
        parser.setHandler( chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
    }

}

And to test the NER I used the following class

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        // Load the model compiled by TrainEntities.
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable
                .readObject(modelFile);

        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> chunks = chunking.chunkSet();

        // Print each recognized span together with its entity type.
        for (Chunk c : chunks) {
            System.out.println(testString + " : "
                    + testString.substring(c.start(), c.end()) + " >> "
                    + c.type());
        }
    }
}

Code Courtesy : Google :)

Distress answered 15/4, 2013 at 14:15 Comment(19)

tech.groups.yahoo.com/group/LingPipe/message/68 provides more information on the corpus preparation. – Distress 10/5, 2013 at 5:50

I also tried the same code. Can you please explain how you prepared the training set? I added it as a text file and tried to add my own entity, but it's not working. Please help me; I don't know whether I have misunderstood the training set format. – Callipash 19/4, 2014 at 17:28

The <ENAMEX TYPE="ORGANIZATION">USAir</ENAMEX> flight attendant in the rear of the plane making a short flight to <ENAMEX TYPE="LOCATION">Charlotte</ENAMEX>, <ENAMEX TYPE="LOCATION">N.C.</ENAMEX>, kept peeking around the corner of a seat in Row 21, making 9-month-old <ENAMEX TYPE="PERSON">Danasia Brown</ENAMEX> laugh. – Callipash 19/4, 2014 at 17:32

The training set used is of the same format that I have discussed above. You would require quite a lot of data for the model to 'learn'. Probably some news articles or wiki pages etc etc in well formed sentences. – Distress 20/4, 2014 at 5:3

Please check out the entire discussions at groups.yahoo.com/neo/groups/LingPipe/conversations/topics/68 – Distress 20/4, 2014 at 5:4

Thank you very much, Arun. I got it. One more question: currently this program identifies only one user-defined entity. Can I make it identify all entities in a text? – Callipash 20/4, 2014 at 9:28

Yes you can... please go ahead and add as many entities you want in the same input file. – Distress 20/4, 2014 at 10:37

I have added DAY as an entity with many examples, but if I give <ENAMEX>Tuesday</ENAMEX> as input, the output is incorrectly LOCATION instead of DAY. If the same word, e.g. DELHI, appears more than once in a document, does it need to be tagged as LOCATION each time? I have added many examples to the training set, but for input that was already in the training set it sometimes gives dual output, both DAY and LOCATION. I don't know what went wrong. – Callipash 21/4, 2014 at 4:3

Is it mandatory to put each news item inside its own <s>...</s> tags, or is one common <s> tag enough? I am not getting the correct output for some entities. – Callipash 21/4, 2014 at 4:41

let us continue this discussion in chat – Callipash 21/4, 2014 at 4:47

@ArunAK can you please show a small snippet of your training set? My program is not identifying entities, and I think the cause may be a fault in the training set. – Chockablock 10/7, 2014 at 4:59

@chopu: what format have you used? Can you validate it against chat.stackoverflow.com/rooms/51072/… All you need is a file with a start and end tag like <root> and </root>, and each sentence between <s> and </s>. Whatever entity you want to 'teach' should go between the ENAMEX tags. – Distress 10/7, 2014 at 5:40

Try to download some tagged data sets because, hand prepared ones would be too meager for it to learn. Basically it is expected to learn from context, or from features... where features could be adjacent words, upper/lower casing, punctuations etc. So real world data would be a better choice – Distress 10/7, 2014 at 5:46

<root><s> The burglar used weapons like <ENAMEX TYPE="WEAPON">riffles</ENAMEX></s>.<s> Policemen are seen working in a jewellery store that was attacked using <ENAMEX TYPE="WEAPON">pistols</ENAMEX> .</s></root> – Chockablock 10/7, 2014 at 8:20

I want to identify weapons in an input.The above one is the small snippet of my training set. The problem is that sometimes it identifies some weapons and also if more than one weapon is there it will not identify that. – Chockablock 10/7, 2014 at 8:27

@chopu - no guarantee on the small data size. Lingpipe yahoo forum had one discussion on the training data set size. – Distress 10/7, 2014 at 19:51

@ArunAK It is my first ever project in this field, Would you please like to guide me on skype or email etc. Email: [email protected]. Your guidance will be appreciated. Thanks – Lupercalia 24/1, 2017 at 11:8

@ArunAK I used your above code and i get the following issue. "Muc6ChunkParser cannot be resolved to a type" – Lupercalia 25/1, 2017 at 10:18

@ArunAK How did you give the input in the text file? It is working fine now on the genetag example, but not on my custom input in the text file:
Master of Science in Biomedical Sciences EDUCATION
Major in Research EDUCATION
Bachelor of Science (B.S.) EDUCATION
Biomedical Sciences EDUCATION
PhD EDUCATION
Master EDUCATION
Graduated Nursing EDUCATION
Post-graduate degree EDUCATION
Bachelor EDUCATION
Bachelor's degree - RN EDUCATION
Master of Science (MSc) EDUCATION
– Lupercalia 30/1, 2017 at 13:58

The answer is basically given in your quoted example, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding, and assume that adjacent tokens of the same class are part of the same entity. In many circumstances, this is almost always true, and keeps the models simpler. However, if you don't want to do that you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things:

Emma    B-PERSON
Woodhouse    I-PERSON

Then, adjacent tokens of the same category but not the same entity can be represented.
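To make the labelling concrete, here is a rough sketch that turns pre-segmented text into two-column rows (word, tab, label) matching the `map = word=0,answer=1` layout from the question. `IobEncoder` and `Segment` are hypothetical names for illustration, not Stanford NER classes, and tokens are split on whitespace, which will not exactly match the tokenizer the classifier uses (for instance around apostrophes). Whether the trained classifier keeps the B-/I- distinction depends on its configuration, which this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: emits "token<TAB>label" rows with IOB labels.
class IobEncoder {

    // One contiguous piece of a sentence; label is "O" for text outside any entity.
    static final class Segment {
        final String text;
        final String label;

        Segment(String text, String label) {
            this.text = text;
            this.label = label;
        }
    }

    static List<String> encode(List<Segment> segments) {
        List<String> rows = new ArrayList<>();
        for (Segment s : segments) {
            String trimmed = s.text.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] tokens = trimmed.split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                // The first token of an entity gets B-, the remaining tokens get I-.
                String tag = s.label.equals("O") ? "O" : (i == 0 ? "B-" : "I-") + s.label;
                rows.add(tokens[i] + "\t" + tag);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Segment> sentence = new ArrayList<>();
        sentence.add(new Segment("We watched", "O"));
        sentence.add(new Segment("I know what you did last summer", "MOVIE"));
        sentence.add(new Segment("Titanic", "MOVIE"));
        sentence.add(new Segment("back to back", "O"));
        for (String row : encode(sentence)) {
            System.out.println(row);
        }
    }
}
```

In the sample sentence the two movie titles are adjacent, so the second title starts with a fresh B-MOVIE label and stays a separate entity, which plain IO labels could not express.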

Handbook answered 10/7, 2013 at 3:40 Comment(3)

Thanks @Chris, Let me try creating a new model with this encoding format. – Distress 11/7, 2013 at 6:19

@ChristopherManning how do I enable IOB encoding in NER? Thx – Homoiousian 30/1, 2014 at 21:54

I provide a discussion of options for IOB encoding in my answer to this question: #21469582 – Handbook 23/2, 2014 at 3:58

I faced the same challenge of tagging n-gram phrases for the automotive domain. I was looking for an efficient keyword mapping that could be used to create training files at a later stage. I ended up using regexNER in the NLP pipeline, providing a mapping file with the regular expressions (n-gram component terms) and their corresponding labels. Note that no NER machine learning takes place in this case. Hope this information helps someone!
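As a sketch of how such a mapping file could be produced, the snippet below builds one line per phrase: the space-separated token patterns, a tab, then the label. `RegexNerMappingBuilder` is a hypothetical helper, not part of CoreNLP, and the tab-separated layout is my understanding of the basic RegexNER format, so please verify it against the documentation for your version. Each token is treated as a Java regular expression and is only checked for valid syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: formats phrases as "token patterns<TAB>label" lines.
class RegexNerMappingBuilder {

    static String toMappingLine(String phrase, String label) {
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("phrase must not be blank");
        }
        String[] tokens = trimmed.split("\\s+");
        for (String token : tokens) {
            // Throws PatternSyntaxException if a token is not a valid regex.
            Pattern.compile(token);
        }
        return String.join(" ", tokens) + "\t" + label;
    }

    public static void main(String[] args) {
        Map<String, String> phrases = new LinkedHashMap<>();
        phrases.put("I know what you did last summer", "MOVIE");
        phrases.put("Bachelor of (Arts|Science)", "EDUCATION");
        for (Map.Entry<String, String> e : phrases.entrySet()) {
            System.out.println(toMappingLine(e.getKey(), e.getValue()));
        }
    }
}
```

The patterns are matched against the pipeline's tokens, so a phrase has to be written the way the tokenizer splits it; a title such as Baby's day out may need the apostrophe part as its own token.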

Squire answered 4/10, 2016 at 4:40 Comment(0)
