NLTK named entity recognition in dutch

About

Asked 2/7, 2012 at 11:54 Answered 6/7, 2012 at 1:43

Solved python nlp nltk named-entity-recognition

I am trying to extract named entities from dutch text. I used nltk-trainer to train a tagger and a chunker on the conll2002 dutch corpus. However, the parse method from the chunker is not detecting any named entities. Here is my code:

str = 'Christiane heeft een lam.'

tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')

str_tags = tagger.tag(nltk.word_tokenize(str))
print str_tags

str_chunks = chunker.parse(str_tags)
print str_chunks

And the output of this program:

[('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
(S Christiane/N heeft/V een/Art lam/Adj ./Punc)

I was expecting Christiane to be detected as a named entity. Any help?

Brookner answered 2/7, 2012 at 11:54 Comment(4)

What happens when "Christiane" appears in the middle of the sentence? – Yurikoyursa 2/7, 2012 at 13:19

@larsmans No entities either. I even tried with a sentence from the training corpus, but no luck. I used the train_chunker.py on the conll2002 corpus (ned.train) – Brookner 2/7, 2012 at 13:33

Can you show exactly how you used train_chunker.py? My demo at text-processing.com/demo/tag recognizes Christiane, of course I used train_chunker on conll2002, so there must be a difference in the training arguments. – Darrin 3/7, 2012 at 23:53

@Darrin I did python train_chunker.py conll2002 . I also tried python train_chunker.py conll2002 --classifier Maxent , but, after 40 minutes or so, got ValueError: setting an array element with a sequence. . How did you train your classifier? – Brookner 4/7, 2012 at 9:8

The conll2002 corpus has both spanish and dutch text, so you should make sure to use the fileids parameter, as in python train_chunker.py conll2002 --fileids ned.train. Training on both spanish and dutch will have poor results.

The default algorithm is a Tagger based Chunker, which does not work well on conll2002. Instead, use a classifier based chunker like NaiveBayes, so the full command might look like this (and I've confirmed that the resulting chunker does recognize "Christiane" as a "PER"):

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle

Darrin answered 6/7, 2012 at 1:43 Comment(1)

I've reproduced the problem in question, and it occurs even if the tagger and chunker are trained only on ned.train. Moreover, the chunker seems unable to identify any NEs even on the sentences from the training corpus with the gold POS-tags. – Broider 8/7, 2012 at 15:31

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags