NLTK named entity recognition in Dutch

I am trying to extract named entities from Dutch text. I used nltk-trainer to train a tagger and a chunker on the conll2002 Dutch corpus. However, the chunker's parse method does not detect any named entities. Here is my code:

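import nltk

# Avoid naming the variable "str", which would shadow the built-in type
text = 'Christiane heeft een lam.'

# Load the tagger and chunker trained with nltk-trainer
tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')

str_tags = tagger.tag(nltk.word_tokenize(text))
print(str_tags)

str_chunks = chunker.parse(str_tags)
print(str_chunks)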

And the output of this program:

[('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
(S Christiane/N heeft/V een/Art lam/Adj ./Punc)

I was expecting Christiane to be detected as a named entity. Any help?

Brookner answered 2/7, 2012 at 11:54 Comment(4)
What happens when "Christiane" appears in the middle of the sentence? – Yurikoyursa
@larsmans No entities either. I even tried with a sentence from the training corpus, but no luck. I used train_chunker.py on the conll2002 corpus (ned.train). – Brookner
Can you show exactly how you used train_chunker.py? My demo at text-processing.com/demo/tag recognizes Christiane. Of course, I used train_chunker on conll2002, so there must be a difference in the training arguments. – Darrin
@Darrin I ran python train_chunker.py conll2002. I also tried python train_chunker.py conll2002 --classifier Maxent, but after 40 minutes or so I got ValueError: setting an array element with a sequence. How did you train your classifier? – Brookner

The conll2002 corpus contains both Spanish and Dutch text, so make sure to use the fileids parameter, as in python train_chunker.py conll2002 --fileids ned.train. Training on both Spanish and Dutch will give poor results.

The default algorithm is a tagger-based chunker, which does not work well on conll2002. Instead, use a classifier-based chunker such as NaiveBayes. The full command might look like this (I've confirmed that the resulting chunker does recognize "Christiane" as a "PER"):

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle
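
Once a chunker does return entity chunks, it can be handy to list them as plain (text, label) pairs. The following is a minimal sketch rather than part of nltk-trainer: it assumes the parse tree has already been flattened into (word, pos, iob) triples, for example with nltk.chunk.tree2conlltags, and the helper name iob_to_entities and the sample triples are hypothetical, written by hand for illustration rather than copied from real chunker output.

```python
def iob_to_entities(triples):
    """Group (word, pos, iob) triples into (entity_text, label) pairs.

    Hypothetical helper: it expects IOB tags such as "B-PER", "I-PER" or "O",
    which is the tag style used by the conll2002 corpus.
    """
    entities = []
    words, label = [], None
    for word, _pos, iob in triples:
        prefix, _, kind = iob.partition("-")
        if prefix == "I" and words and kind == label:
            # Continuation of the entity that is currently open
            words.append(word)
            continue
        if words:
            # Any other tag closes the open entity
            entities.append((" ".join(words), label))
            words, label = [], None
        if prefix in ("B", "I"):
            # Start a new entity (a stray "I-" tag is treated as a start)
            words, label = [word], kind
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written sample triples (an assumption, not real chunker output)
sample = [
    ("Christiane", "N", "B-PER"),
    ("heeft", "V", "O"),
    ("een", "Art", "O"),
    ("lam", "N", "O"),
    (".", "Punc", "O"),
]
print(iob_to_entities(sample))  # [('Christiane', 'PER')]
```

If this returns an empty list for a sentence that clearly contains a name, the chunker produced only "O" tags, which matches the flat (S ...) tree shown in the question.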

Darrin answered 6/7, 2012 at 1:43 Comment(1)
I've reproduced the problem in question, and it occurs even if the tagger and chunker are trained only on ned.train. Moreover, the chunker seems unable to identify any NEs even on sentences from the training corpus with the gold POS tags. – Broider
