I am trying to extract named entities from Dutch text. I used nltk-trainer to train a tagger and a chunker on the conll2002 Dutch corpus. However, the chunker's parse method is not detecting any named entities. Here is my code:
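import nltk

sentence = 'Christiane heeft een lam.'
# Load the tagger and chunker pickles produced by nltk-trainer.
tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')
# Tokenize and POS-tag the sentence, then chunk the tagged tokens.
str_tags = tagger.tag(nltk.word_tokenize(sentence))
print(str_tags)
str_chunks = chunker.parse(str_tags)
print(str_chunks)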
And the output of this program:
[('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
(S Christiane/N heeft/V een/Art lam/Adj ./Punc)
I was expecting Christiane to be detected as a named entity. Any help?
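To make the expectation concrete, here is a minimal sketch of the difference between what I got and what I hoped for, written in IOB form. It assumes (word, pos, iob) triples of the shape that nltk.chunk.tree2conlltags returns for a chunk tree. The helper name iob_to_entities and the B-PER label on Christiane are my own illustration, not output from the trained chunker.

```python
def iob_to_entities(triples):
    """Group (word, pos, iob) triples into (label, words) entity spans."""
    entities = []
    current_label, current_words = None, []
    for word, _pos, iob in triples:
        starts_new = iob.startswith('B-') or (
            iob.startswith('I-') and iob[2:] != current_label)
        if starts_new:
            # Close any open entity, then open a new one.
            if current_words:
                entities.append((current_label, current_words))
            current_label, current_words = iob[2:], [word]
        elif iob.startswith('I-'):
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # 'O' tag: the token is outside any entity.
            if current_words:
                entities.append((current_label, current_words))
            current_label, current_words = None, []
    if current_words:
        entities.append((current_label, current_words))
    return entities


# The flat tree from the output above: every token is outside a chunk.
actual = [('Christiane', 'N', 'O'), ('heeft', 'V', 'O'), ('een', 'Art', 'O'),
          ('lam', 'Adj', 'O'), ('.', 'Punc', 'O')]

# Hypothetical result I was hoping for (the PER label is an assumption).
expected = [('Christiane', 'N', 'B-PER'), ('heeft', 'V', 'O'), ('een', 'Art', 'O'),
            ('lam', 'Adj', 'O'), ('.', 'Punc', 'O')]

print(iob_to_entities(actual))    # []
print(iob_to_entities(expected))  # [('PER', ['Christiane'])]
```

An empty list for the actual output means the chunker put no token inside a chunk, which is what the flat (S ...) tree shows.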
python train_chunker.py conll2002. I also tried python train_chunker.py conll2002 --classifier Maxent but, after 40 minutes or so, got "ValueError: setting an array element with a sequence." How did you train your classifier? – Brookner