Stanford NER with python NLTK fails with strings containing multiple "!!"s?
Suppose this is my filecontent:

When they are over 45 years old!! It would definitely help Michael Jordan.

Below is my code for tagging sentences.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
taggedsents = st.tag_sents(tokenized_sents)

I would expect both tokenized_sents and taggedsents to contain the same number of sentences.

But here is what they contain:

for ts in tokenized_sents:
    print "tok   ", ts

for ts in taggedsents:
    print "tagged    ",ts

>> tok    ['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!']
>> tok    ['It', 'would', 'definitely', 'help', '.']
>> tagged     [(u'When', u'O'), (u'they', u'O'), (u'are', u'O'), (u'over', u'O'), (u'45', u'O'), (u'years', u'O'), (u'old', u'O'), (u'!', u'O')]
>> tagged     [(u'!', u'O')]
>> tagged     [(u'It', u'O'), (u'would', u'O'), (u'definitely', u'O'), (u'help', u'O'), (u'Michael', u'PERSON'), (u'Jordan', u'PERSON'), (u'.', u'O')]

This is due to the double "!" at the end of the supposed first sentence. Do I have to remove double "!"s before using st.tag_sents()?

How should I resolve this?
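One workaround that fits the behaviour described above is to collapse runs of consecutive "!" tokens into a single "!!"-style token before calling st.tag_sents(), so the tagger's internal sentence splitter has nothing to split on. This is only a sketch: merge_exclamations is a helper name invented here, and it has not been verified against every Stanford NER version.

```python
def merge_exclamations(tokens):
    """Collapse runs of consecutive '!' tokens into one multi-'!' token."""
    merged = []
    for tok in tokens:
        # Extend the previous token if it is itself a run of '!' characters.
        if tok == '!' and merged and set(merged[-1]) == {'!'}:
            merged[-1] += '!'
        else:
            merged.append(tok)
    return merged

# Usage (hypothetical, with the tagger from the question):
# tokenized_sents = [merge_exclamations(sent) for sent in tokenized_sents]
# taggedsents = st.tag_sents(tokenized_sents)
```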

Scanderbeg answered 17/11, 2015 at 10:52
There are no named entities in your data. See en.wikipedia.org/wiki/Named-entity_recognition . Try a sentence like 'Michael Jordan went to Apple Inc. to buy an iPad Air for his daughter Layla Jordan'Verve
The sentence tokenization is a weird thing, so if you change ['!', '!'] to ['!!'], it should work. You're working with noisy data; Stanford tools are built on clean data, so they might not scale to every domain / genreVerve
It's not a problem of having no NEs (I have added an NE to the string, but the result is still the same).Scanderbeg
Yeah, so it's a weird problem with tokenization.Scanderbeg
No idea why it can't just use the sentences I passed to 'tag_sents()' without further tokenizing!Scanderbeg
Follow StanfordNLPHelp's instructions if you're not bound to using NLTK; otherwise, it will take some time to get an answer as to why the NLTK API doesn't work as you expect, and yet more time for NLTK to improve the API so that it keeps the tokenization provided by the user.Verve
You could replace the ! with a null character so that it does not fail.Devanagari
@RohanAmrute yes, but then there could be other characters that fail as wellScanderbeg
I think there is no fool-proof way of doing this. You have to test it by trial and error.Devanagari
If you follow my solution from the other question instead of using NLTK, you will get JSON that properly splits this text into two sentences.

Link to previous question: how to speed up NE recognition with stanford NER with python nltk
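For reference, a minimal Python 3 sketch of that server-plus-JSON approach using only the standard library. The annotator list, the localhost:9000 URL, and the function names are assumptions about your server setup, not part of the linked answer.

```python
import json
from urllib import parse, request

def corenlp_url(base="http://localhost:9000"):
    """Build a request URL asking a CoreNLP-style server for NER output as JSON."""
    props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
    return base + "/?" + parse.urlencode({"properties": json.dumps(props)})

def ner_sentences(text, base="http://localhost:9000"):
    """POST raw text to a running server and return its 'sentences' list,
    each sentence carrying its own tokens and NER labels."""
    req = request.Request(corenlp_url(base), data=text.encode("utf-8"))
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["sentences"]
```

Because the server does its own sentence splitting and tokenization in one pass, the two "!" tokens stay inside the first sentence instead of spawning a spurious one.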

Skiplane answered 17/11, 2015 at 16:23

© 2022 - 2024 — McMap. All rights reserved.