Suppose this is my filecontent:
When they are over 45 years old!! It would definitely help Michael Jordan.
Below is my code for tagging sentences.
from nltk.tag.stanford import NERTagger
from nltk.tokenize import sent_tokenize, word_tokenize
st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
taggedsents = st.tag_sents(tokenized_sents)
I would expect both tokenized_sents and taggedsents to contain the same number of sentences.
But here is what they contain:
for ts in tokenized_sents:
    print "tok ", ts
for ts in taggedsents:
    print "tagged ", ts
>> tok ['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!']
>> tok ['It', 'would', 'definitely', 'help', '.']
>> tagged [(u'When', u'O'), (u'they', u'O'), (u'are', u'O'), (u'over', u'O'), (u'45', u'O'), (u'years', u'O'), (u'old', u'O'), (u'!', u'O')]
>> tagged [(u'!', u'O')]
>> tagged [(u'It', u'O'), (u'would', u'O'), (u'definitely', u'O'), (u'help', u'O'), (u'Michael', u'PERSON'), (u'Jordan', u'PERSON'), (u'.', u'O')]
This is due to having a double "!" at the end of the supposed first sentence. Do I have to remove double "!"s before using st.tag_sents()?
How should I resolve this?
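One possible workaround, sketched here under assumptions rather than taken from the NLTK documentation: if the tagger only re-splits sentences but keeps every token in the same order (as the output above suggests), the tagged tokens can be flattened and regrouped using the lengths of the original tokenized sentences. The helper name `regroup_like` is hypothetical, and the sketch raises an error instead of guessing when the token counts differ.

```python
from itertools import chain, islice


def regroup_like(original_sents, tagged_sents):
    """Regroup tagged tokens so they follow the original sentence boundaries.

    Assumption: the tagger changed only the sentence splits, not the
    number or order of tokens.
    """
    flat = list(chain.from_iterable(tagged_sents))
    expected = sum(len(sent) for sent in original_sents)
    if len(flat) != expected:
        raise ValueError(
            "token counts differ (%d vs %d); cannot realign" % (len(flat), expected)
        )
    tokens = iter(flat)
    # Take exactly len(sent) tagged tokens for each original sentence.
    return [list(islice(tokens, len(sent))) for sent in original_sents]
```

With this, `regroup_like(tokenized_sents, taggedsents)` would return one tagged list per entry in tokenized_sents, provided the assumption about token counts holds.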
[...] named entities in your data. See en.wikipedia.org/wiki/Named-entity_recognition . Try a sentence like 'Michael Jordan went to Apple Inc. to buy an iPad Air for his daughter Layla Jordan' – Verve
[...] ['!', '!'] to ['!!'], it should work. You're working with noisy data. Stanford tools are built on clean data, so it might not scale to any domain / genre – Verve
[...] ! with a null character so that it does not fail. – Devanagari
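The suggestion in the comments to turn ['!', '!'] into ['!!'] can be applied to the token lists before tagging. The following is a minimal sketch in plain Python; the helper name `collapse_repeated_punct` and the default set of marks are assumptions for illustration, and whether the Stanford tagger then keeps the sentence intact rests on the comment above, not on anything verified here.

```python
def collapse_repeated_punct(tokens, marks=("!", "?", ".")):
    """Merge runs of the same sentence-final punctuation token into one token.

    Example: [..., 'old', '!', '!'] becomes [..., 'old', '!!'].
    """
    merged = []
    for tok in tokens:
        # Extend the previous token only if it consists solely of this mark.
        if tok in marks and merged and set(merged[-1]) == {tok}:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged
```

It would be applied per sentence before tagging, for example `tokenized_sents = [collapse_repeated_punct(s) for s in tokenized_sents]`, followed by the existing call to st.tag_sents(tokenized_sents).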