Stanford NER with python NLTK fails with strings containing multiple "!!"s?
Suppose this is my filecontent:

When they are over 45 years old!! It would definitely help Michael Jordan.

Below is my code for tagging sentences.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
taggedsents = st.tag_sents(tokenized_sents)

I would expect both tokenized_sents and taggedsents to contain the same number of sentences.

But here is what they contain:

for ts in tokenized_sents:
    print "tok   ", ts

for ts in taggedsents:
    print "tagged    ",ts

>> tok    ['When', 'they', 'are', 'over', '45', 'years', 'old', '!', '!']
>> tok    ['It', 'would', 'definitely', 'help', '.']
>> tagged     [(u'When', u'O'), (u'they', u'O'), (u'are', u'O'), (u'over', u'O'), (u'45', u'O'), (u'years', u'O'), (u'old', u'O'), (u'!', u'O')]
>> tagged     [(u'!', u'O')]
>> tagged     [(u'It', u'O'), (u'would', u'O'), (u'definitely', u'O'), (u'help', u'O'), (u'Michael', u'PERSON'), (u'Jordan', u'PERSON'), (u'.', u'O')]

This is due to the double "!" at the end of the supposed first sentence. Do I have to remove double "!"s before using st.tag_sents()?

How should I resolve this?
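One workaround that fits the behaviour described above is to collapse runs of consecutive "!" tokens into a single "!!"-style token before calling st.tag_sents(), so the tagger's internal sentence splitter has nothing to split on. This is only a sketch: merge_exclamations is a helper name invented here, and it has not been verified against every Stanford NER version.

```python
def merge_exclamations(tokens):
    """Collapse runs of consecutive '!' tokens into one multi-'!' token."""
    merged = []
    for tok in tokens:
        # Extend the previous token if it is itself a run of '!' characters.
        if tok == '!' and merged and set(merged[-1]) == {'!'}:
            merged[-1] += '!'
        else:
            merged.append(tok)
    return merged

# Usage (hypothetical, with the tagger from the question):
# tokenized_sents = [merge_exclamations(sent) for sent in tokenized_sents]
# taggedsents = st.tag_sents(tokenized_sents)
```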

Scanderbeg answered 17/11, 2015 at 10:52
There are no named entities in your data. See en.wikipedia.org/wiki/Named-entity_recognition . Try a sentence like 'Michael Jordan went to Apple Inc. to buy an iPad Air for his daughter Layla Jordan'Verve
The sentence tokenization is a weird thing, so if you change ['!', '!'] to ['!!'], it should work. You're working with noisy data; Stanford tools are built on clean data, so they might not scale to every domain / genreVerve
It's not a problem of having no NEs (I have added an NE to the string, but the result is still the same).Scanderbeg
Yeah, so it's a weird problem with tokenization.Scanderbeg
No idea why it can't just use the sentences I passed to 'tag_sents()' without further tokenizing!Scanderbeg
Follow StanfordNLPHelp's instructions if you're not bound to using NLTK; otherwise, it will take some time to get an answer as to why the NLTK API doesn't work as you expect, and yet more time for NLTK to improve the API so that it keeps the tokenization provided by the user.Verve
You could replace the ! with a null character so that it does not fail.Devanagari
@RohanAmrute yes, but then there could be other characters that fail as wellScanderbeg
I think there is no fool-proof way of doing this. You have to test it by trial and error.Devanagari
If you follow my solution from the other question instead of using NLTK, you will get JSON that properly splits this text into two sentences.

Link to previous question: how to speed up NE recognition with stanford NER with python nltk
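For reference, a minimal Python 3 sketch of that server-plus-JSON approach using only the standard library. The annotator list, the localhost:9000 URL, and the function names are assumptions about your server setup, not part of the linked answer.

```python
import json
from urllib import parse, request

def corenlp_url(base="http://localhost:9000"):
    """Build a request URL asking a CoreNLP-style server for NER output as JSON."""
    props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
    return base + "/?" + parse.urlencode({"properties": json.dumps(props)})

def ner_sentences(text, base="http://localhost:9000"):
    """POST raw text to a running server and return its 'sentences' list,
    each sentence carrying its own tokens and NER labels."""
    req = request.Request(corenlp_url(base), data=text.encode("utf-8"))
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["sentences"]
```

Because the server does its own sentence splitting and tokenization in one pass, the two "!" tokens stay inside the first sentence instead of spawning a spurious one.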

Skiplane answered 17/11, 2015 at 16:23

© 2022 - 2024 — McMap. All rights reserved.