nltk StanfordNERTagger : How to get proper nouns without capitalization

Asked 23/12, 2015 at 15:47 Answered 24/12, 2015 at 21:44

Solved python nlp nltk stanford-nlp pos-tagger

I am trying to use the StanfordNERTagger and nltk to extract keywords from a piece of text.

import re

from nltk.corpus import stopwords
from nltk.tag.stanford import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."

# split on runs of non-word characters
words = re.split(r"\W+", docText)

stops = set(stopwords.words("english"))

# remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]

str = " ".join(words)
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
# keep only the tokens the POS tagger labels as proper nouns
stanfordPosTagList = [word for word, pos in stp.tag(str.split()) if pos == 'NNP']

print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged

This gives me:

John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]

So clearly, words like Short and Term were tagged as NNP. My data contains many such instances where non-NNP words are capitalized. This might be due to typos, or maybe they are headers. I don't have much control over that.

How can I parse or clean up the data so that I can detect a non-NNP term even though it may be capitalized? I don't want terms like Short and Term to be categorized as NNP.

Also, I'm not sure why John Donk was captured as a person but Brian Jones was not. Could it be due to the other capitalized non-NNPs in my data? Could they be affecting how the StanfordNERTagger treats everything else?

Update, one possible solution

Here is what I plan to do:

- Take each word and convert it to lower case.
- Tag the lower-cased word.
- If the tag is NNP, then the original word must also be an NNP.
- If not, then the original word was mis-capitalized.

Here is what I tried:

str = " ".join(words)
print str
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger') 
for word in str.split():
    wl = word.lower()
    print wl
    w,pos = stp.tag(wl)
    print pos
    if pos=="NNP":
        print "Got NNP"
        print w

But this gives me an error:

John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
  File "X:\crp.py", line 37, in <module>
    w,pos = stp.tag(wl)
ValueError: too many values to unpack

I have tried multiple approaches, but some error always shows up. How can I tag a single word?

I don't want to convert the whole string to lower case and then tag it. If I do that, the StanfordPOSTagger returns an empty string.

Charron answered 23/12, 2015 at 15:47 Comment(0)

Firstly, see your other question for how to set up Stanford CoreNLP so that it can be called from the command line or from Python: nltk : How to prevent stemming of proper nouns.

For the properly cased sentence, we see that the NER works properly:

>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner',  'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
... 
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O

And for the lower-cased sentence, you will get neither an NNP POS tag nor any NER tag:

>>> for token in annotated_sent1['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
... 
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O

So the questions to ask in response to your question should be:

- What is the ultimate aim of your NLP application?
- Why is your input lower-cased? Was it your doing, or is that how the data was provided?

After answering those questions, you can decide what you really want to do with the NER tags, i.e.

If the input is lower-cased because of how you structured your NLP tool chain, then:
- DO NOT do that!!! Perform the NER on the normal text, without the distortions you've created. The NER was trained on normal text, so it won't really work outside the context of normal text.
- Also, try not to mix NLP tools from different suites; they usually will not play nicely together, especially at the end of your NLP tool chain.
If the input is lower-cased because that's how the original data was, then:
- Annotate a small portion of the data, or find annotated data that was lower-cased, and then retrain a model.
- Or work around it: train a truecaser on normal text, then apply the truecasing model to the lower-cased text. See https://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
If the input has erroneous casing, e.g. `Some big Some Small but not all are Proper Noun`, then:
- Try the truecasing solution too.
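
To give a rough idea of what truecasing means, here is a minimal sketch of a frequency-based baseline. It is not the method from the linked paper, just the simplest thing that could work: learn the most common casing of each word from normally cased text, then apply it to badly cased input. The function names (train_truecaser, truecase) and the training sentences are made up for illustration, and the code is written for Python 3.

```python
import re
from collections import Counter, defaultdict


def train_truecaser(cased_text):
    """Count how often each word appears in each casing in normally cased text."""
    counts = defaultdict(Counter)
    for sentence in re.split(r"(?<=[.!?])\s+", cased_text):
        tokens = re.findall(r"\w+", sentence)
        # Skip the sentence-initial token: its capital letter is positional.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(text, counts):
    """Replace each known word with its most frequent casing; keep unknown words."""
    def restore(match):
        word = match.group(0)
        seen = counts.get(word.lower())
        return seen.most_common(1)[0][0] if seen else word
    return re.sub(r"\w+", restore, text)


# Hypothetical training text; in practice this would be a large, clean corpus.
training_text = (
    "Yesterday Brian Jones met John Donk. "
    "The short term metrics look fine. "
    "We think Brian Jones likes short term metrics."
)
counts = train_truecaser(training_text)
print(truecase("brian jones wants Short Term Metrics", counts))
# Brian Jones wants short term metrics
```

The truecased text can then be passed to the POS tagger and NER as usual. A unigram baseline like this cannot tell apart words that are legitimately cased both ways, and it lower-cases known words even at the start of a sentence, which is why real truecasers also look at context.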

Apfel answered 24/12, 2015 at 21:44 Comment(4)

thanks a lot for your help man :) as a follow up, what POS are proper nouns commonly surrounded with in the English language? – Charron 7/1, 2016 at 13:3

From the Penn Treebank tagset: NNP and NNPS (see ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) – Apfel 7/1, 2016 at 13:8

correct, but in a given piece of text, which POS tags are likely to be around proper nouns? is there such a likelihood? – Charron 7/1, 2016 at 13:15

Google hidden markov model and also go through the NLTK book www.nltk.org/book/. You'll know more after you've reached the last chapter. – Apfel 7/1, 2016 at 13:24
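
As a rough illustration of the follow-up question, one way to see which POS tags tend to sit next to proper nouns is simply to count them in tagged text. Counts of adjacent tag pairs like these are also the raw material from which an HMM tagger estimates its transition probabilities. The sketch below (Python 3) uses only the sixteen tagged tokens from the properly cased output in the answer above, so the numbers mean very little; neighbour_tag_counts is a made-up helper name, and a real estimate would need a large tagged corpus.

```python
from collections import Counter

# (word, POS) pairs copied from the properly cased CoreNLP output in the answer.
tagged = [
    ("John", "NNP"), ("Donk", "NNP"), ("works", "VBZ"), ("POI", "NNP"),
    ("Jones", "NNP"), ("wants", "VBZ"), ("meet", "VB"), ("Xyz", "NNP"),
    ("Corp", "NNP"), ("measuring", "VBG"), ("POI", "NN"), ("short", "JJ"),
    ("term", "NN"), ("performance", "NN"), ("metrics", "NNS"), (".", "."),
]

PROPER = {"NNP", "NNPS"}


def neighbour_tag_counts(tagged_tokens):
    """Count the tags found immediately before and after each proper noun."""
    before, after = Counter(), Counter()
    for i, (_, pos) in enumerate(tagged_tokens):
        if pos not in PROPER:
            continue
        if i > 0:
            before[tagged_tokens[i - 1][1]] += 1
        if i + 1 < len(tagged_tokens):
            after[tagged_tokens[i + 1][1]] += 1
    return before, after


before, after = neighbour_tag_counts(tagged)
print(before.most_common())  # tags seen just before a proper noun
print(after.most_common())   # tags seen just after a proper noun
```

On this tiny sample the most frequent neighbour of a proper noun is another proper noun (three times on each side), which just reflects multi-word names like John Donk and Xyz Corp.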

First, you should not use built-in names as variable names in your program. Avoid using str as a variable name, since that shadows the built-in str type; use newstring or anything else instead.

In your update, you are passing each lower-case word to the POS tagger as a plain string. The tag() method expects a list of tokens, so it treats a string as a sequence of characters and returns a POS tag for each character.

So I suggest you pass a list rather than a bare word to the tag() method. The list will contain only one word at a time.

You can try it like this: w = stp.tag([wl]). w will be a list containing a single (word, POS) tuple, so w[0][1] is the POS tag.

In this way you can tag a single word.

But in this case it gives the POS tag of john as NN.
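
To make the difference concrete without needing the Stanford jars, here is a small sketch (Python 3) that uses a stand-in tagger. StubTagger and tag_single_word are hypothetical names invented for this example; the stub only mimics the one behaviour that matters here, namely that tag() iterates over whatever sequence it is given, which as far as I can tell is also what StanfordPOSTagger effectively does.

```python
class StubTagger(object):
    """Hypothetical stand-in for StanfordPOSTagger: tags every token as NN."""

    def tag(self, tokens):
        # Iterate over whatever sequence is passed in, one tuple per item.
        return [(token, "NN") for token in tokens]


def tag_single_word(tagger, word):
    """Return the POS tag for one word by wrapping it in a one-item list."""
    tagged = tagger.tag([word])   # a list with one tuple, e.g. [('john', 'NN')]
    token, pos = tagged[0]        # unpack the single (word, tag) tuple
    return pos


stp = StubTagger()

# A bare string is iterated character by character, so "john" yields four
# tuples and "w, pos = stp.tag(wl)" fails with "too many values to unpack".
print(len(stp.tag("john")))

# Wrapping the word in a list yields exactly one (word, tag) tuple.
print(tag_single_word(stp, "john"))
```

With the real tagger you would pass your StanfordPOSTagger instance to tag_single_word instead of the stub. Calling the tagger once per word is slow, though, because each call starts a Java process.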

Moneyer answered 24/12, 2015 at 7:32 Comment(3)

thanks man! but how do i extract the NN? for each word i want to see the POS and do some processing. when i try to print stp.tag([wl.lower()])[1] it says index out of range. index [0] prints both the elements as (u'john', u'NN') – Charron 24/12, 2015 at 14:8

forget about it . i got this x=stp.tag([w.lower()]) y=x[0] print y[1] :) – Charron 24/12, 2015 at 14:11

Just do w[0][1] and you will get the POS of the word. Don't try to do everything in just one line. – Moneyer 25/12, 2015 at 5:32
