I have been playing with the NLTK toolkit. I run into this problem a lot and have searched for a solution online, but I haven't found a satisfying answer anywhere, so I am posting my question here.
Many times the NER doesn't tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.
Example:
Input:
Barack Obama is a great person.
Output:
Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])
whereas
Input:
Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.
Output:
Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])
Here Vice/NNP and President/NNP are left outside any chunk, while (Dick/NNP, Cheney/NNP) is correctly extracted as a single entity.
So I think that if nltk.ne_chunk is run first and two consecutive subtrees in its output each hold an NNP, there is a high chance that both refer to one entity.
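To make the idea concrete, here is a minimal sketch of that post-processing step. It assumes the input is a tree shaped like the ne_chunk output above. The function name merge_adjacent_entities and the fallback label "NE" are my own choices for illustration and are not part of NLTK.

```python
from nltk.tree import Tree


def merge_adjacent_entities(tree, fallback_label="NE"):
    """Merge runs of directly adjacent entity subtrees into one subtree.

    Hypothetical helper, not an NLTK function. It returns a new tree and
    leaves the input tree unchanged.
    """
    merged = []
    for node in tree:
        previous = merged[-1] if merged else None
        if isinstance(node, Tree) and isinstance(previous, Tree):
            # Two entity chunks with no token between them: treat as one.
            # If the labels disagree (e.g. PERSON next to ORGANIZATION),
            # use the generic fallback label instead of guessing.
            if previous.label() == node.label():
                label = previous.label()
            else:
                label = fallback_label
            merged[-1] = Tree(label, list(previous) + list(node))
        else:
            merged.append(node)
    return Tree(tree.label(), merged)


# The ne_chunk output from the first example, written out by hand.
chunked = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP')]),
    Tree('ORGANIZATION', [('Obama', 'NNP')]),
    ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.'),
])

result = merge_adjacent_entities(chunked)
print(result)
```

With this input, the PERSON and ORGANIZATION chunks end up as a single NE subtree holding both Barack and Obama. Two limits of the sketch: it will also merge two different entities that happen to be adjacent, and it only joins subtrees, so plain NNP tokens outside a chunk (like Vice and President in the second example) are left as they are.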
Any suggestions would be really appreciated. I am looking for flaws in my approach.
Thanks.