Named Entity Recognition with Regular Expression: NLTK

3

10

I have been playing with the NLTK toolkit. I come across this problem a lot and have searched online for a solution, but found no satisfying answer, so I am posting my question here.

Often the NER does not tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.

Example:

Input:

Barack Obama is a great person.

Output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

whereas

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

Here Vice/NNP, President/NNP and (Dick/NNP, Cheney/NNP) are extracted correctly.

So I think that if nltk.ne_chunk is run first and two consecutive trees both consist of NNPs, there is a high chance that they refer to one entity.
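The regular-expression side of this idea can be tried directly on POS tags. The following is only a minimal sketch, and it uses NLTK's `RegexpParser` (a chunker driven by tag patterns) rather than `RegexpTagger`. The tagged tokens are copied from the first output above, and the chunk label `NE` is an arbitrary choice for illustration.

```python
from nltk import RegexpParser

# Tagged tokens copied from the first example output above.
tagged = [('Barack', 'NNP'), ('Obama', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
          ('great', 'JJ'), ('person', 'NN'), ('.', '.')]

# Chunk grammar: one or more consecutive NNP tags form a single chunk.
# The label "NE" is an assumption of this sketch, not an NLTK default.
chunker = RegexpParser("NE: {<NNP>+}")
tree = chunker.parse(tagged)

# Collect the text of every NE chunk found by the grammar.
entities = [" ".join(token for token, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NE")]
print(entities)  # ['Barack Obama']
```

On the second example the same grammar would group Vice, President, Dick and Cheney into one chunk, since all four are consecutive NNPs, so a purely tag-based rule can also over-merge.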

Any suggestions would be really appreciated. I am looking for flaws in my approach.

Thanks.

Gourmandise answered 25/6, 2014 at 0:45 Comment(0)
17
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged named entities, in order of first appearance
    current_chunk = []     # texts of the adjacent NE subtrees seen so far

    for i in chunked:
        if isinstance(i, Tree):
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-NE token ends the current run of adjacent chunks.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []  # reset even when the entity is a duplicate

    # Flush a run that reaches the end of the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))

[out]:

['Barack Obama']

Note, however, that if adjacent chunks are not supposed to form a single NE, this combines multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it happens. When the NEs are not adjacent, the script above works fine:

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
Brownout answered 25/6, 2014 at 14:8 Comment(5)
Thanks for the pretty code, but do you see any flaw in combining consecutive NNPs into one named entity? - Gourmandise
I can't think of an example offhand, but I'm sure there will be adjacent NPs that are not supposed to be one NE. - Brownout
Thanks for the answer. I think one class of possible examples would contain ditransitive verbs, e.g. "He quoted Michelle Barack Obama", although such cases are certainly quite rare. - Amathiste
That sentence is a little odd though ;P Perhaps this is more natural: "He quoted Michelle and Barack Obama". - Brownout
"Did Barack Obama do a great job?" returns "Did Barack Obama". How do you solve it? - Intercom
6

There is a fencepost error to watch for in @alvas's approach. Make sure the check from the elif branch also runs once after the loop, so that you don't drop an NE that occurs at the end of the sentence. So:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged named entities, in order of first appearance
    current_chunk = []     # texts of the adjacent NE subtrees seen so far

    for i in chunked:
        if isinstance(i, Tree):
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []  # reset even when the entity is a duplicate
    # Same check once more, for an NE that ends the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama."
print(get_continuous_chunks(txt))
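
One way to sidestep the extra check after the loop is to group adjacent chunks with `itertools.groupby`, which closes the last run by itself. This is only a sketch of that alternative: the helper name `merge_adjacent_chunks` is made up here, and the tree is built by hand for illustration rather than produced by `ne_chunk`.

```python
from itertools import groupby
from nltk.tree import Tree

def merge_adjacent_chunks(chunked):
    """Join the leaves of adjacent NE subtrees into one string per run."""
    entities = []
    # groupby yields maximal runs of children that are (or are not) subtrees,
    # so a run at the very end of the sentence needs no special handling.
    for is_chunk, run in groupby(chunked, key=lambda node: isinstance(node, Tree)):
        if is_chunk:
            name = " ".join(token for subtree in run for token, pos in subtree.leaves())
            if name not in entities:
                entities.append(name)
    return entities

# Hand-built tree (an assumption for this sketch); it deliberately ends
# with an NE chunk to exercise the end-of-sentence case.
chunked = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP')]),
    Tree('ORGANIZATION', [('Obama', 'NNP')]),
    ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'),
    ('and', 'CC'), ('so', 'RB'), ('is', 'VBZ'),
    Tree('PERSON', [('Michelle', 'NNP'), ('Obama', 'NNP')]),
])
print(merge_adjacent_chunks(chunked))  # ['Barack Obama', 'Michelle Obama']
```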
Shear answered 7/12, 2016 at 21:8 Comment(0)
0

@alvas, great answer, it was really helpful. I have tried to capture your solution in a more functional way. It still needs improvement, though.

    from nltk import ne_chunk, pos_tag, word_tokenize

    def conditions(tree_node):
        # NE chunks have height 2 (a label directly above the tagged tokens).
        # The root 'S' also has height 2 when a sentence contains no NE, so skip it.
        return tree_node.height() == 2 and tree_node.label() != 'S'

    def continuous_entities(input_text, file_handle=None):
        # Note: Currently, the chunker categorizes only 2 'NNP' together.
        named_entities = []
        for line in input_text.split('\n'):
            chunked_data = ne_chunk(pos_tag(word_tokenize(line)))
            child_data = chunked_data.subtrees(filter=conditions)
            line_entities = [" ".join(token for token, pos in child.leaves())
                             for child in child_data]

            # Dump all entities to file for now, we will see how to go about that
            if file_handle is not None and line_entities:
                file_handle.write('\n'.join(line_entities) + '\n')
            named_entities.extend(line_entities)
        # Entities from all lines, not only the last one
        return named_entities

More conditions can be added to the conditions function to filter the subtrees further.
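
For instance, a label check could be combined with the height check. The snippet below is only a sketch: the set `WANTED_LABELS` is a made-up choice, and the tree is written by hand instead of coming from `ne_chunk`.

```python
from nltk.tree import Tree

# Hypothetical set of labels to keep; adjust to the labels your chunker emits.
WANTED_LABELS = {"PERSON", "ORGANIZATION"}

def conditions(tree_node):
    # Keep only NE chunks (height 2) that carry one of the wanted labels.
    return tree_node.height() == 2 and tree_node.label() in WANTED_LABELS

# Hand-built tree for illustration only.
chunked = Tree('S', [
    Tree('PERSON', [('Barack', 'NNP'), ('Obama', 'NNP')]),
    ('visited', 'VBD'),
    Tree('GPE', [('Paris', 'NNP')]),
    ('.', '.'),
])

names = [" ".join(token for token, pos in subtree.leaves())
         for subtree in chunked.subtrees(filter=conditions)]
print(names)  # ['Barack Obama']
```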

Kugler answered 27/4, 2016 at 0:32 Comment(0)
