Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format
I am using the Stanford NER in NLTK to find persons, locations, and organizations in sentences, and I can produce results like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]

Is it possible to chunk things together using it? What I want is something like this:

u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'

Thanks!

Mafala answered 23/12, 2014 at 22:46 Comment(0)
You can use the standard NLTK way of representing chunks using nltk.Tree. This might mean that you have to change your representation a bit.

What I usually do is represent NER-tagged sentences as lists of triplets:

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]

I do this when I use an external tool for NER tagging a sentence. Now you can transform this sentence into the NLTK representation:

from nltk import Tree


def IOB_to_tree(iob_tagged):
    root = Tree('S', [])
    for word, pos, ner in iob_tagged:
        if ner == 'O':
            # Untagged tokens stay as plain (word, pos) tuples.
            root.append((word, pos))
        elif len(root) and isinstance(root[-1], Tree) and root[-1].label() == ner:
            # Same NER tag as the previous chunk: extend that chunk.
            root[-1].append((word, pos))
        else:
            # Start a new chunk for this NER tag.
            root.append(Tree(ner, [(word, pos)]))

    return root


sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
print(IOB_to_tree(sentence))

The change in representation makes sense, since you need POS tags for NER tagging anyway.

The end result should look like:

(S
  (PERSON Andrew/NNP)
  is/VBZ
  part/NN
  of/IN
  the/DT
  (ORGANIZATION Republican/NNP Party/NNP)
  in/IN
  (LOCATION Dallas/NNP))
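Once you have the tree, the entities can be read back out by walking its top level: Tree nodes are the chunks, plain tuples are O tokens. A minimal sketch (the extract_entities helper is my own illustration, not part of NLTK):

```python
from nltk import Tree

def extract_entities(tree):
    # Collect (entity_text, label) pairs from top-level Tree nodes;
    # leaves() gives the (word, pos) tuples inside each chunk.
    return [(" ".join(word for word, pos in node.leaves()), node.label())
            for node in tree if isinstance(node, Tree)]

tree = Tree('S', [
    Tree('PERSON', [('Andrew', 'NNP')]),
    ('is', 'VBZ'), ('part', 'NN'), ('of', 'IN'), ('the', 'DT'),
    Tree('ORGANIZATION', [('Republican', 'NNP'), ('Party', 'NNP')]),
    ('in', 'IN'),
    Tree('LOCATION', [('Dallas', 'NNP')]),
])
print(extract_entities(tree))
# [('Andrew', 'PERSON'), ('Republican Party', 'ORGANIZATION'), ('Dallas', 'LOCATION')]
```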
Caustic answered 24/12, 2014 at 12:45 Comment(0)

It looks long, but it does the job:

ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
chunked, prev_tag = [], None
for word_pos in ner_output:
    word, pos = word_pos
    if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
        # Same entity tag as the previous token: merge into the last tuple.
        chunked[-1] += word_pos
    else:
        chunked.append(word_pos)
    prev_tag = pos

# Merged tuples have more than 2 items: join the words (even indices)
# and keep the last element as the tag.
clean_chunked = [(" ".join(wordpos[::2]), wordpos[-1])
                 if len(wordpos) != 2 else wordpos for wordpos in chunked]

print(clean_chunked)

[out]:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican Party', u'ORGANIZATION')]

For more details:

The first for-loop "with memory" achieves something like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')]

You'll notice that all named entities end up with more than 2 items in a tuple, and what you want are the words, i.e. 'Republican Party' from (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION'), so take the even-indexed elements:

>>> x = [0,1,2,3,4,5,6]
>>> x[::2]
[0, 2, 4, 6]
>>> x[1::2]
[1, 3, 5]

Then you also realize that the last element in the NE tuple is the tag you want, so you would do:

>>> x = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
>>> x[::2]
(u'Republican', u'Party')
>>> x[-1]
u'ORGANIZATION'

It's a little ad hoc and verbose, but I hope it helps. And here it is in a function. Blessed Christmas:

ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]


def rechunk(ner_output):
    chunked, prev_tag = [], None
    for word_pos in ner_output:
        word, pos = word_pos
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos

    clean_chunked = [(" ".join(wordpos[::2]), wordpos[-1])
                     if len(wordpos) != 2 else wordpos for wordpos in chunked]

    return clean_chunked


print(rechunk(ner_output))
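One limitation to be aware of: because merging keys only on the tag, two *different* adjacent entities of the same type get fused into one. A quick sketch of the failure mode, with the function repeated so the snippet runs standalone (the names are invented for illustration):

```python
# rechunk as above, with prev_tag initialized so the first token
# cannot raise a NameError.
def rechunk(ner_output):
    chunked, prev_tag = [], None
    for word_pos in ner_output:
        word, pos = word_pos
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos
    return [(" ".join(wp[::2]), wp[-1]) if len(wp) != 2 else wp
            for wp in chunked]

# Two different people side by side are merged into one "entity":
print(rechunk([('John', 'PERSON'), ('Mary', 'PERSON'),
               ('are', 'O'), ('friends', 'O')]))
# [('John Mary', 'PERSON'), ('are', 'O'), ('friends', 'O')]
```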
Coessential answered 23/12, 2014 at 23:17 Comment(4)
I changed chunked, pos = [], "" to chunked, pos, prev_tag = [], "", None, which I think makes more sense. :) But this is still a little awkward when dealing with two consecutive entities, for example: Person Person O. Thanks very much.Mafala
Consecutive NEs are rare, and I'm looking for them for some other work; if you find any, could you help post an example or two? =)Coessential
@Coessential say you have two comma-separated names: "PER1_NAME, PER2_NAME and someone else are good friends"?Trice
The comma will come in between the NE tags, and extracting consecutive NNP tags will still work.Coessential

This is actually coming in the next release of CoreNLP, under the name MentionsAnnotator. It likely won't be directly available from NLTK, though, unless the NLTK people wish to support it along with the standard Stanford NER interface.

In any case, for the moment you'll have to copy the code I've linked to (which uses LabeledChunkIdentifier for the dirty work) or write your own postprocessor in Python.

Shoat answered 23/12, 2014 at 23:14 Comment(0)

Here is another short implementation for grouping the Stanford NER results using the groupby iterator of itertools:

from itertools import groupby


def grouptags(tags, ignore="O", join=" "):
    # Group consecutive tokens that share the same NER class.
    for cls, group in groupby(tags, lambda t: t[1]):
        if ignore is None or cls != ignore:
            if join is None:
                entity = [word for word, _ in group]
            else:
                entity = join.join(word for word, _ in group)
            yield (cls, entity)

The function grouptags has two options:

  • ignore: specify class that is ignored and omitted from output (default: "O"). If None, all entities are returned.
  • join: specify character used for joining the parts (default: " "). If None, the parts are returned unjoined as a list.
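For completeness, a quick run on the question's input (function repeated so the snippet is standalone); note that the output pairs come back as (class, entity), the reverse of the input order:

```python
from itertools import groupby

def grouptags(tags, ignore="O", join=" "):
    # Group consecutive tokens that share the same NER class.
    for cls, group in groupby(tags, lambda t: t[1]):
        if ignore is None or cls != ignore:
            if join is None:
                entity = [word for word, _ in group]
            else:
                entity = join.join(word for word, _ in group)
            yield (cls, entity)

ner_output = [(u'Remaking', u'O'), (u'The', u'O'),
              (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
print(list(grouptags(ner_output)))
# [('ORGANIZATION', 'Republican Party')]
```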
Chau answered 14/6, 2016 at 8:13 Comment(0)