Extract list of Persons and Organizations using Stanford NER Tagger in NLTK

Asked 5/6, 2015 at 10:49 Answered 13/10, 2018 at 10:9

Solved python nltk stanford-nlp named-entity-recognition

I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:

from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar') 
r=st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r)

the output is:

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

What I want is to extract from this list all persons and organizations in this form:

Rami Eid
Stony Brook University

I tried to loop through the list of tuples:

for x, y in r:
    if y == 'ORGANIZATION':
        print(x)

But this code prints each token of an entity on its own line:

Stony
Brook
University

With real data there can be more than one organization or person in a sentence. How can I mark the boundaries between different entities?

Mimamsa answered 5/6, 2015 at 10:49 Comment(4)

This may help – Friarbird 5/6, 2015 at 10:54

I am working with Python (NLTK), not Java. But maybe this can help me find a workaround. – Mimamsa 5/6, 2015 at 11:2

Thanks for this valuable answer, but currently I am looking for how to add words to the PERSON entity – Padraig 29/8, 2017 at 11:48

#36668840 Try this – Rumal 7/5, 2019 at 5:48

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012) does not chunk named entities. From the accepted answer:

Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)

You have the following options:

1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.
2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer but it does chunk entities. (It's a wrapper around an IOB named entity tagger.)
3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.
4. Train your own IOB named entity chunker (using the Stanford tools, or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.

Edit: If all you want is to pull out runs of continuous named entities (option 1 above), you should use itertools.groupby:

from itertools import groupby
for tag, chunk in groupby(netagged_words, lambda x:x[1]):
    if tag != "O":
        print("%-12s"%tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:

PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY

Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.
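Since the question only asks for persons and organizations, the same groupby idea can be wrapped in a small helper that filters by tag. This is only a minimal sketch: the function name extract_entities and its wanted parameter are illustrative and not part of NLTK.

```python
from itertools import groupby

def extract_entities(netagged_words, wanted=("PERSON", "ORGANIZATION")):
    """Join runs of identically tagged tokens and keep only the wanted tags."""
    entities = []
    for tag, chunk in groupby(netagged_words, key=lambda pair: pair[1]):
        if tag in wanted:
            entities.append((" ".join(word for word, _ in chunk), tag))
    return entities

# The tagger output from the question.
netagged_words = [
    ('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
    ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
    ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION'),
]

for text, tag in extract_entities(netagged_words):
    print(text)
# Rami Eid
# Stony Brook University
```

The same caveat applies: adjacent entities of the same type are still merged into one.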

Tameratamerlane answered 5/6, 2015 at 11:7 Comment(6)

I tried the nltk chunker, but it didn't give the best results. The Stanford recognizer gives very good results, but there is this problem with multi-term entities that I have to solve. – Mimamsa 5/6, 2015 at 11:11

I understand that; otherwise I wouldn't have bothered mentioning the other alternatives. – Tameratamerlane 5/6, 2015 at 11:12

@Tameratamerlane What does this "%-12s"% mean? Is this regex? – Zambia 7/11, 2018 at 19:55

@sharp, no it's for aligning words of different lengths. See docs.python.org/3/library/stdtypes.html#old-string-formatting – Tameratamerlane 7/11, 2018 at 23:37

AttributeError: module 'nltk' has no attribute 'ne_recognize' – Bul 14/6, 2022 at 7:53

@DS_ShraShetty, looks like the nltk interface has changed in the past seven years, what a shock. ne_recognize has been renamed to ne_chunk, thanks for the heads-up. – Tameratamerlane 15/6, 2022 at 0:1

IOB/BIO stands for Inside, Outside, Beginning (IOB), also written as Beginning, Inside, Outside (BIO).

The Stanford NE tagger returns tags that can be read in IOB/BIO style (without explicit B-/I- prefixes), e.g.

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

The ('Rami', 'PERSON'), ('Eid', 'PERSON') pair is tagged as PERSON: "Rami" is the Beginning of an NE chunk and "Eid" is the Inside. Any non-NE token is tagged with "O".

The idea behind extracting continuous NE chunks is very similar to Named Entity Recognition with Regular Expression: NLTK, but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this:

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk: # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

print(named_entities)
print()
print(named_entities_str)
print()
print(named_entities_str_tag)
print()

[out]:

[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]

['Rami Eid', 'Stony Brook University', 'NY']

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

But please note the limitation: if two NEs are adjacent, with no "O" token between them, the result may be wrong. Nevertheless, I still can't think of any example where two NEs are adjacent without any "O" between them.
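One part of that limitation can be avoided cheaply. get_continuous_chunks only closes a chunk at an "O" tag, so two touching entities of different types would end up in the same chunk. A possible variant, sketched here under the hypothetical name get_typed_chunks, also closes the chunk whenever the tag changes. The "touching" sentence is hand-tagged for illustration and is not real tagger output.

```python
def get_typed_chunks(tagged_sent):
    """Group adjacent tokens that share the same non-"O" tag."""
    chunks = []
    current_tokens, current_tag = [], "O"
    for token, tag in tagged_sent:
        # Close the open chunk as soon as the tag changes (to "O" or to another type).
        if tag != current_tag and current_tokens:
            chunks.append((" ".join(current_tokens), current_tag))
            current_tokens = []
        if tag != "O":
            current_tokens.append(token)
        current_tag = tag
    # Flush the final chunk, if any.
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_tag))
    return chunks

# Hand-tagged example (an assumption): an ORGANIZATION directly followed by a LOCATION.
touching = [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
            ('University', 'ORGANIZATION'), ('NY', 'LOCATION')]

print(get_typed_chunks(touching))
# [('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
```

Two adjacent entities of the same type are still merged, because the plain tags carry no boundary information for that case.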

As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:

from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O": #O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O": # Begin NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag: # Inside NE
            bio_tagged_sent.append((token, "I-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag

    return bio_tagged_sent


def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]

    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
('in', 'O'), ('NY', 'LOCATION')]

ne_tree = stanfordNE2tree(ne_tagged_sent)

print(ne_tree)

[out]:

(S
  (PERSON Rami/NNP Eid/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
  in/IN
  (LOCATION NY/NNP))

Then:

ne_in_sent = []
for subtree in ne_tree:
    if isinstance(subtree, Tree):  # A subtree is an NE chunk; plain leaves are "O" tokens
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)

[out]:

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
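If the tree itself is not needed, the entities can also be read straight off BIO tags without NLTK. The following is a sketch that assumes tags of the form B-TYPE / I-TYPE / O, like the ones stanfordNE2BIO produces; bio_to_entities is a hypothetical helper name. The example input is hand-tagged to show what the B- prefix buys you: adjacent entities of the same type stay separate.

```python
def bio_to_entities(bio_tagged_sent):
    """Collect (entity_string, entity_type) pairs from (token, BIO-tag) pairs."""
    entities = []
    tokens, label = [], None
    for token, tag in bio_tagged_sent:
        prefix, _, ne_type = tag.partition("-")
        continues = prefix == "I" and ne_type == label
        # Anything that does not continue the open entity closes it.
        if not continues and tokens:
            entities.append((" ".join(tokens), label))
            tokens, label = [], None
        if prefix == "B" or (prefix == "I" and not continues):
            # "B-" opens an entity; a stray "I-" is treated as an opening too.
            tokens, label = [token], ne_type
        elif continues:
            tokens.append(token)
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities

# Hand-tagged input (an assumption, not tagger output): three adjacent cities.
cities = [("New", "B-LOCATION"), ("York", "I-LOCATION"),
          ("Boston", "B-LOCATION"), ("Baltimore", "B-LOCATION")]

print(bio_to_entities(cities))
# [('New York', 'LOCATION'), ('Boston', 'LOCATION'), ('Baltimore', 'LOCATION')]
```

Note that stanfordNE2BIO cannot produce this input from plain Stanford tags: it only sees four LOCATION tokens in a row, so it would emit one B- followed by three I- tags. The boundary information has to come from a model that was trained with IOB labels.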

Meticulous answered 5/6, 2015 at 12:47 Comment(10)

What if I have list of sentences, is it better to call the function for every sentence or to redefine the function? – Mimamsa 5/6, 2015 at 13:5

It's your choice. As long as you get the idea of how BIO works and how to extract BIs , it shouldn't be hard to get NEs from the Stanford outputs in any form ;) – Meticulous 5/6, 2015 at 13:8

Nice explanation of IOB, but this does not "work perfectly." It will over-chunk if there are two adjacent NEs of the same type (see "option 1" in my answer). From the question I thought that the OP understands this. – Tameratamerlane 5/6, 2015 at 14:11

The right way to do this in the nltk context is to change the tags to IOB ("B-PERSON" etc.), then use conlltags2tree() to turn them into a tree (with sublists for each named entity.) – Tameratamerlane 5/6, 2015 at 14:40

@Tameratamerlane the question that i've always been asking is whether there exists any NEs next to each other. I've been asking around and it seems like if NEs are adjacent, wouldn't it normally form a bigger NE. I've been figuring it out for quite some time but i haven't empirically tried anything. – Meticulous 5/6, 2015 at 14:52

nevertheless, the updated answer uses conlltags2tree() – Meticulous 5/6, 2015 at 15:39

About adjacent NEs: the whole point of inventing the IOB format is that it does happen. The CONLL2002 corpus has a bunch of them. (Mostly adjacent entities of the MISC type, but there are a few other types.) – Tameratamerlane 5/6, 2015 at 15:57

@Tameratamerlane thanks for the tip on CONLL2002. I took a look and there're examples like "The liberal minister of justice Marc Verwilghen is not a candidate on the local list of ..." where 'liberal minister of justice' and 'Marc Verwilghen' are adjacent. Interesting!! – Meticulous 5/6, 2015 at 16:6

Well, to be fair in that one both terms refer to the same person. But e.g. in "Mary Shelley's Frankenstein", there should be two NEs. – Tameratamerlane 5/6, 2015 at 16:16

@Tameratamerlane I believe that the 's in Mary Shelley's will act as a delimiter. – Exsanguine 23/4, 2017 at 14:33

This is not exactly what the question asks for, but maybe it can be of some help:

listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]


def parser(n, string):
    for i in listx[n]:
        if i == string:
            pass
        else:
            return i

name = parser(0,'PERSON')
lname = parser(1,'PERSON')
org1 = parser(5,'ORGANIZATION')
org2 = parser(6,'ORGANIZATION')
org3 = parser(7,'ORGANIZATION')


print(name, lname)
print(org1, org2, org3)

Output would be something like this

Rami Eid
Stony Brook University

Well answered 13/6, 2015 at 16:54 Comment(2)

good try but i don't think this is the case when you want to extract named entities out of an annotated data from Stanford NLP. – Meticulous 13/6, 2015 at 17:50

aah my bad, if the annotations are in string (eg. organization or person etc), this should produce the same output; name = parser(0,None) – Well 13/6, 2015 at 18:17

WARNING: Even if you get the model "all.3class.distsim.crf.ser.gz", please don't use it, because:

1st reason:

The Stanford NLP people have openly apologized for this model's bad accuracy.

2nd reason:

It has bad accuracy because it is case-sensitive.

SOLUTION

Use the model called "english.all.3class.caseless.distsim.crf.ser.gz"

Wavelet answered 13/10, 2018 at 10:9 Comment(0)

Use the pycorenlp wrapper from Python, then use 'entitymentions' as the key to get each continuous person or organization chunk as a single string.
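As a rough sketch of the extraction step: as far as I know, the JSON output of the CoreNLP server carries an entitymentions list per sentence, and each mention has text and ner fields. The snippet below works on a hand-written dictionary of that assumed shape, not on a live server response; sample_output and mentions_of are illustrative names, and the real dictionary would come from the wrapper's annotate call with JSON output requested.

```python
# Hand-written stand-in for a CoreNLP JSON response (assumed shape, trimmed to the
# fields used here).
sample_output = {
    "sentences": [
        {
            "entitymentions": [
                {"text": "Rami Eid", "ner": "PERSON"},
                {"text": "Stony Brook University", "ner": "ORGANIZATION"},
                {"text": "NY", "ner": "LOCATION"},
            ]
        }
    ]
}

def mentions_of(output, wanted=("PERSON", "ORGANIZATION")):
    """Return (text, ner) pairs for every entity mention with a wanted type."""
    return [
        (mention["text"], mention["ner"])
        for sentence in output.get("sentences", [])
        for mention in sentence.get("entitymentions", [])
        if mention.get("ner") in wanted
    ]

print(mentions_of(sample_output))
# [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION')]
```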

Eicher answered 19/6, 2018 at 11:57 Comment(0)

Try using the "enumerate" method.

When you apply NER to the list of words, you get a list of (word, type) tuples. Pass this list to enumerate(list), which assigns an index to every tuple in the list.

Later, when you extract the PERSON/ORGANIZATION/LOCATION tokens from the list, each one has its index attached:

1   Hussein
2   Obama
3   II
6   James
7   Naismith
21   Naismith
19   Tony
20   Hinkle
0   Frank
1   Mahan
14   Naismith
0   Naismith
0   Mahan
0   Mahan
0   Naismith

Now on the basis of the consecutive index a single name can be filtered out.

Hussein Obama II, James Naismith, Tony Hinkle, Frank Mahan
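The filtering step could look something like the sketch below. It assumes the indices come from enumerate over a single tagged sentence; group_consecutive is a hypothetical helper name.

```python
def group_consecutive(indexed_words):
    """Merge words whose indices are consecutive into one name."""
    names = []
    current = []
    previous_index = None
    for index, word in indexed_words:
        # A gap (or a jump backwards) in the indices ends the current name.
        if current and index != previous_index + 1:
            names.append(" ".join(current))
            current = []
        current.append(word)
        previous_index = index
    if current:
        names.append(" ".join(current))
    return names

# The tagger output from the question.
tagged = [
    ('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
    ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
    ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION'),
]

# Keep the index of every token tagged ORGANIZATION.
organizations = [(i, word) for i, (word, tag) in enumerate(tagged) if tag == 'ORGANIZATION']

print(group_consecutive(organizations))
# ['Stony Brook University']
```

If several sentences are processed, it is safer to run this per sentence, because indices restart at 0 and tokens from different sentences could otherwise look consecutive by accident.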

Misericord answered 9/10, 2018 at 12:55 Comment(0)
