spaCy coreference resolution - named entity recognition (NER) to return unique entity IDs?

Perhaps I've skipped over a part of the docs, but what I am trying to determine is a unique ID for each entity in the standard NER toolset. For example:

import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

text = "This is a text about Apple Inc based in San Fransisco. "\
        "And here is some text about Samsung Corp. "\
        "Now, here is some more text about Apple and its products for customers in Norway"

doc = nlp(text)

for ent in doc.ents:
    # ent.label is the integer ID of the entity *type* (ORG, GPE, ...), not of the entity itself
    print('ID:{}\t{}\t"{}"\t'.format(ent.label, ent.label_, ent.text))


displacy.render(doc, jupyter=True, style='ent')

returns:

ID:381    ORG "Apple Inc" 
ID:382    GPE "San Fransisco" 
ID:381    ORG "Samsung Corp." 
ID:381    ORG "Apple" 
ID:382    GPE "Norway"

I have been looking at ent.ent_id and ent.ent_id_ but these are inactive according to the docs. I couldn't find anything in ent.root either.

For example, in GCP NLP each entity is returned with an ⟨entity⟩number that enables you to identify multiple instances of the same entity within a text.

This is a ⟨text⟩2 about ⟨Apple Inc⟩1 based in ⟨San Fransisco⟩4. And here is some ⟨text⟩3 about ⟨Samsung Corp⟩6. Now, here is some more ⟨text⟩8 about ⟨Apple⟩1 and its ⟨products⟩5 for ⟨customers⟩7 in ⟨Norway⟩9

Does spaCy support something similar? Or is there a way using NLTK or Stanford?

Organogenesis answered 12/12, 2018 at 19:57 Comment(3)
I don't completely understand what you are looking for. ent.label is the ID of the entity type (ORG, PERSON, GPE, etc.). There is no way for spaCy to understand that two names refer to the same entity instance, if that's your question. - Ballentine
Yeah, I know ent.label and its ID refer to the entity type. In CoreNLP and others, there are ways to do entity linking and/or coreference. The spaCy docs mention an ent.ent_id_ field, but they do not describe how to implement or populate it. - Organogenesis
OK, I see what you are asking for. No, unfortunately, coreference is not supported in spaCy yet. - Ballentine
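
For the example in the question, a rough workaround that needs no coreference model is to group mentions by label and normalized surface text. The sketch below is only a naive string heuristic, not entity linking: the `SUFFIXES` set and the `normalize` and `assign_entity_ids` helpers are hypothetical names made up for illustration. With spaCy, the `mentions` list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`; here it is written out by hand so the snippet runs on its own.

```python
import re

# Hypothetical list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "co"}


def normalize(text):
    """Lowercase, drop punctuation, and strip trailing corporate suffixes."""
    words = re.findall(r"\w+", text.lower())
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def assign_entity_ids(mentions):
    """Give mentions with the same (label, normalized text) the same ID."""
    ids = {}
    result = []
    for text, label in mentions:
        key = (label, normalize(text))
        if key not in ids:
            ids[key] = len(ids) + 1  # IDs are handed out in order of first appearance
        result.append((ids[key], label, text))
    return result


# Hand-written stand-in for [(ent.text, ent.label_) for ent in doc.ents]
mentions = [
    ("Apple Inc", "ORG"),
    ("San Fransisco", "GPE"),
    ("Samsung Corp.", "ORG"),
    ("Apple", "ORG"),
    ("Norway", "GPE"),
]

tagged = assign_entity_ids(mentions)
for ent_id, label, text in tagged:
    print('ID:{}\t{}\t"{}"'.format(ent_id, label, text))
```

This links "Apple Inc" and "Apple" only because they normalize to the same string. It will not connect pronouns, abbreviations, or aliases, which is where a real coreference or entity-linking component is needed.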

You can use the neuralcoref library to add coreference resolution to spaCy's models:

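# Load your usual spaCy English model.
# Note: neuralcoref was built against spaCy 2.x; the 'en' shortcut link does not exist in spaCy 3.
import spacy
nlp = spacy.load('en')

# Add NeuralCoref to spaCy's pipeline
import neuralcoref
neuralcoref.add_to_pipe(nlp)

# You're done. You can now use the NeuralCoref annotations like any other spaCy document annotation.
doc = nlp(u'My sister has a dog. She loves him.')

print(doc._.has_coref)       # whether any coreference was resolved in the document
print(doc._.coref_clusters)  # the clusters of mentions that refer to the same thing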

Find the installation and usage instructions here: https://github.com/huggingface/neuralcoref
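
To turn the clusters into per-mention IDs, one option is to flatten them into a table of (cluster ID, token start, token end, text). The sketch below assumes each cluster object exposes an index `i` and a list `mentions` of spans with `start`, `end` and `text`, which is how the neuralcoref README describes its Cluster objects; check this against the version you install. The `mention_table` helper is a hypothetical name, and the clusters here are hand-built stand-ins rather than real model output, so the snippet runs without spaCy or neuralcoref.

```python
from types import SimpleNamespace


def mention_table(clusters):
    """Flatten clusters into (cluster_id, start, end, text) rows, in document order."""
    rows = []
    for cluster in clusters:
        for mention in cluster.mentions:
            rows.append((cluster.i, mention.start, mention.end, mention.text))
    return sorted(rows, key=lambda row: row[1])


def fake_span(start, end, text):
    """Stand-in for a spaCy Span: token offsets plus the surface text."""
    return SimpleNamespace(start=start, end=end, text=text)


# Hand-built stand-ins for doc._.coref_clusters on
# 'My sister has a dog. She loves him.' (assumed clusters, not model output)
clusters = [
    SimpleNamespace(i=0, mentions=[fake_span(0, 2, "My sister"), fake_span(6, 7, "She")]),
    SimpleNamespace(i=1, mentions=[fake_span(3, 5, "a dog"), fake_span(8, 9, "him")]),
]

rows = mention_table(clusters)
for cluster_id, start, end, text in rows:
    print("entity {}\ttokens {}-{}\t{}".format(cluster_id, start, end, text))
```

With real output you would pass `doc._.coref_clusters` instead of the stand-ins. Every mention in the same cluster then carries the same ID, together with token offsets that can be matched against `doc.ents`.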

Prairie answered 8/7, 2020 at 8:4 Comment(2)
But what spaCy gives back is only the token text. Is there any way to get the entity IDs (or token IDs) as the output of the coreference resolver in spaCy? That would help in identifying coreference relations. - Mahratta
@AlapanKuila I did not understand your question well. What do you mean by "what spaCy gives back"? Please specify which spaCy module you are referring to. Also, what do you mean by entity IDs or token IDs, and by the output of the "coreference resolver"? Which neuralcoref attribute are you referring to? You can find the list of attributes here: github.com/huggingface/neuralcoref - Prairie
