Retrieve the span of an entity from one of its tokens in spaCy

Given a token which is part of a named entity with multiple tokens, is there a direct method to get the span of that entity?

For example, consider this sentence with one two-word named entity:

>>> doc = nlp("This year was amazing.")
>>> doc.ents
(This year,)
>>> doc[0].ent_type_
'DATE'
>>> doc[1].ent_type_
'DATE'

Say we take the first token ("This"): is it possible to retrieve the entity that it's part of? Maybe something like this:

>>> doc[0].ents_
(This year,)

I guess that sometimes a token can be part of more than one entity.

At the moment, I'm obtaining this by creating a reverse dictionary from indices to entity indices.
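
For reference, that workaround looks roughly like the sketch below. The helper names `token_to_entity_map` and `entity_for_token` are made up for illustration; the code relies only on `doc.ents`, the `start`/`end` token offsets of each entity span, and `token.i`.

```python
def token_to_entity_map(doc):
    """Map each token index inside an entity to that entity's position in doc.ents."""
    mapping = {}
    for ent_index, ent in enumerate(doc.ents):
        # ent.start is inclusive and ent.end is exclusive (token offsets).
        for token_index in range(ent.start, ent.end):
            mapping[token_index] = ent_index
    return mapping


def entity_for_token(doc, token, mapping):
    """Return the entity span containing `token`, or None if it is outside any entity."""
    ent_index = mapping.get(token.i)
    return None if ent_index is None else doc.ents[ent_index]
```

With the example above, `entity_for_token(doc, doc[0], token_to_entity_map(doc))` should give back the `This year` span. It works, but the map has to be rebuilt for every doc.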

Thanks!

Skive answered 20/4, 2019 at 4:51 Comment(0)
A
6

You can iterate over doc.ents and merge each entity into a single token, since named entities are Span objects. spaCy also ships with a handy component you can plug into your pipeline that takes care of this automatically:

import spacy

nlp = spacy.load("en_core_web_sm")  # or any other model
# spaCy v3+: built-in components are added by name.
# On spaCy v2: from spacy.pipeline import merge_entities; nlp.add_pipe(merge_entities)
nlp.add_pipe("merge_entities")
print([token.text for token in nlp("John Murphy lives in New York City")])
# ['John Murphy', 'lives', 'in', 'New York City']
Aztec answered 16/8, 2019 at 18:3 Comment(2)
Thank you! It doesn't provide a solution to what I was looking for, as I still want to preserve the original tokens. Nonetheless, it is very useful. - Skive
This was amazing! - Waterway
A
3

I think this is what you want:

def get_ent_from_token(token):
    # Use token.doc instead of a global doc; ent.end is exclusive.
    for ent in token.doc.ents:
        if ent.start <= token.i < ent.end:
            return ent
    return None  # token is not part of any entity
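
If you would rather not scan doc.ents at all, another option is to walk the token-level IOB tags outward from the token. This is only a sketch: the function name `entity_span_via_iob` is made up, and it assumes `token.ent_iob_` holds "B" for the first token of an entity and "I" for the tokens that continue it.

```python
def entity_span_via_iob(token):
    """Return the slice of the doc covering the entity that contains `token`, or None."""
    if token.ent_iob_ not in ("B", "I"):
        return None  # token is outside any entity
    doc = token.doc
    start = token.i
    # Walk left until we reach the token that begins the entity.
    while start > 0 and doc[start].ent_iob_ != "B":
        start -= 1
    end = token.i + 1
    # Walk right while the following tokens continue the same entity.
    while end < len(doc) and doc[end].ent_iob_ == "I":
        end += 1
    return doc[start:end]
```

Note that this returns a plain slice of the doc rather than the object stored in doc.ents, so read the label from `token.ent_type_` if you need it.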

P.S. I hope that moving forward the spaCy library will include more such basic utilities for converting back and forth between spans, tokens, entities, character offsets, token offsets, etc. I tend to waste a lot of time fussing with that kind of thing.

Armillas answered 10/1, 2020 at 20:43 Comment(0)
