Replace entity with its label in SpaCy
U

3

15

Is there any way in spaCy to replace an entity detected by spaCy's NER with its label? For example: "I am eating an apple while playing with my Apple Macbook."

I have trained an NER model with spaCy to detect a "FRUITS" entity. The model correctly tags the first "apple" as "FRUITS" and does not tag the second "Apple".

I want to post-process my data by replacing each entity with its label, so the first "apple" should become "FRUITS". The sentence would then be "I am eating an FRUITS while playing with my Apple Macbook."

If I simply use a regex, it will also replace the second "Apple" with "FRUITS", which is incorrect. Is there a smart way to do this?

Thanks!

Untruthful answered 5/11, 2019 at 13:31 Comment(1)
Please post your code! - Saree
G
22

The entity label is available as an attribute of each token (ent_type_):

import spacy

# Requires the en_core_web_lg model to be installed
nlp = spacy.load('en_core_web_lg')

s = "His friend Nicolas is here."
doc = nlp(s)

print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', '.']

print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here .

Edit:

To handle cases where an entity spans several words, the following code can be used instead:

s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
for e in reversed(doc.ents): #reversed to not modify the offsets of other entities when substituting
    start = e.start_char
    end = start + len(e.text)
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
#His friend PERSON is here with PERSON and PERSON.
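
The reason for walking the entities from right to left can be seen without spaCy at all. The following is only a minimal sketch: it assumes the entity spans are already available as (start_char, end_char, label) tuples, and the helper name replace_spans and the hard-coded offsets are illustrative, not part of spaCy.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in text with its label."""
    # Process spans from the rightmost to the leftmost so that replacing one
    # span never shifts the offsets of spans that still need to be handled.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


sentence = "I am eating an apple while playing with my Apple Macbook."
# Hypothetical NER output: only the first "apple" (characters 15 to 20) is tagged.
spans = [(15, 20, "FRUITS")]
result = replace_spans(sentence, spans)
print(result)
```

Because the replacement is driven by character offsets rather than by the entity text, the untagged "Apple" is left alone, which is what a plain regex cannot guarantee.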

Update:

Jinhua Wang brought to my attention that there is now a more built-in and simpler way to do this using the merge_entities pipe. See Jinhua's answer below.

Gona answered 5/11, 2019 at 15:10 Comment(4)
Thanks! How can I avoid duplicated labels when the entity text is a phrase? For example, for "His friend Nicolas Blunt is here." I need "His friend PERSON is here." instead of "His friend PERSON PERSON is here." Thanks! - Untruthful
I added an edit to handle the case where entities span several words. Hope that helps! - Gona
This solution is amazing! - Oilcan
@Gona see my solution below for an update. - Oilcan
O
6

A more elegant modification to @DBaker's solution above when entities can span several words:

import spacy

nlp = spacy.load('en_core_web_lg')
# Merge each multi-word entity into a single token
nlp.add_pipe("merge_entities")

s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)

print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', 'with', 'PERSON', 'and', 'PERSON', '.']

print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here with PERSON and PERSON .

See the spaCy documentation for details. This approach uses the built-in merge_entities pipeline component and works well with multiprocessing. I believe it is the officially supported way to replace entities with their labels.
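
Joining tokens with " ".join also puts a space before punctuation ("here ."). One possible way to keep the original spacing is to append each token's own trailing whitespace (Token.whitespace_) instead. The sketch below assumes spaCy v3. To stay runnable without a trained model, it uses a blank English pipeline with an entity_ruler rule standing in for the asker's custom "FRUITS" NER model; that pattern and the helper name replace_entities are assumptions made for the example.

```python
import spacy

# Stand-in for a trained NER model: a blank English pipeline plus a rule that
# tags the exact lowercase token "apple" as FRUITS (assumption for the demo).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "FRUITS", "pattern": [{"TEXT": "apple"}]}])
# Merge each entity into a single token so multi-word entities yield one label.
nlp.add_pipe("merge_entities")


def replace_entities(doc):
    """Return the doc text with every entity token replaced by its label."""
    return "".join(
        (tok.ent_type_ if tok.ent_type_ else tok.text) + tok.whitespace_
        for tok in doc
    )


doc = nlp("I am eating an apple while playing with my Apple Macbook.")
result = replace_entities(doc)
print(result)
```

With a real trained pipeline, the blank pipeline and the ruler would be replaced by spacy.load(...) on that model; the replace_entities helper stays the same.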

Oilcan answered 16/7, 2021 at 6:22 Comment(3)
Why does it sometimes print a number? print([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) gives something like: [A, quick, 1501522819326771872, fox]. - Concinnate
t.ent_type_ prints numbers like this when the xx_ent_wiki_sm model is used. - Concinnate
It also splits on punctuation: "e-mail" becomes "e", "-", "mail", and "What's" becomes "What", "'s", etc. How can this be turned off? - Concinnate
C
0

A slightly shorter version of @DBaker's answer, which uses end_char instead of computing it:

text = doc.text  # start from the original string
for ent in reversed(doc.ents):  # right to left keeps earlier offsets valid
    text = text[:ent.start_char] + ent.label_ + text[ent.end_char:]
Calvados answered 28/1, 2021 at 15:10 Comment(0)
