spaCy: Set entity information for a token which is included in more than one span

Asked 8/4, 2021 at 22:0 Answered 19/8, 2021 at 15:16

I am trying to use spaCy for entity context recognition in the world of ontologies. I'm a novice at using spaCy and just playing around for starters.

I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.

Screenshot of my sample data:

The following is the workflow of my initial code:

Creating patterns and terms


    # Set terms and patterns
    terms = {}
    patterns = []
    for curie, name, category in envoTerms.to_records(index=False):
        if name is not None:
            terms[name.lower()] = {'id': curie, 'category': category}
            patterns.append(nlp(name))

Setup a custom pipeline


    @Language.component('envo_extractor')
    def envo_extractor(doc):
        
        matches = matcher(doc)
        spans = [Span(doc, start, end, label = 'ENVO') for matchId, start, end in matches]
        doc.ents = spans
        
        for i, span in enumerate(spans):
            span._.set("has_envo_ids", True)
            for token in span:
                token._.set("is_envo_term", True)
                token._.set("envo_id", terms[span.text.lower()]["id"])
                token._.set("category", terms[span.text.lower()]["category"])
        
        return doc
    
    # Getter function shared by the Doc and Span extensions
    def has_envo_ids(self, tokens):
        return any([t._.get("is_envo_term") for t in tokens])

##EDIT: #################################################################
    def resolve_substrings(matcher, doc, i, matches):
        # Get the current match and create tuple of entity label, start and end.
        # Append entity to the doc's entity. (Don't overwrite doc.ents!)
        match_id, start, end = matches[i]
        entity = Span(doc, start, end, label="ENVO")
        doc.ents += (entity,)
        print(entity.text)
#########################################################################

Implement the custom pipeline


    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab)
    #### EDIT: Added 'on_match' rule ################################
    matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
    nlp.add_pipe('envo_extractor', after='ner')

and the pipeline looks like this


    [('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
     ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
     ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
     ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
     ('envo_extractor', <function __main__.envo_extractor(doc)>),
     ('attribute_ruler',
      <spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
     ('lemmatizer',
      <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]

Set extensions


    # Set extensions to tokens, spans and docs
    Token.set_extension('is_envo_term', default=False, force=True)
    Token.set_extension("envo_id", default=False, force=True)
    Token.set_extension("category", default=False, force=True)
    Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
    Doc.set_extension("envo_ids", default=[], force=True)
    Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)

Now when I run the text 'tissue culture', it throws me an error:


    nlp('tissue culture')


    ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.

I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:

Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?

My SpaCy Info:


    ============================== Info about spaCy ==============================
    
    spaCy version    3.0.5                         
    Location         *irrelevant*
    Platform         macOS-10.15.7-x86_64-i386-64bit
    Python version   3.9.2                         
    Pipelines        en_core_web_sm (3.0.0)

Frulla asked 8/4, 2021 at 22:0 Comment(2)

Check this thread. – Keratin 8/4, 2021 at 22:1

I edited my original post to show code edits as per the discussion in the thread you referred to. I still get that error. – Frulla 8/4, 2021 at 23:33

It might be a little late, but to complement Sofie VL's answer, and for anyone who is still interested, here is what I (another spaCy newbie, lol) did to get rid of overlapping spans:

    import spacy
    from spacy.util import filter_spans

    # [Code to obtain 'ents' and 'doc']...
    # 'ents' should be a list of Span objects from 'doc' that may overlap,
    # e.g. the spans covering "Carolina" and "North Carolina".

    pat_orig = len(ents)
    filtered = filter_spans(ents)  # THIS DOES THE TRICK
    pat_filt = len(filtered)
    doc.ents = filtered

    print("\nCONVERSION REPORT:")
    print("Original number of spans:", pat_orig)
    print("Number of spans after overlap removal:", pat_filt)

Note that I am using the most recent version of spaCy as of this date, v3.1.1. Also, this only works if you do not mind overlapping spans being removed: filter_spans keeps the longest span of each overlapping group, and the first one when lengths tie. If you do mind, you might want to give this thread a look. More info regarding 'filter_spans' here.
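For anyone who wants to try it in isolation, here is a minimal sketch of the same idea. It assumes a blank English pipeline (so no model download is needed) and builds the two overlapping spans by hand; the sentence, the token indices and the "GPE" label are made up for illustration and are not from the original question.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

# Blank pipeline: tokenizer only (an assumption for this sketch)
nlp = spacy.blank("en")
doc = nlp("I moved to North Carolina last year")

# Two overlapping candidates built by hand:
# "Carolina" is token 4, "North Carolina" is tokens 3-4
candidates = [
    Span(doc, 4, 5, label="GPE"),
    Span(doc, 3, 5, label="GPE"),
]

# filter_spans drops the shorter span of each overlapping group
filtered = filter_spans(candidates)

# Safe now: no token belongs to more than one span
doc.ents = filtered

kept = [(span.text, span.label_) for span in doc.ents]
print(kept)
```

Assigning `candidates` directly to `doc.ents` is what triggers the E1010 error from the question, because token 4 would belong to two entities.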

Best regards.

Cruck answered 19/8, 2021 at 15:16 Comment(1)

filter_spans definitely did the trick! – Frulla 1/10, 2021 at 20:50

Since spaCy v3, you can use doc.spans to store spans that may overlap. doc.ents does not support overlapping spans.

So you have two options:

1. Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
2. Use doc.spans instead of doc.ents if that works for your use-case.
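As a rough sketch of the first option: the version below resolves conflicts inside the pipeline component itself rather than in an on_match callback, which is a simplification on my part. It uses spacy.util.filter_spans to decide which spans to keep. The term list, the component name envo_extractor_sketch and the example sentence are hypothetical stand-ins for the ENVO data.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # blank pipeline, so doc.ents starts out empty

# Hypothetical stand-in for the ENVO terms; "tissue" overlaps "tissue culture"
term_names = ["tissue", "tissue culture", "culture medium"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("ENVO", [nlp.make_doc(name) for name in term_names])


@Language.component("envo_extractor_sketch")
def envo_extractor_sketch(doc):
    matched = [Span(doc, start, end, label="ENVO") for _, start, end in matcher(doc)]
    # Matcher spans go first so they are preferred over existing entities
    # of the same length; filter_spans then drops every span that overlaps
    # a longer (or earlier, equally long) one.
    doc.ents = filter_spans(matched + list(doc.ents))
    return doc


nlp.add_pipe("envo_extractor_sketch")
doc = nlp("The tissue culture was grown overnight")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Keeping the longest span is only one possible policy. If you need a different rule (for example, always preferring the statistical NER result), replace the filter_spans call with your own conflict check.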

Athirst answered 15/4, 2021 at 16:0 Comment(3)

I am trying to render the doc using displacy.render(doc, style='ent'). If I use, doc.spans, the render will not highlight the entities I'm interested in, correct? – Frulla 16/4, 2021 at 20:8

@Sofie VL I get the following error "attribute 'spans' of 'spacy.tokens.doc.Doc' objects is not writable" when I run doc.spans = list(doc.ents) + [span] to update doc.ents. – Chlamydospore 31/10, 2021 at 9:17

@Salih: the documentation link I cited has some more details, but basically doc.spans is a dictionary grouping sets of spans to keys. So you need to do something like doc.spans["my_spans"] = list(doc.ents) + [span] – Athirst 2/11, 2021 at 9:25
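A minimal sketch of that dictionary-style assignment, assuming a blank English pipeline and hand-built spans; the key "envo", the sentence and the token indices are arbitrary choices for illustration:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The tissue culture was grown overnight")

# Overlapping spans are allowed in a span group, unlike in doc.ents
overlapping = [
    Span(doc, 1, 2, label="ENVO"),  # "tissue"
    Span(doc, 1, 3, label="ENVO"),  # "tissue culture"
]

# doc.spans behaves like a dict: assign a list of spans to a key of your choice
doc.spans["envo"] = overlapping

stored = [span.text for span in doc.spans["envo"]]
print(stored)
```

doc.ents is left untouched here, so the overlapping spans have to be read back from doc.spans["envo"].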
