Longest match only with spaCy PhraseMatcher

I have created a spaCy PhraseMatcher to match names in a document, following the tutorial. I want to use the resulting matches as additional training data to train a spaCy NER model. My patterns, however, contain both full names (e.g. 'Barack Obama') and last names ('Obama') as separate entries.

Hence, in a sentence that contains 'Barack Obama', both patterns match, resulting in overlapping matches. This overlap, however, triggers an exception when I try to use the data for training, e.g.:

ValueError: [E103] Trying to set conflicting doc.ents: '(19, 33, 'PERSON')' and '(29, 33, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

I've considered filtering out overlapping matches before using the data for training, but this seems like a very inefficient approach that would significantly increase processing time for large data.

Is there a way to set up a PhraseMatcher so that it returns only the longest match when matches overlap?

Nacre answered 29/11, 2019 at 13:1 Comment(0)

The PhraseMatcher doesn't have a built-in way to filter out overlapping matches while it's matching, but there is a utility function to filter overlapping matches afterwards: spacy.util.filter_spans(). It prefers the longest span and, if two overlapping spans are the same length, the one that starts earlier in the text.
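For illustration, here is a minimal sketch of how that could look for the case in the question. It assumes spaCy v3 (the list form of `matcher.add` is also accepted from v2.2.2 on), and the pattern list and example sentence are made up for the demo rather than taken from the asker's data.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# A blank pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")

# Hypothetical pattern list containing a full name and a bare last name.
names = ["Barack Obama", "Obama"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PERSON", [nlp.make_doc(name) for name in names])

doc = nlp("Yesterday Barack Obama gave a speech.")

# Each match is (match_id, start_token, end_token). Both patterns fire,
# so "Barack Obama" and "Obama" come back as overlapping spans.
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

# Keep only non-overlapping spans, preferring the longest one.
filtered = filter_spans(spans)

# Setting doc.ents no longer raises E103 because the overlap is gone.
doc.ents = filtered
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Barack Obama', 'PERSON')]
```

The filtering works on the handful of matches found per document, so it is a small post-processing step after matching rather than a second pass over the text.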

Trustful answered 29/11, 2019 at 14:0 Comment(3)
Quick question: What do you mean by 'the earlier span in the text'? - Tassel
The span that starts earlier in the text, so if you have overlapping spans from tokens 3-6 and 5-8, it would prefer the one from 3-6. - Trustful
Thanks @aab. It is clear now. So, if two spans have the same length and overlap at the same position, then filter_spans would attach the entity that was added first to the PhraseMatcher. - Tassel
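
To make the tie-break from the comments above concrete, here is a small sketch using the same token ranges (3-6 and 5-8). The sentence is a made-up one chosen so that each token's text matches its index. It only demonstrates the "longest first, then earliest start" rule and does not cover the case of two identical spans with different labels.

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
# Ten tokens, so token i is the word for the number i.
doc = nlp("zero one two three four five six seven eight nine")

early = doc[3:6]  # tokens 3-5, length 3
late = doc[5:8]   # tokens 5-7, length 3, overlaps `early` at token 5

# Same length: the span that starts earlier is kept, regardless of input order.
same_length = filter_spans([late, early])
print([(s.start, s.end) for s in same_length])
# [(3, 6)]

# Different length: the longer span wins even though it starts later.
longer_late = doc[5:9]  # tokens 5-8, length 4
by_length = filter_spans([early, longer_late])
print([(s.start, s.end) for s in by_length])
# [(5, 9)]
```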
