How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?

I need to extract combinations of items from two lists using spaCy's Matcher in Python. Given the following two lists:

colors=['red','bright red','black','brown','dark brown']
animals=['fox','bear','hare','squirrel','wolf']

I match the sequences by the following code:

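import spacy
from spacy.matcher import Matcher

# Assumption: the pipeline is not shown in the question; a blank English
# pipeline is enough here because the patterns only use token text
nlp = spacy.blank("en")

# Separate one-word colors from the first/second words of two-word colors
first_color=[]
last_color=[]
only_first_color=[]
for color in colors:
    if ' ' in color:
        first_color.append(color.split(' ')[0])
        last_color.append(color.split(' ')[1])
    else:
        only_first_color.append(color)
matcher = Matcher(nlp.vocab)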

pattern1 = [{"TEXT": {"IN": only_first_color}},{"TEXT":{"IN": animals}}]
pattern2 = [{"TEXT": {"IN": first_color}},{"TEXT": {"IN": last_color}},{"TEXT":{"IN": animals}}]

# spaCy v2 signature; in spaCy v3 this is matcher.add("ANIMALS", [pattern1, pattern2])
matcher.add("ANIMALS", None, pattern1,pattern2)

doc = nlp('bright red fox met black wolf')

matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

It gives the output:

0 3 bright red fox
1 3 red fox
4 6 black wolf

How can I extract only 'bright red fox' and 'black wolf'? Should I change the pattern rules or post-process the matches?

Any thoughts are appreciated!

Viscose answered 7/8, 2020 at 12:37 Comment(2)
What is title? - Botanomancy
Sorry, it was from an old version of the code. Now edited. - Viscose

You may use spacy.util.filter_spans:

Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.

Python code:

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.start, span.end, span.text)

Output:

0 3 bright red fox
4 6 black wolf
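
If the match IDs need to survive the filtering (converting to Span objects as above drops them), the same "longest first, no overlaps" preference can be applied to the raw match tuples. This is only a sketch in plain Python: filter_matches is a hypothetical helper, not part of spaCy, and the ID 1 below is a placeholder for the hash the Matcher would return.

```python
def filter_matches(matches):
    """Keep the longest non-overlapping matches; earlier matches win ties.

    `matches` is a list of (match_id, start, end) tuples, the shape a
    Matcher returns. Hypothetical helper, not part of spaCy.
    """
    # Longest first; among equal lengths, the earlier start comes first
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    seen_tokens = set()
    for match_id, start, end in ordered:
        # Skip the match if any of its tokens is already covered
        if seen_tokens.isdisjoint(range(start, end)):
            kept.append((match_id, start, end))
            seen_tokens.update(range(start, end))
    # Restore document order
    return sorted(kept, key=lambda m: m[1])


# Placeholder ID 1 stands in for the hash of "ANIMALS"
matches = [(1, 0, 3), (1, 1, 3), (1, 4, 6)]
for match_id, start, end in filter_matches(matches):
    print(match_id, start, end)
```

With the three matches from the question, this keeps (0, 3) and (4, 6) and drops the nested (1, 3).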
Tother answered 7/8, 2020 at 14:7 Comment(2)
Thanks a lot! It's a really elegant solution. I'm new to spaCy and didn't know about spacy.util.filter_spans. It worked very well for me. - Viscose
In case anyone besides me wants to eliminate nested spans but keep overlapping spans: you can modify the code of filter_spans to use 'or' instead of 'and' in the if statement. - Delaminate
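
A rough illustration of that variant, written against plain (start, end) token offsets rather than Span objects. It assumes the check in filter_spans tests the first and last token of each span, as the comment implies; filter_nested is a hypothetical name and this is not spaCy's own code.

```python
def filter_nested(spans):
    """Drop spans nested inside a longer kept span; keep partial overlaps.

    `spans` is a list of (start, end) token offsets with an exclusive end.
    Hypothetical helper, not spaCy's implementation.
    """
    # Longest first; among equal lengths, the earlier start comes first
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # 'or' instead of 'and': one uncovered boundary token is enough
        # to keep the span (end - 1 because the end offset is exclusive)
        if start not in seen_tokens or end - 1 not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


print(filter_nested([(0, 3), (1, 3), (2, 5)]))
```

Here (1, 3) is nested in (0, 3) and is dropped, while (2, 5) only partially overlaps (0, 3) and is kept. One caveat of this shortcut: a span whose first and last tokens are covered by two different kept spans is also dropped, even though it is not nested in either.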

As of spaCy 3.0, Matcher.add accepts a greedy argument ("FIRST" or "LONGEST") that filters overlapping matches:

matcher.add("ANIMALS", [pattern1,pattern2], greedy="LONGEST")

With the patterns added this way, matcher(doc) returns:

0 3 bright red fox
4 6 black wolf

See: https://spacy.io/api/matcher#add

Craig answered 1/5, 2023 at 10:40 Comment(0)
