In spaCy 2.x, I use the Matcher to find specific tokens in my text corpus. Each rule has an ID ('class-1_0'
for example). During parsing, I use the on_match callback
to handle each match. Is there a way to retrieve the rule that produced the match directly in the callback?
Here is my sample code.
import spacy
from spacy.matcher import Matcher

txt = ("Aujourd'hui, je vais me faire une tartine au beurre "
       "de cacahuète, c'est un pilier de ma nourriture "
       "quotidienne.")
nlp = spacy.load('fr')

def on_match(matcher, doc, id, matches):
    # matches[id] is the (match_id, start, end) tuple of the current match
    span = doc[matches[id][1]:matches[id][2]]
    print(span)
    # find a way to get the corresponding rule without fuss

matcher = Matcher(nlp.vocab)
matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])
matcher.add('class-1_1', on_match, [{'LEMMA': 'beurre'}, {'LEMMA': 'de'}, {'LEMMA': 'cacahuète'}])
doc = nlp(txt)
matches = matcher(doc)
In this case, matches returns:
[(12071893341338447867, 9, 12), (4566231695725171773, 16, 17)]
12071893341338447867 is a unique ID derived from the rule name (such as 'class-1_0'). I cannot find the original rule name, even with some introspection of matcher._patterns.
It would be great if someone can help me. Thank you very much.
The rule name can be looked up from the match_id inside the callback:
string_id = doc.vocab.strings[matches[id][0]]
Thanks again. – Davies
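For anyone landing here later, the lookup `doc.vocab.strings[matches[id][0]]` can be tried in isolation. This is only a minimal sketch: the helper `rule_name` is a hypothetical name, and the `matches` list is simulated by hand (the rule names are interned with `vocab.strings.add` on the assumption that this mirrors what the matcher does when a rule is added), so no language model is required.

```python
from spacy.vocab import Vocab


def rule_name(vocab, match):
    """Return the rule name for one (match_id, start, end) tuple.

    Hypothetical helper: it only wraps the string-store lookup.
    """
    match_id, _start, _end = match
    # Looking up an integer hash in the string store returns the original string
    return vocab.strings[match_id]


vocab = Vocab()

# Intern the rule names by hand; add() returns the 64-bit hash of each string
ids = [vocab.strings.add(name) for name in ('class-1_0', 'class-1_1')]

# Simulated matcher output: (match_id, start, end) tuples, not real matches
matches = [(ids[1], 9, 12), (ids[0], 16, 17)]

names = [rule_name(vocab, m) for m in matches]
print(names)
```

Inside a real on_match callback, the same lookup would be `doc.vocab.strings[matches[id][0]]`, since `doc.vocab` is the vocab the matcher was built with.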