In spaCy, is it possible to get the corresponding rule ID for each match in matches?

Asked 26/11, 2017 at 7:21 Answered 21/7 at 2:18

In spaCy 2.x, I use the Matcher to find specific tokens in my text corpus. Each rule has an ID ('class-1_0' for example). During parsing, I use the on_match callback to handle each match. Is there a way to retrieve the rule that produced the match directly in the callback?

Here is my sample code.

import spacy
from spacy.matcher import Matcher

txt = ("Aujourd'hui, je vais me faire une tartine au beurre "
       "de cacahuète, c'est un pilier de ma nourriture "
       "quotidienne.")

nlp = spacy.load('fr')

def on_match(matcher, doc, id, matches):
    span = doc[matches[id][1]:matches[id][2]]
    print(span)
    # find a way to get the corresponding rule without fuss

matcher = Matcher(nlp.vocab)
matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])
matcher.add('class-1_1', on_match, [{'LEMMA': 'beurre'}, {'LEMMA': 'de'}, {'LEMMA': 'cacahuète'}])

doc = nlp(txt)
matches = matcher(doc)

In this case, matches returns:

[(12071893341338447867, 9, 12), (4566231695725171773, 16, 17)]

12071893341338447867 is a unique ID based on the rule name (class-1_1 here, since tokens 9 to 12 are the three-token match). I cannot find the original rule name, even when I introspect matcher._patterns.

It would be great if someone can help me. Thank you very much.

Davies answered 26/11, 2017 at 7:21 Comment(0)

Yes – you can simply look up the ID in the StringStore of your vocabulary, available via nlp.vocab.strings or doc.vocab.strings. Going via the Doc is pretty convenient here, because you can do so within your on_match callback:

def on_match(matcher, doc, i, matches):
    # the third argument is the index of the current match in matches
    match_id = matches[i][0]
    string_id = doc.vocab.strings[match_id]

For efficiency, spaCy encodes all strings to integers and keeps a reference to the mapping in the StringStore lookup table. In spaCy v2.0, the integers are hash values, so they'll always match across models and vocabularies. For more details on this, see this section in the docs.
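As a rough end-to-end sketch of that lookup: the snippet below assumes spaCy v3 (where the signature is matcher.add(key, patterns, on_match=...)) and a blank French pipeline, so it matches on LOWER instead of LEMMA because no lemmatizer is loaded. The `found` list and the shortened sentence are illustrative only and not part of the original question.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3; a blank pipeline only tokenizes, so no model download is needed
nlp = spacy.blank("fr")
matcher = Matcher(nlp.vocab)
found = []  # hypothetical collector for (rule name, matched text)

def on_match(matcher, doc, i, matches):
    # i is the index of the current match; the rule hash is the first tuple item
    match_id, start, end = matches[i]
    rule_name = doc.vocab.strings[match_id]  # hash -> original string
    found.append((rule_name, doc[start:end].text))

matcher.add("class-1_0", [[{"LOWER": "pilier"}]], on_match=on_match)
matcher.add(
    "class-1_1",
    [[{"LOWER": "beurre"}, {"LOWER": "de"}, {"LOWER": "cacahuète"}]],
    on_match=on_match,
)

doc = nlp("Une tartine au beurre de cacahuète, c'est un pilier de ma nourriture.")
matches = matcher(doc)

# The mapping works in both directions: string -> hash -> string
rule_hash = nlp.vocab.strings["class-1_0"]
print(nlp.vocab.strings[rule_hash])
print(sorted(found))
```

The same lookup also works outside the callback, by resolving the first element of each tuple in the list returned by matcher(doc).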

Of course, if your classes and IDs are kinda cryptic anyways, the other answer suggesting integer IDs will work fine, too. Just keep in mind that those integer IDs you choose will likely also be mapped to some random string in the StringStore (like a word, or a part-of-speech tag or something). This usually doesn't matter if you're not looking them up and resolving them to strings somewhere – but if you do, the output may be confusing. For example, if your matcher rule ID is 99 and you're calling doc.vocab.strings[99], this will return 'VERB'.

Innes answered 28/11, 2017 at 5:21 Comment(2)

Thank you. I tested your answer, and it points in the right direction. But to get the string ID you need to use the integer-encoded rule ID stored in the match tuple, not the callback's third argument (which is the index of the match): string_id = doc.vocab.strings[matches[id][0]] Thanks again. – Davies 29/11, 2017 at 7:29

And thanks for your incredible achievement with spacy 2.0 :) – Davies 29/11, 2017 at 8:11

While writing my question, as often, I found the solution.

It's dead simple: instead of using a unicode rule ID like class-1_0, simply use an integer. The identifier will be preserved throughout the process.

matcher.add(1, on_match, [{'LEMMA': 'pilier'}])

The match then looks like this:

[(1, 16, 17),]

Davies answered 26/11, 2017 at 7:21 Comment(0)

In 2024, this is easier than ever for a few of the spaCy matchers. spaCy has added an argument named as_spans that you can pass when calling matchers whose matches map cleanly to spans (e.g. PhraseMatcher). Each span returned has an attribute called 'label_' which contains the label as a string, so you can simply do:

for span in matcher(doc, as_spans=True):
    print(f"Matched pattern {span.label_} based on text: {span}")

For example, if your pattern ID was BANANA when calling matcher.add, then, assuming the pattern just matches the string 'banana', your output will look like:

Matched pattern BANANA based on text: banana
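For completeness, here is a self-contained sketch of the loop above. It assumes spaCy v3, where as_spans is available and PhraseMatcher.add takes a list of Doc patterns; the example sentence and the `labels` list are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: spaCy v3; a blank English pipeline is enough for exact phrase matching
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("BANANA", [nlp.make_doc("banana")])

doc = nlp("I ate a banana today")

# With as_spans=True the matcher returns Span objects instead of (id, start, end) tuples
labels = []
for span in matcher(doc, as_spans=True):
    labels.append((span.label_, span.text))
    print(f"Matched pattern {span.label_} based on text: {span}")
```

The rule name comes back as span.label_, so no StringStore lookup is needed in this form.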

Since the question was about an old version of spaCy where this was not as easy to do, and the top answer (as I'm posting this) is correct for that version, I'm posting this as a separate answer. No one has to do it the old convoluted way anymore, however, so I feel like this gets to the core of the question better in 2024. :)

Petrifaction answered 21/7 at 2:18 Comment(0)
