Force spaCy lemmas to be lowercase

Is it possible to leave the token text true-cased but force the lemmas to be lowercase? I am interested in this because I want to use the PhraseMatcher: I run an input text through the pipeline and then search that text for matching phrases, where each search query can be case-sensitive or not. When I search by lemma, I'd like the search to be case-insensitive by default.

For example:

from spacy.matcher import PhraseMatcher

doc = nlp(text)

for query in queries:
    # Pick the token attribute to match on for this query
    if case1:
        attr = "LEMMA"
    elif case2:
        attr = "ORTH"
    elif case3:
        attr = "LOWER"
    phrase_matcher = PhraseMatcher(nlp.vocab, attr=attr)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)

In case 1, I expect matching to be case-insensitive. If spaCy had a way to enforce lowercased lemmas by default, that would be much more efficient than keeping multiple versions of the doc, one of them forced to all-lowercase text.

Babineaux answered 9/11, 2020 at 20:23

This part of spaCy changes from version to version, and I last looked at lemmatization a few versions ago. So this solution might not be the most elegant one, but it is definitely a simple one:

# Create a pipe that converts lemmas to lower case:
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline (spaCy v2 API: the function itself is passed to add_pipe)
nlp.add_pipe(lower_case_lemmas, name="lower_case_lemmas", after="tagger")
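
The add_pipe call above follows the spaCy v2 API. In spaCy v3, add_pipe expects the registered name of a component rather than the function itself, so a rough v3 equivalent might look like the sketch below. Note that copy_text_to_lemma is a made-up stand-in for a real lemmatizer, included only so the example runs on a blank pipeline without a trained model; with a trained pipeline you would place lower_case_lemmas after its own lemmatizing component instead.

```python
import spacy
from spacy.language import Language


@Language.component("copy_text_to_lemma")
def copy_text_to_lemma(doc):
    # Hypothetical stand-in for a real lemmatizer (assumption, not part of
    # the original answer): the "lemma" is simply the token text.
    for token in doc:
        token.lemma_ = token.text
    return doc


@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    # Overwrite each lemma with its lowercased form; token.text is untouched.
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc


nlp = spacy.blank("en")  # tokenizer only, no trained model needed
nlp.add_pipe("copy_text_to_lemma")
nlp.add_pipe("lower_case_lemmas", after="copy_text_to_lemma")

doc = nlp("Apple and apple")
print([(token.text, token.lemma_) for token in doc])
```

The token text keeps its original casing, while every lemma comes out lowercased.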

You will need to figure out where in the pipeline to add it. The latest documentation mentions that the Lemmatizer uses POS tagging info, so I am not sure at what point it is called. Placing your pipe after the tagger is safe: all the lemmas should be assigned by then.
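
Wherever the step ends up, the query patterns have to go through the same lowercasing as the text, otherwise their lemmas will not line up. Here is a minimal sketch of the matching side, assuming a blank English pipeline and calling the functions directly instead of registering them as pipes. Again, copy_text_to_lemma is a hypothetical stand-in for a real lemmatizer, and process is just a helper name I made up for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only, no trained model needed


def copy_text_to_lemma(doc):
    # Hypothetical stand-in for a real lemmatizer (assumption):
    # the "lemma" is simply the token text.
    for token in doc:
        token.lemma_ = token.text
    return doc


def lower_case_lemmas(doc):
    # Same idea as the pipe above: lowercase the lemma, keep the text.
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc


def process(text):
    # Run text and patterns through exactly the same steps.
    return lower_case_lemmas(copy_text_to_lemma(nlp(text)))


doc = process("I bought an Apple laptop")
pattern = process("apple LAPTOP")

matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
matcher.add("PRODUCT", [pattern])

# Each match is (match_id, start, end); slice the doc to recover the text.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

The matched span is taken from the original doc, so it keeps its true casing even though the comparison was done on lowercased lemmas.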

Another option I can think of is to derive a custom lemmatizer from the Lemmatizer class and override its __call__ method, but this is likely to be quite invasive, as you will need to figure out how (and where) to plug in your own lemmatizer.

Alfreda answered 10/11, 2020 at 13:28
