Force spaCy lemmas to be lowercase

doc = nlp(text)
for query in queries:
    if case1:
        attr = "LEMMA"
    elif case2:
        attr = "ORTH"
    elif case3:
        attr = "LOWER"
    phrase_matcher = PhraseMatcher(self.vocab, attr=attr)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)

This part of spaCy changes from version to version, and the last time I looked at the lemmatization code was a few versions ago. So this solution might not be the most elegant one, but it is definitely a simple one:

# Create a pipe that converts lemmas to lower case:
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline (passing the function itself is the spaCy 2.x call style)
nlp.add_pipe(lower_case_lemmas, name="lower_case_lemmas", after="tagger")
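
On spaCy 3.x, add_pipe expects the string name of a registered component rather than the function itself, so the same idea would look roughly like the sketch below. It assumes spaCy 3.x is installed and uses a blank English pipeline so that no model download is needed. Since a blank pipeline has no lemmatizer, copy_text_to_lemma is a hypothetical stand-in that only exists to give the lowercasing step something to work on.

```python
import spacy
from spacy.language import Language


@Language.component("copy_text_to_lemma")
def copy_text_to_lemma(doc):
    # Hypothetical stand-in for a real lemmatizer: a blank pipeline has none,
    # so copy each token's text into its lemma.
    for token in doc:
        token.lemma_ = token.text
    return doc


@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    # Same logic as above: overwrite every lemma with its lowercase form.
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("copy_text_to_lemma")
# Components are referenced by their registered string names.
nlp.add_pipe("lower_case_lemmas", after="copy_text_to_lemma")

doc = nlp("Apple Pies")
lemmas = [token.lemma_ for token in doc]
```

With a pretrained 3.x pipeline you would drop the stand-in and, assuming the pipeline has a component named "lemmatizer", pass after="lemmatizer" instead. Check nlp.pipe_names to see what your pipeline actually contains.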

You will need to figure out where in the pipeline to add it. The latest documentation mentions that the Lemmatizer uses POS tagging info, so I am not sure at what point it is called. Placing your pipe after the tagger should be safe, since all the lemmas should have been assigned by then.

Another option I can think of is to derive a custom lemmatizer from the Lemmatizer class and override its __call__ method, but this is likely to be quite invasive, as you will need to figure out how (and where) to plug in your own lemmatizer.
