Is it possible to leave the token text true-cased, but force the lemmas to be lowercased? I am interested in this because I want to use the PhraseMatcher:
I run an input text through the pipeline and then search that text for matching phrases, where each search query can be case-sensitive or not. When I search by lemma, I'd like the search to be case-insensitive by default.
e.g.
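from spacy.matcher import PhraseMatcher

doc = nlp(text)
for query in queries:
    # Pick the token attribute to match on for this query
    if case1:
        attr = "LEMMA"   # match on lemmas
    elif case2:
        attr = "ORTH"    # match on the exact, case-sensitive text
    elif case3:
        attr = "LOWER"   # match on the lowercased text
    phrase_matcher = PhraseMatcher(nlp.vocab, attr=attr)
    # key is the match ID; query is a list of pattern Docs (spaCy v3 signature)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)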
In case 1, I expect matching to be case-insensitive. If spaCy had a way to enforce that lemmas are lowercased by default, this would be much more efficient than keeping multiple versions of the doc, one of which is forced to be all lowercase.
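To make the idea concrete, here is a rough sketch of one possible workaround: a small custom pipeline component that lowercases each token's lemma while leaving the token text alone. The component name `lowercase_lemmas` is hypothetical, and the lemmas are assigned by hand on a blank pipeline so the snippet runs without a trained model. I have not checked whether spaCy offers a built-in setting for this.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc


@Language.component("lowercase_lemmas")  # hypothetical component name
def lowercase_lemmas(doc):
    # Overwrite each lemma with its lowercased form; token.text is untouched.
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc


# Blank pipeline so the sketch needs no trained model. With a real pipeline,
# the component would presumably be added after the lemmatizer instead:
#   nlp.add_pipe("lowercase_lemmas", after="lemmatizer")
nlp = spacy.blank("en")

# Hand-assigned lemmas stand in for lemmatizer output (an assumption).
doc = lowercase_lemmas(
    Doc(nlp.vocab, words=["Apple", "Pies", "rock"], lemmas=["Apple", "Pie", "rock"])
)
pattern = lowercase_lemmas(
    Doc(nlp.vocab, words=["apple", "pie"], lemmas=["apple", "pie"])
)

matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
matcher.add("APPLE_PIE", [pattern])

# Collect (matched text, matched lemmas) for every hit.
matches = [
    (doc[start:end].text, doc[start:end].lemma_)
    for _, start, end in matcher(doc)
]
```

With this, a single doc keeps its original casing for "ORTH" queries while "LEMMA" queries compare lowercased lemmas, as long as the query docs go through the same component.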