How to make spaCy case insensitive

2

6

How can I make spaCy case insensitive when finding the entity name?

Is there a code snippet that I should add, given that the questions could mention entities that are not capitalized?

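import spacy

# Assumption: the question does not say which model was loaded; en_core_web_sm is only a placeholder
nlp = spacy.load("en_core_web_sm")

def analyseQuestion(question):
    # Run the pipeline and return the recognised entities
    doc = nlp(question)
    entity = doc.ents

    return entity

print(analyseQuestion("what is the best seller of Nicholas Sparks "))
print(analyseQuestion("what is the best seller of nicholas sparks "))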

which gives

(Nicholas Sparks,)  
()
Carolyn answered 16/6, 2018 at 12:17 Comment(0)
D
0

This is old, but hopefully this will help anyone looking at similar problems.

You can use a truecaser to improve your results.

https://pypi.org/project/truecase/
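
A minimal sketch of how that could be wired into the function from the question, assuming the truecase package linked above is installed along with the NLTK data it relies on. The function name analyse_question_truecased and the truecaser parameter are made up for illustration; nlp is the loaded spaCy pipeline from the question.

```python
def analyse_question_truecased(question, nlp, truecaser=None):
    """Restore likely capitalization, then return the entities spaCy finds.

    `nlp` is a loaded spaCy pipeline; `truecaser` is any callable str -> str.
    """
    if truecaser is None:
        # Assumption: the truecase package is installed (pip install truecase)
        # together with the NLTK data it needs.
        import truecase
        truecaser = truecase.get_true_case

    # Re-capitalize first, so the NER model sees text closer to its training data
    restored = truecaser(question)
    return nlp(restored).ents
```

You would call it as analyse_question_truecased("what is the best seller of nicholas sparks", nlp). Whether the name is then detected still depends on what the truecaser restores and on the model, so treat it as something to try rather than a guarantee.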

Dilator answered 11/8, 2020 at 2:51 Comment(0)
E
-1

It is very easy. You just need to add a preprocessing step of question.lower() to your function:

def analyseQuestion(question):

    # Preprocess the question to make further analysis case-insensitive
    question = question.lower()

    doc = nlp(question)
    entity=doc.ents 

    return entity

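This solution is inspired by this code from the Rasa NLU library. However, for non-English (non-ASCII) text stored as a Python 2 byte string it might not work. In that case you can try:

question = question.decode('utf8').lower().encode('utf8')

In Python 3, str is already Unicode and has no decode method, so question.lower() alone is enough there.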

However, the NER module in spaCy depends to some extent on the case of the tokens, and you might see discrepancies because it is a statistically trained model. Refer to this link.

Exacerbate answered 28/7, 2018 at 9:34 Comment(4)
I'm not sure this answers the question. I think what the OP is looking for is a way to get an instance like (Nicholas Sparks,) detected even when the sentence (and the potential entities) are in lowercase. - Elver
@Elver Why not preprocess the sentence to lowercase, apply NER that was trained on lowercase data, recognize the entity, return the position of the entity in that sentence, and show the part of the original sentence based on that position? - Tarbox
@LoganYang Yes, the key being "NER that was trained on lowercase data," which is what the OP is looking for. The attempted solution in this post is identical to the (non-working) second example that the OP already tried and reported didn't work (where lowercased "nicholas sparks" is not detected). - Elver
I see what you mean. I thought this answer was about training on lowercase data, but it just converts to lowercase at prediction time. What the OP needs is to train on lowercase data in the first place. - Tarbox
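
The position-based idea from the comments above could be sketched roughly like this. It assumes nlp_lower is a pipeline whose NER was trained on lowercased text (a hypothetical model, not one of the stock spaCy models), and the function name is made up for illustration. Only the start_char and end_char offsets of each entity span are used.

```python
def entities_in_original_case(question, nlp_lower):
    """Run NER on the lowercased text and slice the entities out of the original."""
    lowered = question.lower()

    # str.lower() can change the length for a few Unicode characters
    # (for example "İ"), which would shift the character offsets.
    if len(lowered) != len(question):
        raise ValueError("lowercasing changed the text length; offsets would not line up")

    # Assumption: nlp_lower was trained on lowercase data, as the comments suggest
    doc = nlp_lower(lowered)

    # Map each entity back to the original casing via its character offsets
    return [question[ent.start_char:ent.end_char] for ent in doc.ents]
```

The length check is there because the mapping only works when lowercasing leaves every character at the same position.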
