Case-sensitive entity recognition

I have keywords that are all stored in lower case, e.g. "discount nike shoes", that I am trying to perform entity extraction on. The issue I've run into is that spaCy seems to be case-sensitive when it comes to NER. Mind you, I don't think that this is spaCy-specific.

When I run...

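import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; the question does not name one
doc = nlp("i love nike shoes from the uk")

# Print each entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)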

... nothing is returned.

When I run...

doc = nlp("i love Nike shoes from the Uk")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

I get the following results...

Nike 7 11 ORG
Uk 25 27 GPE

Should I just title case everything? Is there another workaround that I could use?

Kata answered 30/5, 2019 at 19:5 Comment(2)
I guess a named entity always starts with an uppercase letter. It's a syntax rule. en.wikipedia.org/wiki/Named-entity_recognition - Peculation
Not exactly true, wilfried. NER systems usually use case as an important feature for structured prediction. Really, Ines brought up the important points. If you have a special use case, you may need to train with your own texts: spacy.io/usage/training#ner - Solano

spaCy's pre-trained statistical models were trained on a large corpus of general news and web text. This means that the entity recognizer has likely seen very few all-lowercase examples, because lowercase-only text is much less common in those sources. In English, capitalisation is also a strong indicator of a named entity (unlike German, where all nouns are typically capitalised), so the model probably pays a lot of attention to it.

If you're working with text that doesn't have proper capitalisation, you probably want to fine-tune the model to be less sensitive here. See the docs on updating the named entity recognizer for more details and code examples.

Producing the training examples will hopefully not be very difficult, because you can use existing annotations and datasets, or create a dataset using the pre-trained model, and then lowercase everything. For example, you could take text with proper capitalisation, run the model over it and extract all entity spans in the text. Next, you lowercase all the texts and update the model with the new data. Make sure to also mix in text with proper capitalisation, because you don't want the model to learn something like "Everything is lowercase now! Capitalisation doesn't exist anymore!".
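
A minimal sketch of the data-preparation step described above, in plain Python. It assumes the annotations are already available as (text, {"entities": [(start, end, label)]}) tuples, the offset format used in spaCy's training docs; the helper name add_lowercased_copies is hypothetical. Each example is kept in its original casing and a lowercased copy is added, so the model sees both.

```python
def add_lowercased_copies(examples):
    """Return the original examples plus a lowercased copy of each one.

    Character offsets are reused as-is, so a copy is only added when
    lowercasing does not change the length of the text (true for ASCII,
    but not guaranteed for every Unicode character).
    """
    augmented = []
    for text, annotations in examples:
        augmented.append((text, annotations))  # keep the properly cased text
        lowered = text.lower()
        if lowered != text and len(lowered) == len(text):
            entities = list(annotations["entities"])
            augmented.append((lowered, {"entities": entities}))
    return augmented


# Hypothetical annotated example; in practice the spans could come from
# running the pre-trained model over properly capitalised text.
examples = [
    ("I love Nike shoes from the UK",
     {"entities": [(7, 11, "ORG"), (27, 29, "GPE")]}),
]
training_data = add_lowercased_copies(examples)
```

The resulting list can then be passed to whichever update loop the training docs for your spaCy version describe.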

Btw, if you have entities that can be defined using a list or set of rules, you might also want to check out the EntityRuler component. It can be combined with the statistical entity recognizer and will let you pass in a dictionary of exact matches or abstract token patterns that can be case-insensitive. For instance, [{"lower": "nike"}] would match one token whose lowercase form is "nike" – so "NIKE", "Nike", "nike", "NiKe" etc.
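
As a rough illustration of the rule-based approach, the sketch below adds an entity ruler with two case-insensitive token patterns. It assumes spaCy v3, where the component is added by its string name; in v2 (current when this answer was written) the EntityRuler class was constructed and added to the pipeline directly. The labels and patterns are examples only, and a blank pipeline is used so that no statistical model is required.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")

# spaCy v3 style (an assumption about the installed version).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Each pattern matches one token by its lowercase form.
    {"label": "ORG", "pattern": [{"LOWER": "nike"}]},
    {"label": "GPE", "pattern": [{"LOWER": "uk"}]},
])

doc = nlp("i love nike shoes from the uk")
found = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
for row in found:
    print(*row)
```

In a pipeline that also contains the statistical recognizer, the ruler would typically be added with before="ner" so that the rule-based matches are set first.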

Confession answered 1/6, 2019 at 10:8 Comment(1)
Great ideas, thank you! Also a huge fan of what you guys are doing with spaCy. :) - Kata

In general, non-standardized casing is problematic for pre-trained models.

You have a few workarounds:

  • Truecasing: correcting the capitalization in a text so you can use a standard NER model.
  • Caseless models: training NER models that ignore capitalization altogether.
  • Mixed-case models: training NER models on a mix of cased and uncased text.

I would recommend Truecasing, as there are some decent open-source truecasers out there with good accuracy, and they allow you to then use pre-trained NER solutions such as spaCy.
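
To make the idea concrete, here is a deliberately naive truecasing sketch: it learns the most frequent surface form of each word from a small cased corpus and applies it to lowercase input. This is only a unigram baseline for illustration, not one of the open-source truecasers mentioned above, and the function names and toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_truecaser(cased_sentences):
    """Map each lowercased token to its most frequent cased form."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        tokens = sentence.split()
        # Skip the first token: a sentence-initial capital says little
        # about how the word is normally written.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(text, table):
    """Restore casing token by token; unknown tokens are left unchanged."""
    return " ".join(table.get(tok.lower(), tok) for tok in text.split())


# Toy corpus with proper capitalisation (illustrative only).
corpus = [
    "She bought Nike shoes in the UK last year",
    "The UK store sells Nike and other brands",
    "He said the shoes from Nike are popular",
]
table = build_truecaser(corpus)
restored = truecase("i love nike shoes from the uk", table)
print(restored)
```

The restored text can then be passed to a standard cased NER model. A real truecaser would also use context, which matters for words that are only sometimes names.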

Caseless and mixed-case models are more time-consuming to set up and won't necessarily give better results.

Pestilential answered 3/6, 2019 at 13:48 Comment(2)
I can't devote too many resources to this project so I really like the idea of using the pre-trained truecaser. I will definitely be checking that out. Thanks! - Kata
@emma-jean Just keep in mind that truecasing can be tricky when it comes to organization names. You may need to train the truecaser specifically to better handle orgs. - Pestilential