I'm currently working on replacing a system based on NLTK entity extraction combined with regexp matching, where I have several named-entity dictionaries. The dictionary entities include both common types (e.g. PERSON, for employees) and custom types (e.g. SKILL). I want to use a pre-trained spaCy model and include my dictionaries somehow, to increase NER accuracy. Here are my thoughts on possible methods:
Use spaCy's Matcher API: iterate through the dictionary and add each phrase with an on_match callback that adds the entity?
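To make that concrete, something like the following is roughly what I'm picturing (an untested sketch: I used PhraseMatcher plus filter_spans rather than a raw on_match callback, the skills list is made up, and the function-style add_pipe call assumes spaCy 2):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# Placeholder dictionary; in practice this would be my SKILL list
skills = ["machine learning", "project management", "python"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("SKILL", [nlp.make_doc(term) for term in skills])

def skill_component(doc):
    # Turn every dictionary hit into a SKILL entity, keeping the
    # pre-trained entities wherever they don't overlap with a hit
    spans = [Span(doc, start, end, label="SKILL")
             for _, start, end in matcher(doc)]
    doc.ents = spacy.util.filter_spans(list(doc.ents) + spans)
    return doc

nlp.add_pipe(skill_component, after="ner")

doc = nlp("Jane has years of experience with machine learning.")
print([(ent.text, ent.label_) for ent in doc.ents])
```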
I've just found spacy-lookup, which seems like an easy way to provide long lists of words/phrases to match.
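If I'm reading its README right, usage would be something like this (untested sketch; the keyword list and label are placeholders, and again this is the spaCy 2-style add_pipe):

```python
import spacy
from spacy_lookup import Entity

nlp = spacy.load("en_core_web_sm")

# Made-up keyword list standing in for my SKILL dictionary
skill_entity = Entity(keywords_list=["machine learning", "python", "scrum"],
                      label="SKILL")
nlp.add_pipe(skill_entity, last=True)

doc = nlp("We need a scrum master with python experience.")
print(doc._.has_entities)
print([(token.text, token._.entity_desc) for token in doc if token._.is_entity])
```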
But what if I want to have fuzzy matching? Is there a way to add directly to the Vocab and thus have some fuzzy matching through Bloom filter / n-gram word vectors, or is there some extension out there that suits this need? Otherwise I guess I could copy spacy-lookup and replace the flashtext machinery with something else, e.g. Levenshtein distance.
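For the Levenshtein route, I imagine a custom component along these lines (pure sketch; it only handles single-token terms, and I'm using the standard library's difflib ratio as a stand-in for a real edit-distance library):

```python
import difflib
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# Single-token dictionary terms, lower-cased (placeholder data)
SKILL_TERMS = ["python", "kubernetes", "tensorflow"]

def fuzzy_skill_component(doc):
    spans = []
    for token in doc:
        # difflib's similarity ratio stands in for proper Levenshtein distance
        if difflib.get_close_matches(token.lower_, SKILL_TERMS, n=1, cutoff=0.85):
            spans.append(Span(doc, token.i, token.i + 1, label="SKILL"))
    doc.ents = spacy.util.filter_spans(list(doc.ents) + spans)
    return doc

nlp.add_pipe(fuzzy_skill_component, after="ner")

doc = nlp("Experience with Kubernets and TensorFlow is a plus.")  # note the typo
print([(ent.text, ent.label_) for ent in doc.ents])
```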
While playing around with spaCy I did try training the NER directly on single words from the dictionary (without any sentence context), and this did "work". But I would, of course, have to take great care to keep the model from catastrophically forgetting everything it already knows.
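For reference, this is roughly the shape of that experiment, but with proper sentence context and an example of the original labels mixed back in to counter the forgetting (spaCy 2-style update loop; the sentences and offsets are toy examples):

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("SKILL")

# Toy data: dictionary terms dropped into short sentence templates,
# plus one example with original entity types to reduce forgetting
TRAIN_DATA = [
    ("She has experience with machine learning",
     {"entities": [(24, 40, "SKILL")]}),
    ("John Smith works in London",
     {"entities": [(0, 10, "PERSON"), (20, 26, "GPE")]}),
]

other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35)
```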
Any help appreciated; I feel like this must be a pretty common requirement and would love to hear what's working best for people out there.