I'm currently working on replacing a system based on NLTK entity extraction combined with regexp matching, where I have several named-entity dictionaries. The dictionary entities include both common types (e.g. PERSON, for employees) and custom types (e.g. SKILL). I want to use a pre-trained spaCy model and include my dictionaries somehow, to increase NER accuracy. Here are my thoughts on possible methods:
Use spaCy's Matcher API: iterate through the dictionary and add each phrase with an on_match callback that adds the entity?
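To make that concrete, something like the following is roughly what I'm picturing (an untested sketch: I used PhraseMatcher plus filter_spans rather than a raw on_match callback, the skills list is made up, and the function-style add_pipe call assumes spaCy 2):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# Placeholder dictionary; in practice this would be my SKILL list
skills = ["machine learning", "project management", "python"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("SKILL", [nlp.make_doc(term) for term in skills])

def skill_component(doc):
    # Turn every dictionary hit into a SKILL entity, keeping the
    # pre-trained entities wherever they don't overlap with a hit
    spans = [Span(doc, start, end, label="SKILL")
             for _, start, end in matcher(doc)]
    doc.ents = spacy.util.filter_spans(list(doc.ents) + spans)
    return doc

nlp.add_pipe(skill_component, after="ner")

doc = nlp("Jane has years of experience with machine learning.")
print([(ent.text, ent.label_) for ent in doc.ents])
```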
I've just found spacy-lookup, which seems like an easy way to provide long lists of words/phrases to match.
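If I'm reading its README right, usage would be something like this (untested sketch; the keyword list and label are placeholders, and again this is the spaCy 2-style add_pipe):

```python
import spacy
from spacy_lookup import Entity

nlp = spacy.load("en_core_web_sm")

# Made-up keyword list standing in for my SKILL dictionary
skill_entity = Entity(keywords_list=["machine learning", "python", "scrum"],
                      label="SKILL")
nlp.add_pipe(skill_entity, last=True)

doc = nlp("We need a scrum master with python experience.")
print(doc._.has_entities)
print([(token.text, token._.entity_desc) for token in doc if token._.is_entity])
```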
But what if I want to have fuzzy matching? Is there a way to add directly to the Vocab and thus have some fuzzy matching through Bloom filter / n-gram word vectors, or is there some extension out there that suits this need? Otherwise I guess I could copy spacy-lookup and replace the flashtext machinery with something else, e.g. Levenshtein distance.
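For the Levenshtein route, I imagine a custom component along these lines (pure sketch; it only handles single-token terms, and I'm using the standard library's difflib ratio as a stand-in for a real edit-distance library):

```python
import difflib
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

# Single-token dictionary terms, lower-cased (placeholder data)
SKILL_TERMS = ["python", "kubernetes", "tensorflow"]

def fuzzy_skill_component(doc):
    spans = []
    for token in doc:
        # difflib's similarity ratio stands in for proper Levenshtein distance
        if difflib.get_close_matches(token.lower_, SKILL_TERMS, n=1, cutoff=0.85):
            spans.append(Span(doc, token.i, token.i + 1, label="SKILL"))
    doc.ents = spacy.util.filter_spans(list(doc.ents) + spans)
    return doc

nlp.add_pipe(fuzzy_skill_component, after="ner")

doc = nlp("Experience with Kubernets and TensorFlow is a plus.")  # note the typo
print([(ent.text, ent.label_) for ent in doc.ents])
```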
While playing around with spaCy I did try training the NER directly on single words from the dictionary (without any sentence context), and this did "work". But I would, of course, have to take great care to keep the model from catastrophically forgetting everything it already knows.
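For reference, this is roughly the shape of that experiment, but with proper sentence context and an example of the original labels mixed back in to counter the forgetting (spaCy 2-style update loop; the sentences and offsets are toy examples):

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("SKILL")

# Toy data: dictionary terms dropped into short sentence templates,
# plus one example with original entity types to reduce forgetting
TRAIN_DATA = [
    ("She has experience with machine learning",
     {"entities": [(24, 40, "SKILL")]}),
    ("John Smith works in London",
     {"entities": [(0, 10, "PERSON"), (20, 26, "GPE")]}),
]

other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35)
```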
Any help appreciated; I feel like this must be a pretty common requirement and would love to hear what's working best for people out there.