The main issue is how to load and combine pipeline components such that they use the same Vocab (nlp.vocab), since a pipeline assumes that all components share the same vocab; otherwise you can get errors related to the StringStore.
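To see what separate vocabs mean in practice, here's a minimal sketch (assuming both models below are installed): each spacy.load() call builds its own Vocab, so a string added to one StringStore is unknown to the other.

import spacy

nlp_en = spacy.load('en_core_web_sm')
nlp_de = spacy.load('de_core_news_sm')
# each pipeline has its own Vocab and its own StringStore
nlp_de.vocab.strings.add("Fahrradkette")
print("Fahrradkette" in nlp_de.vocab.strings) # True
print("Fahrradkette" in nlp_en.vocab.strings) # False: missing from this store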
You shouldn't try to combine pipeline components that were trained with different word vectors, but as long as the vectors are the same it's a question of how to load components from separate models with the same vocab.
There's no way to do this with spacy.load(), so I think the simplest option is to initialize a new pipeline component with the required vocab and reload the existing component into the new component by temporarily serializing it.
To have a short working demo with easily accessible models, I'll show how to add the German NER model from de_core_news_sm to the English model en_core_web_sm, even though it's not something you'd typically want to do:
import spacy # tested with v2.2.3
from spacy.pipeline import EntityRecognizer

text = "Jane lives in Boston. Jan lives in Bremen."

# load the English and German models
nlp_en = spacy.load('en_core_web_sm') # NER tags PERSON, GPE, ...
nlp_de = spacy.load('de_core_news_sm') # NER tags PER, LOC, ...

# the Vocab objects are not the same
assert nlp_en.vocab != nlp_de.vocab

# but the vectors are identical (because neither model has vectors)
assert nlp_en.vocab.vectors.to_bytes() == nlp_de.vocab.vectors.to_bytes()

# original English output
doc1 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc1.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Bremen', 'GPE')]

# original German output (the German model makes weird predictions for English text)
doc2 = nlp_de(text)
print([(ent.text, ent.label_) for ent in doc2.ents])
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]

# initialize a new NER component with the vocab from the English pipeline
ner_de = EntityRecognizer(nlp_en.vocab)

# reload the NER component from the German model by serializing
# without the vocab and deserializing using the new NER component
ner_de.from_bytes(nlp_de.get_pipe("ner").to_bytes(exclude=["vocab"]))

# add the German NER component to the end of the English pipeline
nlp_en.add_pipe(ner_de, name="ner_de")

# check that they have the same vocab
assert nlp_en.vocab == ner_de.vocab

# combined output (English NER runs first, German second)
doc3 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc3.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]
spaCy's NER components (EntityRuler and EntityRecognizer) are designed to preserve any existing entities, so the new component only adds Jan lives with the German NER tag PER and leaves all other entities as predicted by the English NER.
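If you want to double-check which tokens are already claimed, you can continue from the snippet above and inspect the token-level IOB flags (a quick sanity check, not part of the original demo): tokens marked "B" or "I" belong to an existing entity and are left alone by later NER components.

# continuing from the combined pipeline above
doc = nlp_en(text)
print([(t.text, t.ent_iob_, t.ent_type_) for t in doc])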
You can use options for add_pipe() to determine where the component is inserted in the pipeline. To add the German NER before the default English NER:
nlp_en.add_pipe(ner_de, name="ner_de", before="ner")
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]
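For completeness, the other placement options work the same way. Here's a sketch using an untrained placeholder component purely for illustration (the name extra_ner is my own, not from the original):

import spacy
from spacy.pipeline import EntityRecognizer

nlp = spacy.load('en_core_web_sm')
extra = EntityRecognizer(nlp.vocab) # untrained placeholder, never actually run
nlp.add_pipe(extra, name="extra_ner", first=True) # alternatives: last=True, after="tagger"
print(nlp.pipe_names) # ['extra_ner', 'tagger', 'parser', 'ner']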
All the add_pipe() options are in the docs: https://spacy.io/api/language#add_pipe
You can save the extended pipeline to disk as a single model so you can load it in one line with spacy.load() the next time:
nlp_en.to_disk("/path/to/model")
nlp_reloaded = spacy.load("/path/to/model")
print(nlp_reloaded.pipe_names) # ['tagger', 'parser', 'ner', 'ner_de']
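As a final sanity check (continuing from above), the reloaded pipeline should produce the same entities as before saving:

doc = nlp_reloaded(text)
print([(ent.text, ent.label_) for ent in doc.ents])
# should match the combined output above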