The main issue is how to load and combine pipeline components such that they use the same Vocab (nlp.vocab), since a pipeline assumes that all components share the same vocab; otherwise you can get errors related to the StringStore.
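To see what separate vocabs mean in practice, here's a minimal sketch (assuming both models below are installed): each spacy.load() call builds its own Vocab, so a string added to one StringStore is unknown to the other.

import spacy

nlp_en = spacy.load('en_core_web_sm')
nlp_de = spacy.load('de_core_news_sm')
# each pipeline has its own Vocab and its own StringStore
nlp_de.vocab.strings.add("Fahrradkette")
print("Fahrradkette" in nlp_de.vocab.strings) # True
print("Fahrradkette" in nlp_en.vocab.strings) # False: missing from this store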
You shouldn't try to combine pipeline components that were trained with different word vectors, but as long as the vectors are the same it's a question of how to load components from separate models with the same vocab.
There's no way to do this with spacy.load(), so I think the simplest option is to initialize a new pipeline component with the required vocab and reload the existing component into the new component by temporarily serializing it.
To have a short working demo with easily accessible models, I'll show how to add the German NER model from de_core_news_sm to the English model en_core_web_sm, even though it's not something you'd typically want to do:
import spacy # tested with v2.2.3
from spacy.pipeline import EntityRecognizer

text = "Jane lives in Boston. Jan lives in Bremen."

# load the English and German models
nlp_en = spacy.load('en_core_web_sm') # NER tags PERSON, GPE, ...
nlp_de = spacy.load('de_core_news_sm') # NER tags PER, LOC, ...

# the Vocab objects are not the same
assert nlp_en.vocab != nlp_de.vocab

# but the vectors are identical (because neither model has vectors)
assert nlp_en.vocab.vectors.to_bytes() == nlp_de.vocab.vectors.to_bytes()

# original English output
doc1 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc1.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Bremen', 'GPE')]

# original German output (the German model makes weird predictions for English text)
doc2 = nlp_de(text)
print([(ent.text, ent.label_) for ent in doc2.ents])
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]

# initialize a new NER component with the vocab from the English pipeline
ner_de = EntityRecognizer(nlp_en.vocab)

# reload the NER component from the German model by serializing
# without the vocab and deserializing using the new NER component
ner_de.from_bytes(nlp_de.get_pipe("ner").to_bytes(exclude=["vocab"]))

# add the German NER component to the end of the English pipeline
nlp_en.add_pipe(ner_de, name="ner_de")

# check that they have the same vocab
assert nlp_en.vocab == ner_de.vocab

# combined output (English NER runs first, German second)
doc3 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc3.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]
spaCy's NER components (EntityRuler and EntityRecognizer) are designed to preserve any existing entities, so the new component only adds Jan lives with the German NER tag PER and leaves all other entities as predicted by the English NER.
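If you want to double-check which tokens are already claimed, you can continue from the snippet above and inspect the token-level IOB flags (a quick sanity check, not part of the original demo): tokens marked "B" or "I" belong to an existing entity and are left alone by later NER components.

# continuing from the combined pipeline above
doc = nlp_en(text)
print([(t.text, t.ent_iob_, t.ent_type_) for t in doc])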
You can use options for add_pipe() to determine where the component is inserted in the pipeline. To add the German NER before the default English NER:
nlp_en.add_pipe(ner_de, name="ner_de", before="ner")
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]
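For completeness, the other placement options work the same way. Here's a sketch using an untrained placeholder component purely for illustration (the name extra_ner is my own, not from the original):

import spacy
from spacy.pipeline import EntityRecognizer

nlp = spacy.load('en_core_web_sm')
extra = EntityRecognizer(nlp.vocab) # untrained placeholder, never actually run
nlp.add_pipe(extra, name="extra_ner", first=True) # alternatives: last=True, after="tagger"
print(nlp.pipe_names) # ['extra_ner', 'tagger', 'parser', 'ner']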
All the add_pipe() options are in the docs: https://spacy.io/api/language#add_pipe
You can save the extended pipeline to disk as a single model so you can load it in one line with spacy.load() the next time:
nlp_en.to_disk("/path/to/model")
nlp_reloaded = spacy.load("/path/to/model")
print(nlp_reloaded.pipe_names) # ['tagger', 'parser', 'ner', 'ner_de']
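As a final sanity check (continuing from above), the reloaded pipeline should produce the same entities as before saving:

doc = nlp_reloaded(text)
print([(ent.text, ent.label_) for ent in doc.ents])
# should match the combined output above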