Checklist
The following checklist focuses on runtime performance rather than training (i.e. on using existing config.cfg files loaded with the convenience wrapper spacy.load(), rather than training your own models and creating a new config.cfg file); however, most of the points still apply to training. This list is not comprehensive: the spaCy library is extensive and there are many ways to build pipelines and carry out tasks, so covering every case here is impractical. Even so, it is intended to be a handy reference and starting point.
Summary
- If more powerful hardware is available, use it.
- Use (optimally) small models/pipelines.
- Use your GPU if possible.
- Process large texts as a stream and buffer them in batches.
- Use multiprocessing (if appropriate).
- Use only necessary pipeline components.
- Save and load progress to avoid re-computation.
1. If more powerful hardware is available, use it.
CPU. At runtime, most of spaCy's work consists of CPU instructions that allocate memory, assign values to memory and perform computations, so processing speed is bound by the CPU rather than by RAM. Opting for a better CPU over more RAM is therefore the smarter choice in most situations. As a general rule, newer CPUs with higher frequencies, more cores/threads, more cache etc. will realise faster spaCy processing times. However, simply comparing these numbers between different CPU architectures is not useful. Instead, look at benchmarks like cpu.userbenchmark.com (e.g. i5-12600k vs. Ryzen 9 5900X) and compare the single-core and multi-core performance of prospective CPUs to find those that are likely to perform better. See Footnote (1) on hyperthreading & core/thread counts.
RAM. The practical consideration for RAM is capacity: larger texts require more memory, whereas speed and latency are less important. If you have limited RAM, disable NER and parser when creating your Doc for large input text (e.g. doc = nlp("My really long text", disable=["ner", "parser"])). If you require these parts of the pipeline, you'll only be able to process approximately 100,000 * available_RAM_in_GB characters at a time; if you don't, you'll be able to process more than this. Note that the default spaCy input text limit is 1,000,000 characters; this can be changed by setting nlp.max_length = your_desired_length.
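The rule of thumb above can be turned into a small helper for splitting long inputs. This is only a minimal sketch: max_chars_for_ram and split_text are hypothetical names (not part of spaCy), the 100,000 characters per GB figure is the estimate quoted above, and cutting on blank lines is an assumption about where your text can safely be split.

```python
def max_chars_for_ram(available_ram_gb: float, chars_per_gb: int = 100_000) -> int:
    """Estimate how many characters the parser/NER can handle in one Doc."""
    return int(available_ram_gb * chars_per_gb)


def split_text(text: str, max_chars: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the current chunk
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


budget = max_chars_for_ram(available_ram_gb=4)  # 4 GB is just an example value
chunks = split_text("First paragraph.\n\nSecond paragraph.\n\nThird.", max_chars=35)
```

The resulting chunks can then be passed to nlp.pipe() rather than raising nlp.max_length and processing the whole text in one go.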
GPU. If you opt to use a GPU, processing times can be improved for certain aspects of the pipeline which make use of GPU-based computations. See the section below on making use of your GPU. The same general rule as with CPUs applies here too: generally, newer GPUs with higher frequencies, more memory, larger memory bus widths, bigger bandwidth etc. will realise faster spaCy processing times.
Overclocking. If you're experienced with overclocking and have the correct hardware to be able to do it (adequate power supply, cooling, motherboard chipset), then another effective way to gain extra performance without changing hardware is to overclock your CPU/GPU.
2. Use (optimally) small models/pipelines.
When computation resources are limited, and/or accuracy is less of a concern (e.g. when experimenting or testing ideas), load spaCy pipelines that are efficiency focused (i.e. those with smaller models). For example:
import spacy

# Load a "smaller" pipeline for faster processing
nlp = spacy.load("en_core_web_sm")
# Load a "larger" pipeline for more accuracy
nlp = spacy.load("en_core_web_trf")
As a concrete example of the differences, on the same system, the CPU-optimized en_core_web_lg pipeline is able to process 10,014 words per second, whereas the transformer-based en_core_web_trf pipeline only processes 684. Remember that there is often a trade-off between speed and accuracy.
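Throughput depends heavily on your own hardware and texts, so it can be worth measuring it yourself. The following is a rough sketch rather than a rigorous benchmark: words_per_second and compare_pipelines are hypothetical helper names, and the count is of tokens (len(doc)), which includes punctuation and so differs slightly from a true word count.

```python
import time
from typing import Iterable


def words_per_second(nlp, texts: Iterable[str], batch_size: int = 1000) -> float:
    """Time one pass of nlp.pipe() over texts and return tokens processed per second."""
    start = time.perf_counter()
    n_tokens = sum(len(doc) for doc in nlp.pipe(texts, batch_size=batch_size))
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def compare_pipelines(names: list[str], texts: list[str]) -> dict[str, float]:
    """Load each installed pipeline by name and report its measured throughput."""
    import spacy  # imported lazily so the timing helper itself has no hard dependency

    return {name: words_per_second(spacy.load(name), texts) for name in names}
```

For instance, compare_pipelines(["en_core_web_sm", "en_core_web_trf"], texts) would return one figure per pipeline, provided both packages are installed. Loading time is deliberately excluded from the measurement.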
3. Use your GPU if possible.
Due to the nature of neural network-based models, their computations can be carried out efficiently on a GPU, leading to faster processing. For instance, the en_core_web_lg pipeline can process 10,014 vs. 14,954 words per second when using a CPU vs. a GPU.
spaCy can be installed for a CUDA-compatible GPU (i.e. Nvidia GPUs) by running pip install -U spacy[cuda] in the command prompt. Once a GPU-enabled spaCy installation is present, call spacy.prefer_gpu() or spacy.require_gpu() somewhere in your program before any pipelines have been loaded. Note that require_gpu() will raise an error if no GPU is available, whereas prefer_gpu() falls back to the CPU. For example:
import spacy
spacy.prefer_gpu() # Or use spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")
4. Process large texts as a stream and buffer them in batches.
When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts (the default batch size is 1000) and process the texts as a stream using nlp.pipe(). For example:
import spacy
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, batch_size=1000))
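Wrapping nlp.pipe() in list() keeps every Doc in memory at once. For very large inputs, it may be preferable to feed nlp.pipe() a generator and consume each Doc as it arrives. A minimal sketch of that idea, assuming one text per line in a UTF-8 file; iter_lines and count_tokens are hypothetical helper names, and the token count stands in for whatever per-Doc work you actually need.

```python
from pathlib import Path
from typing import Iterator


def iter_lines(path: Path) -> Iterator[str]:
    """Yield one non-empty line at a time so the whole file is never held in memory."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def count_tokens(nlp, path: Path, batch_size: int = 1000) -> int:
    """Stream a file through nlp.pipe() and handle each Doc as soon as it is ready."""
    total = 0
    # nlp.pipe() accepts any iterable of strings and yields Docs lazily, batch by batch
    for doc in nlp.pipe(iter_lines(path), batch_size=batch_size):
        total += len(doc)  # replace with your own per-Doc processing
    return total
```

Called as count_tokens(nlp, Path("texts.txt")), only about one batch of texts and Docs needs to be alive at any time.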
5. Use multiprocessing (if appropriate).
To make use of multiple CPU cores, spaCy includes built-in support for multiprocessing with nlp.pipe() using the n_process option. For example:
import spacy
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=4))
Note that each process requires its own memory. With the spawn start method (the default on Windows and macOS), model data has to be copied into memory for every new process, so the larger the model, the more overhead each process adds. For short tasks it is therefore recommended to use fewer processes with a larger batch size. For example:
import spacy
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=2, batch_size=2000)) # default batch_size = 1000
Finally, multiprocessing is generally not recommended on GPUs because GPU memory is limited.
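One way to avoid spawning processes that have nothing to do is to cap the process count by the number of batches. This is a heuristic sketch of my own, not a spaCy recommendation: pick_n_process is a hypothetical helper, and the right balance still depends on model size and text length.

```python
import math
import os
from typing import Optional


def pick_n_process(n_texts: int, batch_size: int = 1000,
                   max_processes: Optional[int] = None) -> int:
    """Cap the worker count by both the CPU count and the number of batches."""
    if max_processes is None:
        max_processes = os.cpu_count() or 1
    n_batches = math.ceil(n_texts / batch_size)
    # Never return fewer than one process, even for an empty input
    return max(1, min(max_processes, n_batches))


# 2,500 texts in batches of 1,000 give 3 batches, so at most 3 processes are useful
n_process = pick_n_process(n_texts=2500, batch_size=1000, max_processes=8)
```

The result can be passed straight to nlp.pipe(texts, n_process=n_process, batch_size=1000). When the spawn start method is in use, the code that calls nlp.pipe() should sit under an if __name__ == "__main__": guard, as with any Python multiprocessing program.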
6. Use only necessary pipeline components.
Generating predictions from pipeline components that you don't need degrades performance unnecessarily. You can prevent this by disabling or excluding specific components, either when loading a pipeline (i.e. with spacy.load()) or during processing (i.e. with nlp.pipe()).
If you have limited memory, exclude the components you don't need, for example:
import spacy
# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
If you might need a particular component later in your program, but still want to improve processing speed for tasks that don't require those components in the interim, use disable, for example:
import spacy
# Load the tagger but don't enable it
nlp = spacy.load("en_core_web_sm", disable=["tagger"])
# ... perform some tasks with the pipeline that don't require the tagger
# Eventually enable the tagger
nlp.enable_pipe("tagger")
Note that the lemmatizer depends on tagger+attribute_ruler or morphologizer for a number of languages. If you disable any of these components, you'll see lemmatizer warnings unless the lemmatizer is also disabled.
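Components can also be switched off temporarily, either for a block of code or for a single nlp.pipe() call. The sketch below uses a blank English pipeline with the rule-based sentencizer purely as a stand-in, so that it runs without downloading a trained pipeline; with a loaded pipeline you would name components such as "ner" or "parser" instead.

```python
import spacy

# Stand-in pipeline: tokenizer plus sentencizer, no trained model required
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Temporarily disable the component; it is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
    doc = nlp("No sentence boundaries are set here. None at all.")

active_after = list(nlp.pipe_names)

# Alternatively, disable components for one nlp.pipe() call only
docs = list(nlp.pipe(["One text. Two sentences."], disable=["sentencizer"]))
```

nlp.pipe_names lists only the currently active components, which makes it a convenient check that the pipeline is in the state you expect before processing a large batch.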
7. Save and load progress to avoid re-computation.
If you have modified the pipeline or vocabulary, updated model components, processed documents etc., it is worth saving your progress so that you can reload it later. This requires translating the contents/structure of an object into a format that can be saved, a process known as serialization.
Serializing the pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
# ... some changes to pipeline
# Save serialized pipeline
nlp.to_disk("./en_my_pipeline")
# Load serialized pipeline (spacy.load() rebuilds it from the saved config)
nlp = spacy.load("./en_my_pipeline")
Serializing multiple Doc objects
The DocBin class provides an easy method for serializing/deserializing multiple Doc objects, which is also more efficient than calling Doc.to_bytes() on every Doc object. For example:
import spacy
from spacy.tokens import DocBin
texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts))
doc_bin = DocBin(docs=docs)
# Save the serialized DocBin to a file
doc_bin.to_disk("./data.spacy")
# Load a serialized DocBin from a file
doc_bin = DocBin().from_disk("./data.spacy")
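A loaded DocBin still has to be turned back into Doc objects before the saved work can be reused, which requires a Vocab. The following is a minimal round-trip sketch, assuming a blank English pipeline (tokenizer only) and a temporary directory so that it runs without a trained pipeline or leftover files; in practice you would pass the vocab of the pipeline that produced the Docs.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # stand-in for a loaded pipeline such as en_core_web_sm
texts = ["One document.", "Another document."]
doc_bin = DocBin(docs=list(nlp.pipe(texts)))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data.spacy"
    doc_bin.to_disk(path)
    loaded = DocBin().from_disk(path)

# get_docs() needs a Vocab to rebuild the Docs; reuse the pipeline's own
restored = list(loaded.get_docs(nlp.vocab))
restored_texts = [doc.text for doc in restored]
```

The restored Docs carry the stored annotations, so they do not need to be passed through the pipeline again.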
Footnotes
(1) "Hyper-threading" is Intel's trademarked name for its proprietary Simultaneous Multi-Threading (SMT) implementation, which improves parallelisation of computations (i.e. doing multiple tasks at once). AMD has SMT as well; it just doesn't have a fancy name. In short, processors with 2-way SMT (SMT-2) allow an Operating System (OS) to treat each physical core on the processor as two cores (referred to as "virtual cores"). Processors with SMT will perform better on tasks that can make use of these multiple "cores", sometimes referred to as "threads" (e.g. the Ryzen 5600X is a 6-core/12-thread processor: 6 physical cores which, with SMT-2, give 12 "virtual cores" or "threads"). Note that Intel has recently released a CPU architecture with e-cores, which don't have hyper-threading even though the other cores on the processor (namely, p-cores) do. This is why a chip like the i5-12600k has 10 cores but 16 threads rather than 20: only the 6 p-cores have hyper-threading, while the 4 e-cores do not.