A checklist for Spacy optimization?

I have been trying to understand how to systematically make Spacy run as fast as possible for a long time and I would like this post to become a wiki-style public post if possible.

Here is what I currently know, with subsidiary questions on each point:

1. Spacy will run faster on faster hardware. For example, try a computer with more CPU cores, or more RAM/primary memory.

What I do not know:

  • What specific aspects of the execution of Spacy - especially the main one of instantiating the Doc object - depend more on CPU vs. RAM and why?
  • Is the instantiation of a Doc object a sequence of arithmetical calculations (the compiled binary of the neural networks), so the more CPU cores, the more calculations can be done at once, therefore faster? Does that mean increasing RAM would not make this process faster?
  • Are there any other aspects of CPUs or GPUs to watch out for, other than cores, that would make one chip better than another, for Spacy? Someone mentioned "hyper threading".
  • Is there any standard mathematical estimate of time per pipeline component, such as parser, relative to input string length? Like Parser, seconds = number of characters in input? / number of CPU cores

2. You can make Spacy run faster by removing components you don't need, for example by nlp = spacy.load("en_core_web_sm", disable=['tagger', 'ner', 'lemmatizer', 'textcat'])

  • Just loading the Spacy module itself with import spacy is slightly slow. If you haven't even loaded the language model yet, what are the most significant things being loaded here, apart from just adding functions to the namespace? Is it possible to only load a part of the module you need?

3. You can make Spacy faster by using certain options that simply make it run faster.

  • I have read about multiprocessing with nlp.pipe, n_process, batch_size and joblib, but that's for multiple documents and I'm only doing a single document right now.

4. You can make Spacy faster by minimising the number of times it has to perform the same operations.

  • You can keep Spacy alive on a server and pass processing commands to it when you need to

  • You can serialize a Doc to reload it later, and you can further exclude attributes you don't need with doc.to_bytes(exclude=["tensor"]) or doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])

5. Anything else?

Tanta answered 24/10, 2022 at 13:23 Comment(2)
I think the well-known one is sentencizer vs. senter module (rule based vs data driven model). I am interested to know the answers to your questions. - Jaquiss
Stack Overflow isn't made for wiki-style comprehensive posts like this, and in fact it's very hard to answer so many different questions in the SO format. - Schreibman

Checklist

The following checklist focuses on runtime performance rather than training (i.e. on using existing config.cfg files loaded through the convenience wrapper spacy.load(), rather than training your own models and creating a new config.cfg file). Most of the points still apply to training, however. The list is not comprehensive: the spaCy library is extensive, and there are many ways to build pipelines and carry out tasks, so covering every case here is impractical. Regardless, it is intended as a handy reference and starting point.

Summary

  1. If more powerful hardware is available, use it.
  2. Use (optimally) small models/pipelines.
  3. Use your GPU if possible.
  4. Process large texts as a stream and buffer them in batches.
  5. Use multiprocessing (if appropriate).
  6. Use only necessary pipeline components.
  7. Save and load progress to avoid re-computation.

1. If more powerful hardware is available, use it.

CPU. At runtime, most of spaCy's work consists of CPU instructions that allocate memory, assign values to memory and perform computations. In terms of speed, this work is CPU-bound rather than RAM-bound, so performance depends predominantly on the CPU, and opting for a better CPU rather than more RAM is the smarter choice in most situations. As a general rule, newer CPUs with higher frequencies, more cores/threads, more cache etc. will deliver faster spaCy processing times. However, simply comparing these numbers across different CPU architectures is not useful. Instead, look at benchmarks such as cpu.userbenchmark.com (e.g. i5-12600K vs. Ryzen 9 5900X) and compare the single-core and multi-core performance of prospective CPUs to find those likely to perform better. See Footnote (1) on hyperthreading and core/thread counts.

RAM. The practical consideration for RAM is capacity: larger texts require more memory, whereas speed and latency are less important. If you have limited RAM capacity, disable NER and the parser when creating your Doc for large input text (e.g. doc = nlp("My really long text", disable=["ner", "parser"])). If you require these parts of the pipeline, you'll only be able to process approximately 100,000 * available_RAM_in_GB characters at a time; if you don't, you'll be able to process more than this. Note that the default spaCy input text limit is 1,000,000 characters; however, this can be changed by setting nlp.max_length = your_desired_length.
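
As a rough illustration of the rule of thumb above, the following sketch estimates a character budget from the available RAM and splits a long text into pieces that fit. The helper names (max_chars_for_ram, split_text) and the split-on-whitespace strategy are assumptions made for illustration and are not part of spaCy; in practice you would probably prefer to split on paragraph or sentence boundaries so that the parser and NER still see complete sentences.

```python
CHARS_PER_GB = 100_000  # rule of thumb quoted above for parser + NER


def max_chars_for_ram(available_ram_gb):
    """Approximate number of characters that can be processed at once."""
    return int(CHARS_PER_GB * available_ram_gb)


def split_text(text, max_chars):
    """Split text into chunks of at most max_chars, breaking on a space where possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break at the last space inside the current window
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks


budget = max_chars_for_ram(8)  # e.g. 8 GB available
long_text = "word " * 1000
chunks = split_text(long_text, max_chars=120)
print(budget, len(chunks))
```

Each chunk can then be passed to nlp() or nlp.pipe() separately, which keeps every individual Doc under the budget.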

GPU. If you opt to use a GPU, processing times can be improved for certain aspects of the pipeline which make use of GPU-based computations. See the section below on making use of your GPU. The same general rule as with CPUs applies here too: generally, newer GPUs with higher frequencies, more memory, larger memory bus widths, bigger bandwidth etc. will realise faster spaCy processing times.

Overclocking. If you're experienced with overclocking and have the correct hardware to be able to do it (adequate power supply, cooling, motherboard chipset), then another effective way to gain extra performance without changing hardware is to overclock your CPU/GPU.

2. Use (optimally) small models/pipelines.

When computation resources are limited, and/or accuracy is less of a concern (e.g. when experimenting or testing ideas), load spaCy pipelines that are efficiency focused (i.e. those with smaller models). For example:

import spacy

# Load a "smaller" pipeline for faster processing
nlp = spacy.load("en_core_web_sm")
# Load a "larger" pipeline for more accuracy
nlp = spacy.load("en_core_web_trf")

As a concrete example of the difference, on the same system (using the CPU), the en_core_web_lg pipeline, which is smaller than en_core_web_trf, processes 10,014 words per second, whereas en_core_web_trf processes only 684. Remember that there is often a trade-off between speed and accuracy.
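
To see how a given pipeline performs on your own hardware, you can time it yourself. The sketch below is one simple way to do that; the helper name words_per_second is hypothetical, it counts tokens (which include punctuation) rather than words, and it uses spacy.blank("en") only so that it runs without a downloaded model.

```python
import time

import spacy


def words_per_second(nlp, texts):
    """Process texts with nlp.pipe and return tokens processed per second."""
    start = time.perf_counter()
    n_tokens = sum(len(doc) for doc in nlp.pipe(texts))
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


# Assumption: a tokenizer-only pipeline stands in for a trained one here;
# swap in spacy.load("en_core_web_sm") or another pipeline to compare them.
nlp = spacy.blank("en")
texts = ["This is a short example sentence."] * 1000
wps = words_per_second(nlp, texts)
print(f"{wps:,.0f} tokens per second")
```

Running the same function with different pipelines on the same texts gives a like-for-like comparison for your machine.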

3. Use your GPU if possible.

Due to the nature of neural network-based models, their computations can be run efficiently on a GPU, which speeds up processing. For instance, the en_core_web_lg pipeline processes 10,014 words per second on a CPU vs. 14,954 on a GPU.

spaCy can be installed for a CUDA-compatible GPU (i.e. Nvidia GPUs) by calling pip install -U spacy[cuda] in the command prompt. Once a GPU-enabled spaCy installation is present, call spacy.prefer_gpu() or spacy.require_gpu() somewhere in your program before any pipelines have been loaded. Note that require_gpu() will raise an error if no GPU is available. For example:

import spacy

spacy.prefer_gpu() # Or use spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")
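
It can be useful to check whether the GPU was actually picked up. A minimal sketch, relying on the fact that spacy.prefer_gpu() returns a boolean; the blank pipeline is only a stand-in so the example runs without a downloaded model:

```python
import spacy

# prefer_gpu() returns True if a GPU was activated and False otherwise,
# so it is safe to call on CPU-only machines.
using_gpu = spacy.prefer_gpu()
print("Running on GPU" if using_gpu else "Running on CPU")

# Load pipelines only after the GPU call; a blank pipeline stands in here
nlp = spacy.blank("en")
```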

4. Process large texts as a stream and buffer them in batches.

When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts (the default batch size is 1000) and process the texts as a stream using nlp.pipe(). For example:

import spacy

texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, batch_size=1000))
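
Wrapping nlp.pipe() in list() keeps every Doc in memory at once. For very large inputs, one option is to feed nlp.pipe() from a generator and consume the results lazily. The sketch below assumes a hypothetical read_lines helper and uses spacy.blank("en") as a stand-in pipeline; as_tuples=True lets you carry some context (here a line number) alongside each text.

```python
import spacy


def read_lines(lines):
    """Yield (text, context) pairs lazily; 'lines' could be an open file."""
    for line_no, line in enumerate(lines, start=1):
        text = line.strip()
        if text:
            yield text, {"line_no": line_no}


nlp = spacy.blank("en")  # stand-in for a trained pipeline
source = ["First document.", "", "Second document.", "Third document."]

token_counts = {}
# as_tuples=True passes the context through alongside each Doc
for doc, context in nlp.pipe(read_lines(source), batch_size=2, as_tuples=True):
    token_counts[context["line_no"]] = len(doc)

print(token_counts)
```

Only the values you extract (here the token counts) are kept, so memory use stays roughly proportional to the batch size rather than to the whole corpus.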

5. Use multiprocessing (if appropriate).

To make use of multiple CPU cores, spaCy includes built-in support for multiprocessing with nlp.pipe() using the n_process option. For example:

import spacy

texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=4))

Note that each process requires its own memory. With the spawn start method (the default on Windows and macOS), model data has to be copied into memory for every new process, so the larger the model, the more overhead there is in starting a process. Therefore, if you are only doing small tasks, it is recommended that you increase the batch size and use fewer processes. For example:

import spacy

texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts, n_process=2, batch_size=2000)) # default batch_size = 1000

Finally, multiprocessing is generally not recommended on GPUs because GPU memory is limited.
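
Because the overhead depends on the start method and on how many cores you actually have, it can help to inspect both before choosing n_process. The following is a sketch using only the standard library; the pick_n_process helper and its cap at the logical CPU count are assumptions for illustration, not a spaCy recommendation.

```python
import multiprocessing as mp
import os


def pick_n_process(requested):
    """Cap the requested process count at the number of logical CPUs."""
    logical_cpus = os.cpu_count() or 1
    return max(1, min(requested, logical_cpus))


start_method = mp.get_start_method()  # "fork", "spawn" or "forkserver"
n_process = pick_n_process(4)
print(f"start method: {start_method}, n_process: {n_process}")

# With "spawn", wrap the call in a main guard so that child processes can
# re-import the script safely, e.g.:
#
# if __name__ == "__main__":
#     docs = list(nlp.pipe(texts, n_process=n_process, batch_size=2000))
```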

6. Use only necessary pipeline components.

Generating predictions from models in the pipeline that you don't require unnecessarily degrades performance. One can prevent this by either disabling or excluding specific components, either when loading a pipeline (i.e. with spacy.load()) or during processing (i.e. with nlp.pipe()).

If you have limited memory, exclude the components you don't need, for example:

import spacy

# Load the pipeline without the entity recognizer
nlp = spacy.load("en_core_web_sm", exclude=["ner"])

If you might need a particular component later in your program, but still want to improve processing speed for tasks that don't require those components in the interim, use disable, for example:

import spacy

# Load the tagger but don't enable it
nlp = spacy.load("en_core_web_sm", disable=["tagger"])
# ... perform some tasks with the pipeline that don't require the tagger
# Eventually enable the tagger
nlp.enable_pipe("tagger")

Note that the lemmatizer depends on tagger+attribute_ruler or morphologizer for a number of languages. If you disable any of these components, you’ll see lemmatizer warnings unless the lemmatizer is also disabled.
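
For disabling components only for part of a program, nlp.select_pipes() can be used as a context manager so that the components are restored automatically afterwards. A minimal sketch, assuming a blank English pipeline with only the rule-based sentencizer so that it runs without a downloaded model:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based, needs no trained model
text = "This is one sentence. This is another."

# Temporarily disable the sentencizer; it is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    doc_without = nlp(text)

names_after = list(nlp.pipe_names)
doc_with = nlp(text)

print(names_inside, names_after)
print(doc_without.has_annotation("SENT_START"), doc_with.has_annotation("SENT_START"))
```

With a trained pipeline the same pattern applies to components such as "ner" or "parser".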

7. Save and load progress to avoid re-computation.

If one has been modifying the pipeline or vocabulary, made updates to model components, processed documents etc., there is merit in saving one's progress to reload at a later date. This requires one to translate the contents/structure of an object into a format that can be saved -- a process known as serialization.

Serializing the pipeline

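import spacy

nlp = spacy.load("en_core_web_sm")
# ... some changes to pipeline
# Save serialized pipeline
nlp.to_disk("./en_my_pipeline")
# Load serialized pipeline (from_disk modifies the existing nlp object in place)
nlp.from_disk("./en_my_pipeline")
# In a later session, spacy.load("./en_my_pipeline") rebuilds the pipeline from the saved directory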

Serializing multiple Doc objects

The DocBin class provides an easy method for serializing/deserializing multiple Doc objects, which is also more efficient than calling Doc.to_bytes() on every Doc object. For example:

import spacy
from spacy.tokens import DocBin

texts = ["One document.", "...", "Lots of documents"]
nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(texts))
doc_bin = DocBin(docs=docs)
# Save the serialized DocBin to a file
doc_bin.to_disk("./data.spacy")
# Load a serialized DocBin from a file
doc_bin = DocBin().from_disk("./data.spacy")
# Recover the Doc objects (requires the pipeline's vocab)
docs = list(doc_bin.get_docs(nlp.vocab))
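
If you only need some annotations later, you can also restrict what the DocBin stores. The sketch below keeps only sentence boundaries and round-trips the data in memory; the choice of attribute and the blank pipeline with a sentencizer are assumptions made so the example runs without a downloaded model.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
texts = ["One document. Two sentences.", "Another document."]

# Store only sentence boundaries (token text and spacing are always kept)
doc_bin = DocBin(attrs=["SENT_START"], store_user_data=False)
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

data = doc_bin.to_bytes()  # could equally be written with to_disk()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
restored_texts = [doc.text for doc in restored]
print(restored_texts)
```

Storing fewer attributes gives smaller files and faster loading, at the cost of losing the annotations you left out.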

Footnotes

(1) "Hyper-threading" is Intel's trademarked term for its proprietary Simultaneous Multi-Threading (SMT) implementation, which improves parallelisation of computations (i.e. doing multiple tasks at once). AMD has SMT as well; it just doesn't have a fancy name. In short, processors with 2-way SMT (SMT-2) allow an Operating System (OS) to treat each physical core on the processor as two cores (referred to as "virtual cores"). Processors with SMT will perform better on tasks that can make use of these multiple "cores", sometimes referred to as "threads" (e.g. the Ryzen 5600X is a 6-core/12-thread processor: 6 physical cores which, with SMT-2, appear as 12 "virtual cores" or "threads"). Note that Intel has recently released a CPU architecture with E-cores, which don't have hyper-threading, even though the other cores on the processor (namely, P-cores) do. Hence you will see chips like the i5-12600K that have 10 cores and hyper-threading but 16 threads rather than 20: only the 6 P-cores have hyper-threading, while the 4 E-cores do not, giving 6 × 2 + 4 = 16 threads in total.
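
The thread arithmetic in this footnote can be written out as a small sketch. The logical_threads helper is hypothetical; the core counts are the ones quoted above.

```python
import os


def logical_threads(core_groups):
    """Sum cores * SMT ways over (core_count, smt_ways) groups."""
    return sum(cores * smt_ways for cores, smt_ways in core_groups)


# 6 cores, all with SMT-2
ryzen_5600x = logical_threads([(6, 2)])
# 6 P-cores with SMT-2 plus 4 E-cores without SMT
i5_12600k = logical_threads([(6, 2), (4, 1)])

# os.cpu_count() reports logical CPUs (threads), not physical cores
print(ryzen_5600x, i5_12600k, os.cpu_count())
```

This matters when choosing n_process: the number reported by the operating system is usually the thread count, not the number of physical cores.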

Pepe answered 25/10, 2022 at 12:8 Comment(0)
