First I want to say that I don't have much experience with PyTorch, ML, NLP, or other related topics, so I may confuse some concepts. Sorry.
I downloaded a few models from Hugging Face, organized them in one Python script, and ran a benchmark to get an overview of performance. While benchmarking I monitored CPU usage and saw that only 50% of the CPU was used. I have 8 vCPUs, but only 4 of them are loaded at 100% at any given time. The load jumps around: cores 1, 3, 5, 7 may be at 100%, then cores 2, 4, 6, 8. But total CPU load never rises above 50%, and it never drops below 50% either; the 50% load is constant.
After some quick googling I found the parallelism docs. I called `get_num_threads()` and `get_num_interop_threads()`, and the output was 4 for both calls — only 50% of the available CPU cores, which roughly explains why CPU load was at 50%. Then I called `set_num_threads(8)` and `set_num_interop_threads(8)` and ran the benchmark again. CPU usage was at a constant 100%. In general performance was a bit faster, but some models started running a bit slower than at 50% CPU.
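For reference, a minimal sketch of the calls above (assuming `torch` is installed; note that the inter-op pool size can only be set once, before any parallel work has started, so it belongs right after the import):

```python
import torch

# Inter-op pool size can only be set once, before any parallel work
# (e.g. before the first forward pass); otherwise it raises a RuntimeError.
torch.set_num_interop_threads(8)

# Intra-op threads (used inside ops like matmul) can be changed at any time.
torch.set_num_threads(8)

print(torch.get_num_threads())          # -> 8
print(torch.get_num_interop_threads())  # -> 8
```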
So I wonder: why does PyTorch by default use only half of the CPU? Is that the optimal and recommended setting? Should I manually call `set_num_threads()` and `set_num_interop_threads()` with all available CPU cores if I want to achieve the best performance?
Edit.
I ran some additional benchmarks:
- one PyTorch process with 50% of vCPUs is a bit faster than one PyTorch process with 100% of vCPUs. Earlier it was the other way around, so I think it depends on the models being used.
- two concurrent PyTorch processes with 50% of vCPUs each handle more inputs than one PyTorch process with 50% of vCPUs, but it is not a 2x increase, only ~1.2x. The processing time of a single input is much slower than with one PyTorch process.
- two concurrent PyTorch processes with 100% of vCPUs each can't complete even one input. I guess the CPU is constantly switching between these processes.
So thanks to Phoenix's answer, I think it is completely reasonable to use PyTorch's default settings, which set the number of threads according to the number of physical (not virtual) cores.
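As a side note, `os.cpu_count()` reports logical vCPUs, not physical cores. A rough sketch of estimating the physical core count, assuming 2-way SMT/hyperthreading (an assumption — on Linux, `lscpu` or `psutil.cpu_count(logical=False)` give the real answer):

```python
import os

logical_cores = os.cpu_count()  # logical vCPUs, e.g. 8 on my machine

# Assumption: 2-way SMT/hyperthreading, so physical = logical / 2.
# Verify with `lscpu` or psutil.cpu_count(logical=False) on a real box.
physical_cores = max(1, logical_cores // 2)

print(f"logical: {logical_cores}, estimated physical: {physical_cores}")
```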
Edit.
The PyTorch documentation about this: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html