Why is Numpy with Ryzen Threadripper so much slower than Xeon?

I know that Numpy can use different backends like OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so people usually suggest using OpenBLAS on AMD, right?

I use the following test code:

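import numpy as np

def testfunc(x):
    np.random.seed(x)                # fixed seed for reproducible input
    X = np.random.randn(2000, 4000)
    np.linalg.eigh(X @ X.T)          # eigendecomposition of a 2000x2000 symmetric matrix

%timeit testfunc(0)  # IPython/Jupyter magic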

I have tested this code using different CPUs:

  • On Intel Xeon E5-1650 v3, this code runs in 0.7s using 6 of the 12 logical cores.
  • On AMD Ryzen 5 2600, this code runs in 1.45s using all 12 logical cores.
  • On AMD Ryzen Threadripper 3970X, this code runs in 1.55s using all 64 logical cores.

I am using the same Conda environment on all three systems. According to np.show_config(), the Intel system uses the MKL backend for Numpy (libraries = ['mkl_rt', 'pthread']), whereas the AMD systems use OpenBLAS (libraries = ['openblas', 'openblas']). The CPU core usage was determined by observing top in a Linux shell:

  • For the Intel Xeon E5-1650 v3 CPU (6 physical cores), top shows 12 logical cores (6 idling).
  • For the AMD Ryzen 5 2600 CPU (6 physical cores), top shows 12 logical cores (none idling).
  • For the AMD Ryzen Threadripper 3970X CPU (32 physical cores), top shows 64 logical cores (none idling).
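
The distinction between physical cores and the logical cores listed by top matters for the thread counts discussed here. A minimal sketch for counting both on Linux, assuming /proc/cpuinfo exposes the `physical id` and `core id` fields (as it typically does on x86). The helper names `count_physical_cores` and `report_cores` are hypothetical:

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            # Socket number; appears before "core id" in each processor entry.
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)


def report_cores(path="/proc/cpuinfo"):
    """Print logical and physical core counts (Linux only)."""
    with open(path) as f:
        physical = count_physical_cores(f.read())
    print(f"logical cores:  {os.cpu_count()}")
    print(f"physical cores: {physical}")
```

Calling `report_cores()` on the machines above should reproduce the 12/6 and 64/32 figures, provided the kernel reports both fields.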

The above observations give rise to the following questions:

  1. Is it normal that linear algebra on up-to-date AMD CPUs using OpenBLAS is that much slower than on a six-year-old Intel Xeon? (also addressed in Update 3)
  2. Judging by the observed CPU load, it looks like Numpy utilizes the multi-core environment in all three cases. How can it be that the Threadripper is even slower than the Ryzen 5, even though it has more than five times as many physical cores? (also see Update 3)
  3. Is there anything that can be done to speed up the computations on the Threadripper? (partially answered in Update 2)

Update 1: The OpenBLAS version is 0.3.6. I read somewhere that upgrading to a newer version might help. However, with OpenBLAS updated to 0.3.10, the run time for testfunc is still 1.55s on AMD Ryzen Threadripper 3970X.
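
One way to confirm which BLAS library and version are actually loaded at run time is the third-party threadpoolctl package. This is an assumption: it is not mentioned in the question and has to be installed separately. A minimal sketch, where `summarize_pools` and `show_loaded_blas` are hypothetical helper names:

```python
def summarize_pools(pools):
    """Format the list of dicts returned by threadpoolctl.threadpool_info()."""
    lines = []
    for pool in pools:
        lines.append(
            "{api} {version}: {threads} threads, kernel={arch}".format(
                api=pool.get("internal_api", "?"),
                version=pool.get("version", "?"),
                threads=pool.get("num_threads", "?"),
                # OpenBLAS entries usually report the selected kernel here;
                # other libraries may not provide this key.
                arch=pool.get("architecture", "n/a"),
            )
        )
    return lines


def show_loaded_blas():
    # Imported here so summarize_pools stays usable without these packages.
    import numpy as np  # noqa: F401  (importing numpy loads its BLAS library)
    from threadpoolctl import threadpool_info

    for line in summarize_pools(threadpool_info()):
        print(line)
```

If the reported kernel does not match the CPU generation, that may be worth checking before comparing run times.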


Update 2: Using the MKL backend for Numpy in conjunction with the environment variable MKL_DEBUG_CPU_TYPE=5 (as described here) reduces the run time for testfunc on AMD Ryzen Threadripper 3970X to only 0.52s, which is more or less satisfying. For the record, setting this variable via ~/.profile did not work for me on Ubuntu 20.04, and neither did setting it from within Jupyter. Putting it into ~/.bashrc works. Still, the Threadripper is only about 35% faster than an old Intel Xeon: is this all we get, or can we get more out of it?
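
A possible reason why setting the variable from within Jupyter had no effect is that it probably has to be in the process environment before numpy (and with it MKL) is first loaded. This is an assumption about the cause, not something confirmed in the question. A sketch of setting it from Python at the very top of a script or notebook, where `set_before_numpy` is a hypothetical helper:

```python
import os
import sys
import warnings


def set_before_numpy(name, value):
    """Set an environment variable, warning if numpy is already imported."""
    if "numpy" in sys.modules:
        warnings.warn(
            f"numpy is already imported; {name} may be ignored. "
            "Restart the interpreter or kernel and set it first."
        )
    os.environ[name] = str(value)


set_before_numpy("MKL_DEBUG_CPU_TYPE", 5)
# import numpy as np  # import numpy only after the variable is set
```

Note that a shell escape such as `!export MKL_DEBUG_CPU_TYPE=5` in a notebook runs in a separate subprocess and does not change the environment of the running kernel.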


Update 3: I played around with the number of threads used by MKL/OpenBLAS:

[Figure: run time by the number of threads and library]

The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test. The conclusions from this test:

  • The single-core performance of the Threadripper using OpenBLAS is a bit better than the single-core performance of the Xeon (11% faster). However, its single-core performance is even better when using MKL (34% faster).
  • The multi-core performance of the Threadripper using OpenBLAS is ridiculously worse than the multi-core performance of the Xeon. What is going on here?
  • The Threadripper performs overall better than the Xeon, when MKL is used (26% to 38% faster than Xeon). The overall best performance is achieved by the Threadripper using 16 threads and MKL (36% faster than Xeon).
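
A sweep like the one summarized above can be scripted by starting a fresh Python process per thread count, since the thread-count variables are normally read when the BLAS library is loaded. A minimal sketch, assuming the benchmark from the question with random-number generation left out of the timed region. `make_env` and `sweep` are hypothetical helpers and the thread counts are only examples:

```python
import os
import subprocess
import sys

# Child-process benchmark: times the matrix product and eigh only.
BENCHMARK = """
import time
import numpy as np
np.random.seed(0)
X = np.random.randn(2000, 4000)
start = time.perf_counter()
np.linalg.eigh(X @ X.T)
print(time.perf_counter() - start)
"""


def make_env(num_threads, base=None):
    """Return a copy of the environment with BLAS thread limits applied."""
    env = dict(os.environ if base is None else base)
    for var in ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        env[var] = str(num_threads)
    return env


def sweep(thread_counts=(1, 2, 4, 8, 16, 32, 64)):
    """Run the benchmark once per thread count and print the wall time."""
    for n in thread_counts:
        result = subprocess.run(
            [sys.executable, "-c", BENCHMARK],
            env=make_env(n), capture_output=True, text=True, check=True,
        )
        print(f"{n:3d} threads: {float(result.stdout):.2f} s")
```

Each configuration is measured only once here, so for numbers comparable to %timeit the child script would need to repeat the timed region and report the best or mean run.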

Update 4: Just for clarification: no, I do not think that (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which strongly contradicts the numbers I observed. According to my numbers, OpenBLAS performs ridiculously worse than MKL. The question is why. (a) and (b) both suggest using MKL_DEBUG_CPU_TYPE=5 in conjunction with MKL to achieve maximum performance. This might be right, but it explains neither why OpenBLAS is that slow, nor why even with MKL and MKL_DEBUG_CPU_TYPE=5 the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.

Myers answered 7/7, 2020 at 20:19 Comment(23)
Maybe relevant: pugetsystems.com/labs/hpc/… Also Google "openblas vs MKL". - Millur
I'd suspect inter-core latency might be a bigger issue across CCX clusters of 4 cores on Threadripper? 3970X is a Zen 2 part, so it should have 2x 256-bit SIMD FMA throughput (per core), same as Intel Haswell. Perhaps a library tuned for AMD is only using 128-bit SIMD because that was sometimes better for Zen1. (Your Ryzen 5 2600 is a Zen1, 1x 128-bit FMA uop per clock, so it's crazy that it's slower than a Zen2). Different BLAS libraries might be a big factor. - Propertied
Perhaps using both logical cores of one physical core might be creating more cache misses; only using 6 cores on the Intel CPU leaves the full size of the private caches of each physical core for one thread. Also, what clock speeds are those chips running at? They should be similar. - Propertied
I'd advise running comparisons with different numbers of threads (OPENBLAS_NUM_THREADS, MKL_NUM_THREADS). Server processors have slower per-core speed, and multicore speedups in BLAS libraries are usually appalling. - Ilse
Generating random numbers takes a lot of time (1/4 of total time on my system). It would be better to only get the timings of np.linalg.eigh(X @ X.T). Also set MKL_NUM_THREADS to the number of physical cores. These BLAS algorithms usually scale negatively with virtual cores. - Gradey
For a broader overview you could also run ibench. Setting the OpenMP thread granularity may also help. Adapting this github.com/fo40225/Anaconda-Windows-AMD MKL_DEBUG_CPU_TYPE=5 for your 32 core CPU (xxx_cpuinfo.txt) - Gradey
Now that you have more perf ratios, including MKL on both machines, it would be even more useful / relevant to include clock speeds (specifically, the actual turbo clock speed your machine used when running those tests). - Propertied
@PeterCordes I was wondering, how to determine them? - Myers
Intel documents the single-core max turbo, and you can just manually look at clock speeds while the benchmark is running. (grep MHz /proc/cpuinfo or whatever). Ideally run your program under perf on Linux: perf stat my_benchmark to record HW performance counters which includes the cycles event, and will calculate the average clock speed the CPU actually ran at over the benchmark interval. (By dividing cycles by the task-clock kernel event.) - Propertied
Thanks @PeterCordes. However, I had to change the benchmark a bit, since now I measure the execution time using perf stat and not using Python's own %timeit instruction. This means that now the import numpy instruction is also measured. This leads to different results. This is why I decided to summarize them in a Google Sheet first: If you think that this experiment provides more important insights, I will replace the experiment and the results in my original question. Let me know! - Myers
You could run the whole Python timeit under perf just to find out the average clock speed, with the actual timed interval still being measured by Python. Or fork off a perf stat -p $PID after initializing, so it attaches right as you're starting the benchmark. - Propertied
As far as I know, pandas, scikit, PyTorch, TensorFlow, matplotlib, IPython, SymPy and NumExpr are using MKL, and numpy has been switching to OpenBLAS since 1.18. I was planning a Threadripper workstation, but I haven't got the time and knowledge to compile every one of these on my own. How did you decide? - Haberdasher
@Haberdasher I'm running Numpy 1.18.5 with MKL and the MKL_DEBUG_CPU_TYPE hack, the speed is ok. - Myers
Does this answer your question? When you have an AMD CPU, can you speed up code that uses the Intel-MKL? Your question has "the same answers. This includes not only word-for-word duplicates, but also the same idea expressed in different words". The linked question is more general (not specific to Ryzen/python/numpy). Disclaimer: The question I linked to is my own question. - Lakesha
@TrevorBoydSmith See my Update 4. - Myers
@theV0ID re "OpenBLAS performs ridiculously worse than MKL. The question is ... why OpenBLAS is that dead slow": this is the same in my opinion (or very similar) as asking why an open-source software implementation is slower than a closed-source software implementation, which cannot be answered because the closed-source software is not available. - Lakesha
@theV0ID re "32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon": how you profile and generate your measurements matters a lot when doing comparisons like this. If you post your benchmarking code, then someone could at least answer the question of 'why is my benchmarking code slower given x,y,z' (for example: Intel-Python benchmark code is open-sourced to show how/why so much faster). - Lakesha
Even if Intel's code was open-source it would still require knowledge of the hardware implementation... which again isn't available... and so an answer isn't possible because that hardware implementation isn't available. - Lakesha
@TrevorBoydSmith My benchmarking code is at the top of the original question. - Myers
@TrevorBoydSmith All I'm saying is that my measurements hugely contradict the observations in the link. According to the link, OpenBLAS is supposed to perform comparably to MKL. It does not, not even close. The question is why. Someone else observed that OpenBLAS performs comparably well, so I do not believe that this boils down to open source vs closed source. - Myers
Have you tried asking the maintainers at github.com/numpy/numpy? This is something I want to understand too. - Solar
@Solar Good idea, but I had no time yet. I will try to get that done in the next few days. - Myers
Your benchmark is absolutely wrong, and it looks like you are trying to start a holy war rather than make a real comparison. This part, X = np.random.randn(2000, 4000), most probably doesn't parallelise well. You have to use the internal numpy benchmark for a fair comparison: numpy.org/doc/stable/benchmarking.html. You have to limit the number of cores too, because your matrices or test conditions might be too weak (small) for the available resources (CPU core count etc.), which actually slows things down instead of speeding them up. - Urquhart
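
Several comments above suggest excluding random-number generation from the timed region. A sketch of a plain-Python variant of the question's benchmark that times each stage separately and does not rely on IPython's %timeit. `time_call` and `staged_benchmark` are hypothetical names:

```python
import time


def time_call(fn, repeats=3):
    """Return (best wall time in seconds, last result) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


def staged_benchmark(seed=0, shape=(2000, 4000)):
    # Imported lazily so thread-limit variables can be set beforehand.
    import numpy as np

    np.random.seed(seed)
    t_rand, X = time_call(lambda: np.random.randn(*shape))
    t_gemm, A = time_call(lambda: X @ X.T)
    t_eigh, _ = time_call(lambda: np.linalg.eigh(A))
    print(f"randn: {t_rand:.3f} s, X @ X.T: {t_gemm:.3f} s, eigh: {t_eigh:.3f} s")
```

Reporting the three stages separately shows whether the slow part is the single-threaded random generation, the matrix product, or the eigendecomposition.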
D
4

As of 2021, Intel has unfortunately removed the MKL_DEBUG_CPU_TYPE environment variable from MKL, which prevents people on AMD from using the workaround presented in the accepted answer. This means that the workaround no longer works with current MKL releases, and AMD users have to either switch to OpenBLAS or keep using MKL without it.

To still use the workaround with an older MKL release, follow this method:

  1. Create a conda environment with conda's and NumPy's MKL=2019.
  2. Activate the environment
  3. Set MKL_DEBUG_CPU_TYPE = 5

The commands for the above steps:

  1. conda create -n my_env -c anaconda python numpy mkl=2019.* blas=*=*mkl
  2. conda activate my_env
  3. conda env config vars set MKL_DEBUG_CPU_TYPE=5

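And that's it! Note that the variable set in step 3 only takes effect after the environment has been deactivated and activated again.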
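
To check which MKL release an environment actually uses, the version string can be inspected together with the variable. A sketch assuming the version string has the form reported in the comments on this answer (obtained via mkl.get_version_string() from the mkl-service package). `parse_mkl_version` and `report_workaround_status` are hypothetical helpers:

```python
import os
import re


def parse_mkl_version(version_string):
    """Extract (major, minor) from an MKL version string, or None."""
    match = re.search(r"Version (\d+)\.(\d+)", version_string)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def report_workaround_status():
    import mkl  # provided by the mkl-service package

    version = parse_mkl_version(mkl.get_version_string())
    flag = os.environ.get("MKL_DEBUG_CPU_TYPE")
    print(f"MKL version: {version}, MKL_DEBUG_CPU_TYPE={flag!r}")
```

The sketch only reports the two values. Which releases still honor the variable is not settled in this thread, so timing the benchmark with and without the variable remains the decisive check.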

Droppings answered 26/8, 2021 at 17:8 Comment(10)
You do currently have enough rep to comment, thanks to your useful contributions getting upvotes :). This is actually a relevant answer for future readers facing the problem of slow MKL Numpy on AMD CPUs, though, so it's fine. In some cases it might be better to suggest an edit to an existing answer, pointing out that it doesn't work with the latest MKL, but here a separate answer makes as much sense as editing 3 different answers. Especially if you make this into an answer that does directly address the question here. - Propertied
I think you can still use an older MKL version, right? At least, 2020.0 still works for me. - Myers
I use mkl=2020.0 along with blas=*=mkl in my environment .yml files. However, I am not 100% sure that it works, since I have noticed some strange slowdowns in a recently created environment. - Myers
There is no "accepted answer" on this question. It's usually not a good idea to copy/paste identical answers onto different questions, since future editors will need to find them both / all. This should probably still be a link to your answer on another question for the full step-by-step guide, maybe just say here to use 2019 MKL with the MKL_DEBUG_CPU_TYPE=5 environment setting, see that for full details. - Propertied
And you can make the rest of this answer be specific to this question by describing what Intel's "cripple-AMD" function actually does. - Propertied
I am confused: we are in October 2021 and typing !export MKL_DEBUG_CPU_TYPE=5 before running my Python script still improved the overall processing time. - Far
@Far What is your MKL version? Use mkl.get_version_string() to find out. - Droppings
Thanks for your reply @Astro: mkl.get_version_string() yields 'Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications' - Far
@Far And how is that working? Because Intel removed that flag starting with MKL 2021, and !export MKL_DEBUG_CPU_TYPE=5 shouldn't give you any boost. It is bizarre that you are able to get increased performance. Are you sure that you are getting an improvement? If yes, let us know! - Droppings
People, it seems that the workaround has been removed since it's no longer necessary: as of 2021, recent versions of MKL should perform well even on Ryzens. I'd like Ryzen users to confirm this. - Annieannihilate
U
2

Wouldn't it make sense to try using an optimized BLIS library from AMD?

Maybe I am missing (misunderstanding) something, but I would assume you could use BLIS instead of OpenBLAS. The only potential problem could be that AMD BLIS is optimized for AMD EPYC (but you're using Ryzen). I'm VERY curious about the results, since I'm in the process of buying a server for work, and am considering AMD EPYC and Intel Xeon.

Here are the respective AMD BLIS libraries: https://developer.amd.com/amd-aocl/

Urethroscope answered 13/8, 2020 at 14:15 Comment(2)
Even though installation of BLIS via conda looks easy, it does not seem straightforward to me how to make Numpy actually use BLIS as the backend. However, according to this, MKL outperforms BLIS on Ryzen ("with some quick/dirty results on my Ryzen 3700X [...] You can see performance basically double on MKL when MKL_DEBUG_CPU_TYPE=5 is used"). - Myers
How to compile and install numpy with BLIS linked to AMD's AOCL BLIS:
    # download files from developer.amd.com/amd-aocl
    # unpack to e.g. /home/AOCL/2.2
    # create ~/.numpy-site.cfg
    [blis]
    libraries = blis
    library_dirs = /home/AOCL/2.2/lib
    include_dirs = /home/AOCL/2.2/include
    runtime_library_dirs = /home/AOCL/2.2/lib
    # git clone github.com/numpy/numpy.git
    # cd numpy
    # pip install .
- Urethroscope
E
1

I think this should help:

"The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND,OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set." https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/

How to set up: 'Make the setting permanent by entering MKL_DEBUG_CPU_TYPE=5 into the System Environment Variables. This has several advantages, one of them being that it applies to all instances of Matlab and not just the one opened using the .bat file' https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/?sort=new

Eventful answered 31/7, 2020 at 14:11 Comment(3)
If that fully explains the perf diff, this question is a duplicate of When you have an AMD CPU, can you speed up code that uses the Intel-MKL? (Those links with more details and test results might be good as a comment there.) - Propertied
Yeah, I've been on that link before, but doesn't the "OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5" actually contradict the performance measures I reported? OpenBLAS does significantly worse than MKL. - Myers
By strange coincidence I wrote the same solution a day earlier over here https://mcmap.net/q/1977310/-when-you-have-an-amd-cpu-can-you-speed-up-code-that-uses-the-intel-mkl for a more general question about Intel-MKL that is not specific to AMD-Ryzen and not specific to numpy. One of the comments on my solution pointed me over here. - Lakesha
