Python Numpy : np.int32 "slower" than np.float64

I would like to understand a strange behavior of Python. Let us consider a matrix M with shape 6000 x 2000. This matrix is filled with signed integers. I want to compute np.transpose(M)*M. Two options:

  • When I do it "naturally" (i.e. without specifying any dtype), numpy selects the type np.int32 and the operation takes around 150 s.
  • When I force the type to be np.float64 (using dtype=...), the same operation takes around 2 s.

How can we explain this behavior? I was naively thinking that an integer multiplication was cheaper than a float multiplication.
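
For reference, a minimal sketch of the two cases I am comparing (using np.dot for the product; the value range is arbitrary and the timings are from my machine):

import numpy as np

M = np.random.randint(-500, 500, size=(6000, 2000))   # signed integers
print(M.dtype)                                         # np.int32 on my machine

# "natural" version: integer dtype is kept -> around 150 s here
P_int = np.dot(np.transpose(M), M)

# forcing np.float64 -> around 2 s here
F = M.astype(np.float64)
P_float = np.dot(np.transpose(F), F)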

Thanks a lot for your help.

Pinder answered 11/9, 2013 at 14:3 Comment(1)
Does it make sense that with int32 it gets overflows, which cause Python overhead, while float64 doesn't have that problem? I am not an expert on how things are handled internally in numpy, but it is all I can think of. – Joiejoin

No, integer multiplies aren't cheaper. But more on that later. Most likely (I am 99% sure) numpy calls a BLAS routine under the hood, which can be as efficient as 90% of peak CPU performance. There are no special provisions for integer matrix multiplies; most likely it is done in Python rather than a machine-compiled version. I am actually wrong on this, see below.

With regard to int vs. float speed: on most architectures (Intel) they are roughly the same, around 3-5 cycles or so per instruction, and both have serial (x87) and vector (XMM) versions. On Sandy Bridge, PMUL*** (integer vector multiply) is 5 cycles, and so are the MULP* (floating-point multiplies). With Sandy Bridge you also have 256-bit SIMD vector ops (YMM), which give you 8 float ops per instruction; I am not sure if there is an int counterpart.

This is a great reference: http://www.agner.org/optimize/instruction_tables.pdf

That said, instruction latencies don't explain a 75x speed difference. It is probably a combination of an optimized BLAS (probably threaded) for the float case and int32 being handled in Python rather than C/Fortran.
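
A quick way to check which BLAS implementation a given numpy build is linked against (the exact output format varies between numpy versions):

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against. An optimized,
# threaded library (MKL, OpenBLAS, ATLAS) gives the fast float64 path;
# the reference Netlib BLAS is serial and much slower.
np.show_config()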

I profiled the following snippet:

>>> import numpy as np
>>> F = (np.random.random((6000,2000))+4)
>>> I = F.astype(np.int32)
>>> np.dot(F, F.transpose()); np.dot(I, I.transpose())

and this is what oprofile says:

CPU_CLK_UNHALT...|
  samples|      %|
------------------
  2076880 51.5705 multiarray.so
  1928787 47.8933 libblas.so.3.0

However, this libblas is the unoptimized, serial Netlib BLAS. With a good BLAS implementation, that 47% will be much lower, especially if it is threaded.

Edit: It seems numpy does provide a compiled version of integer matrix multiply.

Fishgig answered 11/9, 2013 at 14:13 Comment(5)
That would be a good explanation. I am sure as well that numpy leverages BLAS for linear algebra operations. It may default to pure Python when it comes to int matrices, though, which could explain this huge difference. – Pinder
If you really want to get to the root of it, fire up OProfile to profile your application. It will point exactly to where the slowdowns are. – Fishgig
By looking quickly at github.com/numpy/numpy/blob/master/numpy/core/blasdot/…, it seems that numpy indeed uses BLAS specifically for the types float, double, cfloat and cdouble. – Pinder
This is very surprising... The sensible thing to do would be to cast the int32 to float64 with no precision loss, run them through the fast BLAS library, and convert the result back to int32. That's the way most ufuncs in numpy would do things. np.dot on ints is about 7x slower than numpy.core.umath_tests.matrix_multiply, which is a straightforward C implementation of matrix multiplication. Weird! – Demark
@Demark You could probably do it in float32, which would be a little faster. Although accounting for int overflow in its float form would be a nightmare, unless one is sure that no overflow happens. – Fishgig
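
A sketch of the workaround suggested in the comments above: cast the int32 operands to float64, use the fast BLAS path, and cast the result back. The function name is mine, and this is only safe if every entry of the exact product stays below 2**53 in magnitude (and fits back into int32):

import numpy as np

def int32_matmul_via_float64(a, b):
    # float64 represents every int32 value exactly, so the casts are
    # lossless as long as the exact product entries stay below 2**53
    # in magnitude and fit into int32.
    return np.dot(a.astype(np.float64), b.astype(np.float64)).astype(np.int32)

# Hypothetical usage with the question's shapes:
M = np.random.randint(-500, 500, size=(6000, 2000)).astype(np.int32)
P = int32_matmul_via_float64(M.T, M)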

Some supplemental information that I found through experimentation.

This can be circumvented. Timings are on an Intel CPU with Intel MKL for BLAS. I'm also using Fortran-ordered arrays to keep everything equivalent, as scipy.linalg.blas is the Fortran BLAS.

Let's take the following example:

import numpy as np
from scipy.linalg.blas import sgemm
from scipy.linalg.blas import dgemm

arr_int64 = np.random.randint(-500, 500, (6000, 2000))

arr_int32 = arr_int64.astype(np.int32)

arr_float64 = arr_int64.astype(np.float64) + np.random.rand(6000, 2000)

arr_float32 = arr_int64.astype(np.float32)

First, let's look at the np.dot calls:

%timeit np.dot(arr_float64.T,arr_float64) #400% CPU threaded BLAS
1 loops, best of 3: 969 ms per loop

%timeit np.dot(arr_float32.T,arr_float32) #400% CPU threaded BLAS
1 loops, best of 3: 513 ms per loop

%timeit np.dot(arr_int64.T,arr_int64)     #100% CPU?
1 loops, best of 3: 24.7 s per loop

%timeit np.dot(arr_int32.T,arr_int32)     #100% CPU?
1 loops, best of 3: 21.3 s per loop

Calling DGEMM/SGEMM directly:

%timeit dgemm(alpha=1, a=arr_float64, b=arr_float64, trans_a=True)
1 loops, best of 3: 1.13 s per loop

%timeit dgemm(alpha=1, a=arr_int64, b=arr_int64, trans_a=True)
1 loops, best of 3: 869 ms per loop

%timeit sgemm(alpha=1, a=arr_float32, b=arr_float32, trans_a=True)
1 loops, best of 3: 657 ms per loop

%timeit sgemm(alpha=1, a=arr_int32, b=arr_int32, trans_a=True)
1 loops, best of 3: 432 ms per loop

np.allclose( np.dot(arr_int32.T,arr_int32), sgemm(alpha=1, a=arr_int32, b=arr_int32, trans_a=True))
#True

It looks like something strange is going on in the np.dot call, with speed similar to a naive algorithm:

%timeit np.einsum('ij,jk',arr_int32.T,arr_int32)
1 loops, best of 3: 14.1 s per loop

%timeit np.einsum('ij,jk',arr_int64.T,arr_int64)
1 loops, best of 3: 26 s per loop
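
As a small sanity check on top of the timings above (my own sketch, with sizes shrunk so the plain integer np.dot finishes quickly): dgemm accepts the integer input, computes in float64, and the result can be cast back to the original dtype; with this value range the round trip is exact:

import numpy as np
from scipy.linalg.blas import dgemm

# Shrunk version of the arrays above so the reference np.dot is fast.
small = np.random.randint(-500, 500, (600, 200)).astype(np.int32)

via_blas = dgemm(alpha=1, a=small, b=small, trans_a=True).astype(np.int32)
direct = np.dot(small.T, small)

print(np.array_equal(via_blas, direct))   # True for this value range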

Similitude answered 11/9, 2013 at 16:2 Comment(2)
@Fishgig I would assume it is the usage of AVX with the Intel MKL. Also perhaps the same reason np.einsum is ~2x as fast in single precision. See here. – Similitude
Never mind, I overlooked that the sgemm times are in ms. It appeared to me as if sgemm was 500x slower. Duh. – Fishgig
