How to avoid the huge overhead of NumPy's single-threaded transpose?

I currently encounter a huge overhead because of NumPy's transpose. I found that this operation virtually always runs single-threaded, no matter how large the transposed matrix/array is. I need to avoid this large time cost.

To my understanding, other functions like np.dot or vector increments run in parallel if the NumPy array is large enough. Some element-wise operations seem to be better parallelized in the numexpr package, but numexpr probably cannot handle transpose.

I would like to learn what the better way to resolve this problem is. To state the problem in detail:

  • Sometimes NumPy runs transpose ultrafast (e.g. B = A.T) because the transposed tensor is not actually used in a calculation, or is discarded, so there is no need to physically transpose the data at that stage. Calling B[:] = A.T really does transpose the data (see the short snippet after this list).
  • I think a parallelized transpose function should be the resolution. The problem is how to implement it.
  • I hope the solution does not require packages other than NumPy; a ctypes binding is acceptable. I also hope the code is not too difficult to use or too complicated.
  • Tensor transpose would be a plus. Although techniques for transposing a matrix can also be applied to specific tensor-transpose problems, I think it could be difficult to write a universal API for tensor transpose. I actually also need to handle tensor transposes, but handling tensors would complicate this Stack Overflow question.
  • Also, is there any possibility that a parallelized transpose will be implemented in the future, or is there an existing plan? Then there would be no need to implement transpose myself ;)
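
For illustration, here is a minimal snippet (using only NumPy) showing the difference between the lazy view and the eager copy; np.shares_memory confirms that A.T does not move any data:

import numpy as np

A = np.random.random((9999, 10001))
B = np.empty((10001, 9999))

view = A.T                  # a strided view: no data is moved, essentially free
assert np.shares_memory(A, view)

B[:] = A.T                  # eager copy into the preallocated B: this is the slow part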

Thanks in advance for any suggestions!


Current workarounds

Handling a model transpose problem (the size of A is ~763 MB) on my personal computer running Linux, with 4 cores available (400% CPU in total).

import numpy as np

A = np.random.random((9999, 10001))
B = np.random.random((10001, 9999))
D = np.random.random((9999, 10001))
n1, n2 = A.shape  # 9999, 10001

The current workarounds do not seem to be effective enough. In general, one would expect roughly a 3x~4x speedup if the work were fully parallelized on a 4-core CPU, but the code I have written only gains about 1.5x.

Baseline: Naive Transpose (~255 ms)

B[:] = A.T   # ~255 ms, time for the transpose
D[:] = A[:]  # ~ 65 ms, time for copying the data

It is interesting that if A is a 10000 x 10000 square matrix, the transpose time increases to ~310 ms. I don't know what happens here. Even the time cost of the C/ctypes-bound function is affected (slower) when the matrix is square.

C/ctypes Binding Function (~145 ms)

With the following OpenMP/CBLAS code (basic, not optimized):

// transpose.c
#include <stddef.h>
#include <cblas.h>

// Write the transpose of the d1 x d2 row-major matrix A into B (d2 x d1).
// Each row of A is copied into a column of B with a strided cblas_dcopy;
// the row loop is parallelized with OpenMP.
void matrix_transpose(double* A, double* B, size_t d1, size_t d2) {
    size_t i;
    #pragma omp parallel for shared(A, B, d1, d2) private(i)
    for (i = 0; i < d1; ++i) {
        cblas_dcopy(d2, A + i*d2, 1, B + i, d1);
    }
}
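
The shared library can be built, for example, with gcc -O3 -fopenmp -fPIC -shared transpose.c -o transpose.so -lcblas (this assumes GCC and a CBLAS implementation such as OpenBLAS; the exact flags and library name may differ on other setups).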

then execute the Python code (4 OpenMP threads):

import ctypes
from numpy.ctypeslib import as_ctypes

matrix_transpose = np.ctypeslib.load_library("transpose.so", ".").matrix_transpose
matrix_transpose(
    as_ctypes(A), as_ctypes(B),
    ctypes.c_size_t(n1), ctypes.c_size_t(n2))  # ~145 ms

Using the C/ctypes binding can be somewhat cumbersome, since it is not pure Python and it also relies on CBLAS as an external package.

Multiprocessing (~165 ms)

from multiprocessing import Pool, sharedctypes
from numpy.ctypeslib import as_ctypes

nproc = 4
batch = n1 // nproc + 1  # batch size 2500 in this case
slices = [slice(i * batch, min(n1, (i+1) * batch)) for i in range(nproc)]

# Put B into shared memory so the worker processes can write into it;
# A is inherited by the workers via fork (Linux).
cB = as_ctypes(B)
pB = sharedctypes.RawArray(cB._type_, cB)

def slice_transpose(s):
    B = np.asarray(pB)
    B[:, s] = A[s].T

with Pool(nproc) as pool:
    pool.map(slice_transpose, slices)  # ~165 ms
B = np.asarray(pB)

I suspect that on large clusters (HPC), using more processes/threads does not necessarily give a speedup. So how to choose the number of processes/threads may also be a problem.
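
One simple heuristic (an assumption on my side, not a general rule) is to bound the worker count by the cores actually available to the process:

import os

# Number of logical CPUs visible to this process; on Linux, sched_getaffinity
# respects CPU affinity masks (e.g. set by a job scheduler).
nproc = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else os.cpu_count()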


Edit After Initial Question

This problem is related not only to parallelization, but also to cache-aware, tile-based transpose algorithms. Possible solutions could be:

  • Use Numba code with a tile-based algorithm (as stated in the answer from Jérôme Richard). Although it requires a package other than NumPy, it is almost pure Python.
  • Use an optimized BLAS library (such as MKL or cuBLAS, which implement their own matrix-transpose APIs) and link it to Python, instead of CBLAS or BLAS (see the ctypes sketch after this list). This requires preparing makefiles or a dynamically linked library if the program is to be distributed.
  • Use pyscf.lib.transpose (python link, c link) for parallel 3-index tensor transposes M.transpose(0,2,1). I am somewhat a fan of PySCF. Its main purpose is quantum or semi-empirical chemistry calculations, but some of its numerical utilities are probably optimized for this purpose. In my testing on a server, transposing a (1502, 601, 601) tensor was about twice as fast (4.09 s) as MKL's mkl_domatcopy (9.19 s).
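
For the MKL route, here is a rough ctypes sketch of how mkl_domatcopy might be called from Python. The helper name mkl_transpose and the libmkl_rt.so path are my own assumptions, and the argument order follows the documented mkl_?omatcopy interface; treat this as a starting point rather than tested code.

import ctypes
import numpy as np

mkl = ctypes.CDLL("libmkl_rt.so")  # assumes the single dynamic MKL runtime is installed

def mkl_transpose(A, B):
    """Write A.T into the preallocated C-contiguous array B via mkl_domatcopy."""
    n1, n2 = A.shape
    assert B.shape == (n2, n1)
    mkl.mkl_domatcopy(
        ctypes.c_char(b"R"),                 # row-major ordering
        ctypes.c_char(b"T"),                 # transpose operation
        ctypes.c_size_t(n1), ctypes.c_size_t(n2),
        ctypes.c_double(1.0),                # alpha (scaling factor)
        A.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), ctypes.c_size_t(n2),
        B.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), ctypes.c_size_t(n1))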


Karns answered 7/5, 2021 at 9:02. Comments (3):
A.T returns a view. The time-consuming operations require a copy. You can't address this issue without a clear understanding of the difference. – Blowgun
Yes, exactly. That's why I also gave a time for the operation D[:] = A[:] (~65 ms), compared to the transpose time of ~265 ms. The reason I did not use D = A.copy() to benchmark the copy time is that it would allocate a new memory buffer, which costs additional time. Also, B = A.T was tested and costs virtually no time, as expected. – Karns
Keep in mind that D[:] = A[:] can copy contiguous blocks on both the source and destination sides, while B[:] = A.T does not have a contiguous buffer on at least one end. – Odysseus

Computing transpositions efficiently is hard. This primitive is not compute-bound but memory-bound. This is especially true for big matrices stored in RAM (and not in CPU caches).

NumPy uses a view-based approach, which is great when only a slice of the array is needed but quite slow when the computation is done eagerly (e.g. when a copy is performed). The way NumPy is implemented results in a lot of cache misses in this case, strongly decreasing performance when a copy is performed.

I found that this operation virtually always runs single-threaded, no matter how large the transposed matrix/array is.

This clearly depends on the NumPy implementation used. AFAIK, some optimized packages like Intel's are more efficient and take advantage of multithreading more often.

I think a parallelized transpose function should be the resolution. The problem is how to implement it.

Yes and no. Using more threads is not necessarily faster, at least not much faster, and not on all platforms. The algorithm used is far more important than using multiple threads.

On modern desktop x86-64 processors, each core can be limited by its cache hierarchy, but this limit is quite high. Thus, two cores are often enough to nearly saturate the RAM throughput. For example, on my (4-core) machine, a sequential copy can reach 20.4 GiB/s (NumPy manages to reach this limit), while my (practical) memory throughput is close to 35 GiB/s. Copying A takes 72 ms, while the naive NumPy transposition of A takes 700 ms. Even using all my cores, a parallel implementation of the naive algorithm would not be faster than 175 ms, while the optimal theoretical time is 42 ms. Actually, a naive parallel implementation would be much slower than 175 ms because of cache misses and the saturation of my L3 cache.
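
To make the 42 ms figure concrete: the ~763 MB matrix has to be read once and written once, i.e. roughly 1.5 GiB of memory traffic, and 1.5 GiB at ~35 GiB/s takes about 42 ms (the same reasoning with the 20.4 GiB/s single-core copy bandwidth gives the ~72 ms copy time).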

Naive transposition implementations do not read/write data contiguously. The memory access pattern is strided and most of each cache line is wasted. Because of this, data are read/written multiple times from memory for huge matrices (typically 8 times on current x86-64 platforms using double precision). A tile-based transposition algorithm is an efficient way to prevent this issue; it also strongly reduces cache misses. Ideally, hierarchical tiles or a cache-oblivious Z-tiling pattern should be used, although this is hard to implement.


Here is a Numba-based implementation based on the information above:

import numba as nb

@nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
def transpose(mat, out):
    blockSize, tileSize = 256, 32  # To be tuned
    n, m = mat.shape
    assert blockSize % tileSize == 0
    # Walk the matrix block by block (parallel over blocks of columns of mat),
    # then transpose each block tile by tile to stay cache-friendly.
    for tmp in nb.prange((m+blockSize-1)//blockSize):
        i = tmp * blockSize
        for j in range(0, n, blockSize):
            tiMin, tiMax = i, min(i+blockSize, m)
            tjMin, tjMax = j, min(j+blockSize, n)
            for ti in range(tiMin, tiMax, tileSize):
                for tj in range(tjMin, tjMax, tileSize):
                    out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T
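
A minimal way to call it (assuming arrays shaped as in the question; out must be preallocated with the transposed shape):

import numpy as np

A = np.random.random((9999, 10001))
B = np.empty((10001, 9999))

transpose(A, B)               # writes A.T into B, in parallel
assert np.array_equal(B, A.T)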

If you want faster code, you can use heavily optimized native libraries that implement transposition, like the Intel MKL. Such libraries often take advantage of low-level processor-specific instructions (SIMD instructions and non-temporal stores) to use the caches/RAM more efficiently.

Here are the timing results (assuming the output matrix is already filled in memory):

Naive Numpy:                           700 ms
Above code without Numba (sequential): 287 ms
Numba version (sequential):            157 ms
Numba version (parallel):              104 ms
Very-optimized C code (parallel):       68 ms
Theoretical optimum:                    42 ms
Gertiegertrud answered 8/5, 2021 at 1:31. Comments (3):
Thanks! I'm actually not from a CS background (I'm a chemistry student), so it's a little hard to follow, but that really helps! The concept of a cache-aware, tile-based algorithm is really tricky for me to absorb (qaq). After some additional searching, I found that MKL implements an out-of-place transpose via mkl_?omatcopy, and its speed is equal to or a little faster than the Numba code. So it appears to me that the Numba code is concise and efficient. – Karns
I understand that this can be a bit complex without a CS background. I think the main problem initially comes from NumPy itself. To me, NumPy should expose a new parameter in the transpose function to choose whether the output should be a view or not (lazy vs. eager computation). Indeed, the current view-based implementation is not always inefficient (and benefits from using less memory), but I do not see how the NumPy developers can solve this problem with views alone. This is why I think the current interface should be modified. I hope the developers will extend it in the future. – Victuals
Regarding the dependency, one solution would be to combine multiprocessing with the above code without Numba, but I am not sure this will be much better, since multiprocessing may introduce non-negligible overhead and makes the code more complex / less clear. Alternatively, you can remove the Numba decorator (and replace nb.prange with range) and still get a speedup compared to the naive NumPy code (I added a new line for this in the benchmark), although this is not ideal. Another solution could be to replace Numba with Cython if that is fine for you. – Victuals
