How to avoid the huge overhead of NumPy's single-threaded transpose?

I currently encounter a huge overhead because of NumPy's transpose. I found that this operation virtually always runs single-threaded, no matter how large the transposed matrix/array is. I need to avoid this large time cost.

To my understanding, other functions like np.dot or vector increments run in parallel if the NumPy array is large enough. Some element-wise operations seem to be better parallelized in the numexpr package, but numexpr probably cannot handle transpose.

I would like to learn what the better way to resolve this problem is. To state the problem in detail:

  • Sometimes NumPy runs transpose ultrafast (e.g. B = A.T) because the transposed tensor is not actually used in a calculation, or is discarded, so there is no need to physically transpose the data at that stage. Calling B[:] = A.T really does transpose the data (see the short snippet after this list).
  • I think a parallelized transpose function should be the resolution. The problem is how to implement it.
  • I hope the solution does not require packages other than NumPy; a ctypes binding is acceptable. I also hope the code is not too difficult to use or too complicated.
  • Tensor transpose would be a plus. Although techniques for transposing a matrix can also be applied to specific tensor-transpose problems, I think it could be difficult to write a universal API for tensor transpose. I actually also need to handle tensor transposes, but handling tensors would complicate this Stack Overflow question.
  • Also, is there any possibility that a parallelized transpose will be implemented in the future, or is there an existing plan? Then there would be no need to implement transpose myself ;)
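
For illustration, here is a minimal snippet (using only NumPy) showing the difference between the lazy view and the eager copy; np.shares_memory confirms that A.T does not move any data:

import numpy as np

A = np.random.random((9999, 10001))
B = np.empty((10001, 9999))

view = A.T                  # a strided view: no data is moved, essentially free
assert np.shares_memory(A, view)

B[:] = A.T                  # eager copy into the preallocated B: this is the slow part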

Thanks in advance for any suggestions!


Current workarounds

Handling a model transpose problem (the size of A is ~763 MB) on my personal computer running Linux, with 4 cores available (400% CPU in total).

import numpy as np

A = np.random.random((9999, 10001))
B = np.random.random((10001, 9999))
D = np.random.random((9999, 10001))
n1, n2 = A.shape  # 9999, 10001

The current workarounds do not seem to be effective enough. In general, one would expect roughly a 3x~4x speedup if the work were fully parallelized on a 4-core CPU, but the code I have written only gains about 1.5x.

Baseline: Naive Transpose (~255 ms)

B[:] = A.T   # ~255 ms, time for the transpose
D[:] = A[:]  # ~ 65 ms, time for copying the data

It is interesting that if A is a 10000 x 10000 square matrix, the transpose time increases to ~310 ms. I don't know what happens here. Even the time cost of the C/ctypes-bound function is affected (slower) when the matrix is square.

C/ctypes Binding Function (~145 ms)

With the following OpenMP/CBLAS code (basic, not optimized):

// transpose.c
#include <stddef.h>
#include <cblas.h>

// Write the transpose of the d1 x d2 row-major matrix A into B (d2 x d1).
// Each row of A is copied into a column of B with a strided cblas_dcopy;
// the row loop is parallelized with OpenMP.
void matrix_transpose(double* A, double* B, size_t d1, size_t d2) {
    size_t i;
    #pragma omp parallel for shared(A, B, d1, d2) private(i)
    for (i = 0; i < d1; ++i) {
        cblas_dcopy(d2, A + i*d2, 1, B + i, d1);
    }
}
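
The shared library can be built, for example, with gcc -O3 -fopenmp -fPIC -shared transpose.c -o transpose.so -lcblas (this assumes GCC and a CBLAS implementation such as OpenBLAS; the exact flags and library name may differ on other setups).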

then execute the Python code (4 OpenMP threads):

import ctypes
from numpy.ctypeslib import as_ctypes

matrix_transpose = np.ctypeslib.load_library("transpose.so", ".").matrix_transpose
matrix_transpose(
    as_ctypes(A), as_ctypes(B),
    ctypes.c_size_t(n1), ctypes.c_size_t(n2))  # ~145 ms

Using the C/ctypes binding can be somewhat cumbersome, since it is not pure Python and it also relies on CBLAS as an external package.

Multiprocessing (~165 ms)

from multiprocessing import Pool, sharedctypes
from numpy.ctypeslib import as_ctypes

nproc = 4
batch = n1 // nproc + 1  # batch size 2500 in this case
slices = [slice(i * batch, min(n1, (i+1) * batch)) for i in range(nproc)]

# Put B into shared memory so the worker processes can write into it;
# A is inherited by the workers via fork (Linux).
cB = as_ctypes(B)
pB = sharedctypes.RawArray(cB._type_, cB)

def slice_transpose(s):
    B = np.asarray(pB)
    B[:, s] = A[s].T

with Pool(nproc) as pool:
    pool.map(slice_transpose, slices)  # ~165 ms
B = np.asarray(pB)

I suspect that on large clusters (HPC), using more processes/threads does not necessarily give a speedup. So how to choose the number of processes/threads may also be a problem.
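
One simple heuristic (an assumption on my side, not a general rule) is to bound the worker count by the cores actually available to the process:

import os

# Number of logical CPUs visible to this process; on Linux, sched_getaffinity
# respects CPU affinity masks (e.g. set by a job scheduler).
nproc = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else os.cpu_count()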


Edit After Initial Question

This problem is related not only to parallelization, but also to cache-aware, tile-based transpose algorithms. Possible solutions could be:

  • Use Numba code with a tile-based algorithm (as stated in the answer from Jérôme Richard). Although it requires a package other than NumPy, it is almost pure Python.
  • Use an optimized BLAS library (such as MKL or cuBLAS, which implement their own matrix-transpose APIs) and link it to Python, instead of CBLAS or BLAS (see the ctypes sketch after this list). This requires preparing makefiles or a dynamically linked library if the program is to be distributed.
  • Use pyscf.lib.transpose (python link, c link) for parallel 3-index tensor transposes M.transpose(0,2,1). I am somewhat a fan of PySCF. Its main purpose is quantum or semi-empirical chemistry calculations, but some of its numerical utilities are probably optimized for this purpose. In my testing on a server, transposing a (1502, 601, 601) tensor was about twice as fast (4.09 s) as MKL's mkl_domatcopy (9.19 s).
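
For the MKL route, here is a rough ctypes sketch of how mkl_domatcopy might be called from Python. The helper name mkl_transpose and the libmkl_rt.so path are my own assumptions, and the argument order follows the documented mkl_?omatcopy interface; treat this as a starting point rather than tested code.

import ctypes
import numpy as np

mkl = ctypes.CDLL("libmkl_rt.so")  # assumes the single dynamic MKL runtime is installed

def mkl_transpose(A, B):
    """Write A.T into the preallocated C-contiguous array B via mkl_domatcopy."""
    n1, n2 = A.shape
    assert B.shape == (n2, n1)
    mkl.mkl_domatcopy(
        ctypes.c_char(b"R"),                 # row-major ordering
        ctypes.c_char(b"T"),                 # transpose operation
        ctypes.c_size_t(n1), ctypes.c_size_t(n2),
        ctypes.c_double(1.0),                # alpha (scaling factor)
        A.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), ctypes.c_size_t(n2),
        B.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), ctypes.c_size_t(n1))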


Karns answered 7/5, 2021 at 9:02. Comments (3):
A.T returns a view. The time-consuming operations require a copy. You can't address this issue without a clear understanding of the difference. – Blowgun
Yes, exactly. That's why I also gave a time for the operation D[:] = A[:] (~65 ms), compared to the transpose time of ~265 ms. The reason I did not use D = A.copy() to benchmark the copy time is that it would allocate a new memory buffer, which costs additional time. Also, B = A.T was tested and costs virtually no time, as expected. – Karns
Keep in mind that D[:] = A[:] can copy contiguous blocks on both the source and destination sides, while B[:] = A.T does not have a contiguous buffer on at least one end. – Odysseus

Computing transpositions efficiently is hard. This primitive is not compute-bound but memory-bound. This is especially true for big matrices stored in RAM (and not in CPU caches).

NumPy uses a view-based approach, which is great when only a slice of the array is needed but quite slow when the computation is done eagerly (e.g. when a copy is performed). The way NumPy is implemented results in a lot of cache misses in this case, strongly decreasing performance when a copy is performed.

I found that this operation virtually always runs single-threaded, no matter how large the transposed matrix/array is.

This clearly depends on the NumPy implementation used. AFAIK, some optimized packages like Intel's are more efficient and take advantage of multithreading more often.

I think a parallelized transpose function should be the resolution. The problem is how to implement it.

Yes and no. Using more threads is not necessarily faster, at least not much faster, and not on all platforms. The algorithm used is far more important than using multiple threads.

On modern desktop x86-64 processors, each core can be limited by its cache hierarchy, but this limit is quite high. Thus, two cores are often enough to nearly saturate the RAM throughput. For example, on my (4-core) machine, a sequential copy can reach 20.4 GiB/s (NumPy manages to reach this limit), while my (practical) memory throughput is close to 35 GiB/s. Copying A takes 72 ms, while the naive NumPy transposition of A takes 700 ms. Even using all my cores, a parallel implementation of the naive algorithm would not be faster than 175 ms, while the optimal theoretical time is 42 ms. Actually, a naive parallel implementation would be much slower than 175 ms because of cache misses and the saturation of my L3 cache.
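
To make the 42 ms figure concrete: the ~763 MB matrix has to be read once and written once, i.e. roughly 1.5 GiB of memory traffic, and 1.5 GiB at ~35 GiB/s takes about 42 ms (the same reasoning with the 20.4 GiB/s single-core copy bandwidth gives the ~72 ms copy time).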

Naive transposition implementations do not read/write data contiguously. The memory access pattern is strided and most of each cache line is wasted. Because of this, data are read/written multiple times from memory for huge matrices (typically 8 times on current x86-64 platforms using double precision). A tile-based transposition algorithm is an efficient way to prevent this issue; it also strongly reduces cache misses. Ideally, hierarchical tiles or a cache-oblivious Z-tiling pattern should be used, although this is hard to implement.


Here is a Numba-based implementation based on the information above:

import numba as nb

@nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
def transpose(mat, out):
    blockSize, tileSize = 256, 32  # To be tuned
    n, m = mat.shape
    assert blockSize % tileSize == 0
    # Walk the matrix block by block (parallel over blocks of columns of mat),
    # then transpose each block tile by tile to stay cache-friendly.
    for tmp in nb.prange((m+blockSize-1)//blockSize):
        i = tmp * blockSize
        for j in range(0, n, blockSize):
            tiMin, tiMax = i, min(i+blockSize, m)
            tjMin, tjMax = j, min(j+blockSize, n)
            for ti in range(tiMin, tiMax, tileSize):
                for tj in range(tjMin, tjMax, tileSize):
                    out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T
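
A minimal way to call it (assuming arrays shaped as in the question; out must be preallocated with the transposed shape):

import numpy as np

A = np.random.random((9999, 10001))
B = np.empty((10001, 9999))

transpose(A, B)               # writes A.T into B, in parallel
assert np.array_equal(B, A.T)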

If you want faster code, you can use heavily optimized native libraries that implement transposition, like the Intel MKL. Such libraries often take advantage of low-level processor-specific instructions (SIMD instructions and non-temporal stores) to use the caches/RAM more efficiently.

Here are the timing results (assuming the output matrix is already filled in memory):

Naive Numpy:                           700 ms
Above code without Numba (sequential): 287 ms
Numba version (sequential):            157 ms
Numba version (parallel):              104 ms
Very-optimized C code (parallel):       68 ms
Theoretical optimum:                    42 ms
Gertiegertrud answered 8/5, 2021 at 1:31. Comments (3):
Thanks! I'm actually not from a CS background (I'm a chemistry student), so it's a little hard to follow, but that really helps! The concept of a cache-aware, tile-based algorithm is really tricky for me to absorb (qaq). After some additional searching, I found that MKL implements an out-of-place transpose via mkl_?omatcopy, and its speed is equal to or a little faster than the Numba code. So it appears to me that the Numba code is concise and efficient. – Karns
I understand that this can be a bit complex without a CS background. I think the main problem initially comes from NumPy itself. To me, NumPy should expose a new parameter in the transpose function to choose whether the output should be a view or not (lazy vs. eager computation). Indeed, the current view-based implementation is not always inefficient (and benefits from using less memory), but I do not see how the NumPy developers can solve this problem with views alone. This is why I think the current interface should be modified. I hope the developers will extend it in the future. – Victuals
Regarding the dependency, one solution would be to combine multiprocessing with the above code without Numba, but I am not sure this will be much better, since multiprocessing may introduce non-negligible overhead and makes the code more complex / less clear. Alternatively, you can remove the Numba decorator (and replace nb.prange with range) and still get a speedup compared to the naive NumPy code (I added a new line for this in the benchmark), although this is not ideal. Another solution could be to replace Numba with Cython if that is fine for you. – Victuals
