I currently encounter huge overhead because of NumPy's transpose operation. As far as I can tell, it virtually always runs single-threaded, no matter how large the transposed matrix/array is. I need to avoid this time cost.
To my understanding, other operations like `np.dot` or vector increments run in parallel if the NumPy array is large enough. Some element-wise operations seem to be better parallelized by the `numexpr` package, but `numexpr` probably can't handle a transpose.
I wish to learn what the better way to resolve this problem is. To state the problem in detail:
- Sometimes NumPy runs a transpose ultrafast (like `B = A.T`) because the transposed tensor is not actually used in a calculation, or is simply dropped, so there is no need to really move the data at this stage. Calling `B[:] = A.T` does perform the actual transpose of the data (see the small check after this list). - I think a parallelized transpose function would be a resolution. The problem is how to implement it.
- I hope the solution does not require packages other than NumPy; a ctypes binding is acceptable. I also hope the code is not too difficult to use or too complicated.
- Tensor transpose is a plus. Although techniques for transposing a matrix could also be applied to specific tensor transpose problems, I think it could be difficult to write a universal API for tensor transpose. I actually need to handle tensor transposes as well, but handling tensors could over-complicate this Stack Overflow question.
- Also, is there any possibility that a parallelized transpose will be implemented in NumPy in the future, or does such a plan already exist? Then there would be no need to implement the transpose myself ;)
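To illustrate the view-versus-copy distinction above, a minimal check with a small throwaway array:

```python
import numpy as np

A = np.random.random((5000, 6000))

B = A.T              # essentially free: only shape/stride metadata change
print(B.base is A)   # True -> B is a view, no data has been moved yet

C = np.empty_like(B)
C[:] = A.T           # this assignment forces the actual strided copy,
                     # which is the slow, single-threaded part
```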
Thanks in advance for any suggestions!
Current workarounds
Handling a model transpose problem (the size of `A` is ~763 MB) on my personal computer running Linux, with 4 cores available (400% CPU in total).
import numpy as np

n1, n2 = 9999, 10001               # matrix dimensions used throughout
A = np.random.random((n1, n2))
B = np.random.random((n2, n1))
D = np.random.random((n1, n2))
The current workarounds do not seem effective enough. Generally, one would expect about a 3x~4x speedup if fully parallelized on a 4-core CPU, but the code I've written only gains about 1.5x.
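The timings quoted below are wall-clock times. A minimal sketch of how numbers like these can be reproduced, using the arrays defined above (the `timed` helper is just an illustrative harness; IPython's `%timeit` works equally well):

```python
import time

def timed(label, fn, repeat=5):
    # crude best-of-N wall-clock timer
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    print(f"{label}: {best * 1e3:.1f} ms")

def transpose_copy():
    B[:] = A.T

def plain_copy():
    D[:] = A[:]

timed("B[:] = A.T", transpose_copy)
timed("D[:] = A[:]", plain_copy)
```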
Baseline: Naive Transpose (~255 ms)
B[:] = A.T # ~255 ms, time for transpose
D[:] = A[:] # ~65 ms, time for copying data
It's interesting to find that if `A` is a 10000 x 10000 square matrix, the transpose time increases to ~310 ms. I don't know what happens here. Even the time cost of the C/ctypes binding function is affected (slower) if the matrix is square.
C/ctypes Binding Function (~145 ms)
With the following basic (not optimized) OpenMP/CBLAS code:
// transpose.c
#include <stddef.h>
#include <cblas.h>

// Copy each row of A (d1 x d2, row-major) into the corresponding column
// of B (d2 x d1); rows are distributed over OpenMP threads.
void matrix_transpose(double* A, double* B, size_t d1, size_t d2) {
    size_t i;
    #pragma omp parallel for shared(A, B, d1, d2) private(i)
    for (i = 0; i < d1; ++i) {
        cblas_dcopy(d2, A + i*d2, 1, B + i, d1);
    }
}
then execute the following Python code (4 OpenMP threads):
import ctypes
from numpy.ctypeslib import as_ctypes

matrix_transpose = np.ctypeslib.load_library("transpose.so", ".").matrix_transpose
matrix_transpose(
    as_ctypes(A), as_ctypes(B),
    ctypes.c_size_t(n1), ctypes.c_size_t(n2))  # ~145 ms
It is somewhat cumbersome to use a C/ctypes binding, since it is not pure Python and requires CBLAS as an external dependency as well.
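For completeness, a sketch of the glue around the binding, using the `A` and `B` defined above: how the shared library might be compiled and how the signature can be declared so that ctypes checks the arguments. The compile/link flags and `-lcblas` are assumptions; adapt them to your compiler and BLAS installation.

```python
# build step (run in a shell; exact flags depend on the toolchain and BLAS):
#   gcc -O2 -fopenmp -shared -fPIC transpose.c -o transpose.so -lcblas

import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

lib = np.ctypeslib.load_library("transpose.so", ".")
matrix_transpose = lib.matrix_transpose
matrix_transpose.restype = None
matrix_transpose.argtypes = [
    ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # A (input)
    ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # B (output)
    ctypes.c_size_t,                                     # d1 = rows of A
    ctypes.c_size_t,                                     # d2 = cols of A
]

# with argtypes declared, the arrays and plain ints can be passed directly
matrix_transpose(A, B, A.shape[0], A.shape[1])
```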
Multiprocessing (~165 ms)
from multiprocessing import Pool, sharedctypes
from numpy.ctypeslib import as_ctypes

nproc = 4
batch = n1 // nproc + 1                       # batch of 2500 rows in this case
slices = [slice(i * batch, min(n1, (i+1) * batch)) for i in range(nproc)]

# expose B through a shared ctypes buffer so worker processes write into it
cB = as_ctypes(B)
pB = sharedctypes.RawArray(cB._type_, cB)

def slice_transpose(s):
    B = np.asarray(pB)
    B[:, s] = A[s].T

with Pool(nproc) as pool:
    pool.map(slice_transpose, slices)         # ~165 ms

B = np.asarray(pB)
I guess that on large machines (HPC nodes), more processes/threads do not necessarily gain more speedup, so how to choose the number of processes/threads may also be a problem.
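One possible convention for that choice (just a sketch; reusing `OMP_NUM_THREADS` as the worker-count hint is one option among many):

```python
import os

# start from one worker per core reported by the OS; on shared HPC nodes it is
# often better to respect an allocation hint such as OMP_NUM_THREADS instead
nproc = int(os.environ.get("OMP_NUM_THREADS", os.cpu_count() or 1))
```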
Edit After Initial Question
This problem is related not only to parallelization but also to cache-aware, tile-based transpose algorithms. Possible solutions could be:
- Use Numba code with a tile-based algorithm (stated in the answer from Jérôme Richard); a small sketch is given after this list. Though it requires a package other than NumPy, it's almost pure Python.
- Use optimized BLAS libraries (such as MKL or cuBLAS, which implement their own matrix transpose APIs) and link them to Python, instead of CBLAS or the reference BLAS. Makefiles or a dynamically linked library need to be prepared if this program is to be distributed.
- Use `pyscf.lib.transpose` (python link, c link) for the parallel 3-index tensor transpose `M.transpose(0,2,1)`. I'm somewhat a fan of PySCF. Its main purpose is quantum or semi-empirical chemistry calculation, but some of its numerical utilities are probably optimized for this purpose. In my testing on a server, transposing a (1502, 601, 601) tensor was about twice as fast (4.09 sec) as MKL's `mkl_domatcopy` (9.19 sec).
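For the first option, a rough sketch of what a tiled, parallel Numba transpose could look like (the block size of 64 and the function name are illustrative choices, not taken from the referenced answer):

```python
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def transpose_tiled(A, B, blk=64):
    # process the matrix in blk x blk tiles so reads and writes stay
    # cache-friendly; row-blocks are distributed over threads via prange
    n1, n2 = A.shape
    for ib in nb.prange((n1 + blk - 1) // blk):
        i0 = ib * blk
        i1 = min(i0 + blk, n1)
        for j0 in range(0, n2, blk):
            j1 = min(j0 + blk, n2)
            for i in range(i0, i1):
                for j in range(j0, j1):
                    B[j, i] = A[i, j]

A = np.random.random((9999, 10001))
B = np.empty((10001, 9999))
transpose_tiled(A, B)
```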
Related algorithm article:
Related stackoverflow and github issue pages:
`A.T` returns a view. The time-consuming operations require a copy. You can't address this issue without a clear understanding of the difference. – Blowgun
`D[:] = A[:]` takes ~65 ms, compared to the transpose time of ~265 ms. The reason not to use `D = A.copy()` to benchmark the copy time is that it would allocate a new memory buffer, which costs additional time. Also, `B = A.T` was tested and costs virtually no time, as expected. – Karns
`D[:] = A[:]` can copy contiguous blocks at both the source and the destination, while `B[:] = A.T` does not have a contiguous buffer at one end. – Odysseus