A Cache Efficient Matrix Transpose Program?

So the obvious way to transpose a matrix is to use:

  for( int i = 0; i < n; i++ )
    for( int j = 0; j < n; j++ )
      destination[j+i*n] = source[i+j*n];

but I want something that will take advantage of locality, using cache blocking. I've looked around and can't find code that does this, but I'm told it should be a very simple modification to the original. Any ideas?

Edit: I have a 2000x2000 matrix, and I want to know how I can change the code using two more for loops, basically splitting the matrix into blocks that I transpose individually (say 2x2 blocks, or 40x40 blocks), and see which block size is most efficient.

Edit2: The matrices are stored in column major order, that is to say for a matrix

a1 a2    
a3 a4

is stored as a1 a3 a2 a4.
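
A minimal sketch of the block-size experiment described in the first edit (the blocked transpose is a straightforward four-loop blocking of the code above; the block sizes and timing approach are illustrative):

#include <chrono>
#include <cstdio>
#include <vector>

// blocked transpose of an n x n column-major matrix; the bounds are clamped
// so any block size works, even one that does not divide n
void transpose_blocked(double *dst, const double *src, int n, int bs) {
    for (int i = 0; i < n; i += bs)
        for (int j = 0; j < n; j += bs)
            for (int k = i; k < i + bs && k < n; ++k)
                for (int l = j; l < j + bs && l < n; ++l)
                    dst[k + l*n] = src[l + k*n];
}

int main() {
    const int n = 2000;
    std::vector<double> src(n * n, 1.0), dst(n * n);
    for (int bs : {2, 4, 8, 16, 40, 100}) {
        auto t0 = std::chrono::steady_clock::now();
        transpose_blocked(dst.data(), src.data(), n, bs);
        auto t1 = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("block size %3d: %lld us\n", bs, (long long)us);
    }
}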

Birl answered 4/3, 2011 at 23:17 Comment(1)
Beware: a really hot modern compiler might well sort this out for you with powerful optimisations that completely redesign the code. Check out the keyword restrict. Take a look at all the optimisation switches; for gcc, start with -O3 or -O2, whichever is fastest, and also use switches that enable your modern CPU's full instruction set, e.g. for gcc -march=haswell. Don't use crap compilers. GCC, the Intel ones, and LLVM/Clang are all good.Hellen

You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:

for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}

An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment with tuning any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.

Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
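
A compact sketch of that recursion, using the question's column-major indexing (the base-case cutoff of 16 is an arbitrary small tile; a fuller Java implementation appears in another answer below):

// transpose src (n x n, column-major) into dst by recursively halving the
// larger dimension until a tile is small enough to copy directly
void cotranspose(double *dst, const double *src, int n,
                 int rb, int re, int cb, int ce) {
    if (re - rb <= 16 && ce - cb <= 16) {
        for (int i = rb; i < re; ++i)
            for (int j = cb; j < ce; ++j)
                dst[j + i*n] = src[i + j*n];
    } else if (re - rb >= ce - cb) {
        int rm = (rb + re) / 2;
        cotranspose(dst, src, n, rb, rm, cb, ce);
        cotranspose(dst, src, n, rm, re, cb, ce);
    } else {
        int cm = (cb + ce) / 2;
        cotranspose(dst, src, n, rb, re, cb, cm);
        cotranspose(dst, src, n, rb, re, cm, ce);
    }
}
// usage: cotranspose(dst, src, n, 0, n, 0, n);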

Trappist answered 4/3, 2011 at 23:41 Comment(12)
So if we have a 7 by 7 matrix with a blocksize of 2, will this code break, edge-case wise?Birl
@user635832: yes it will, because the bounds on k and l are i+2 and j+2, which for the blocks all around the right-bottom edges will overrun the actual matrix by a margin of 1 because i or j will be 6. Simplest fix is to bound k by k < i+blocksize && k < n, and the same for l. Unless I've missed something or messed it up - I haven't tested this code at all.Trappist
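
In code, the fix from the comment above is just the two inner loops with clamped bounds (as untested as the original):

// the block edges now stop at n as well as at i/j + blocksize
for (int k = i; k < i + blocksize && k < n; ++k) {
    for (int l = j; l < j + blocksize && l < n; ++l) {
        dst[k + l*n] = src[l + k*n];
    }
}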
It's weird, I was testing this code, and it seems to work, but it actually runs slower with the cache blocking than without. Maybe 2000x2000 matrices are too small to show the gain?Birl
@user63582: You don't say the element type of the matrix, but at 16MB a good proportion of the matrix fits in L3 cache on a typical Intel-based machine, so yes, it could be too small. Or maybe I've got it wrong, and this block-wise transposition isn't that great for cache. I think it should be, with a blocksize of around 64 bytes squared, but I may be confused, or the compiler may do a good enough job with the original code and memory pre-fetching. Are you at max optimization, and do you have a profiler that counts cache misses, to prove whether it's actually cache-efficient or not?Trappist
@Steve It seems that once I get to an 8000x8000 matrix, the naive solution becomes slower than the cache-efficient one. Edit: I don't have a profiler yet. I might be able to set one up.Birl
Oh, and if this is implemented as a function, have you marked the pointer parameters src and dst as restrict? If you have then at least in principle, the compiler can re-order the copies for itself in ways that it can't do when there's a possibility that src and dst overlap. Chances are, it knows more about your hardware than you do.Trappist
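
For instance, using the __restrict__ spelling that GCC and Clang accept in C++ (standard C99 spells it restrict); a sketch of the declaration only:

// the qualifier promises the compiler that dst and src never alias, which
// legalizes reordering and vectorizing the copies
void transpose_blocked(double *__restrict__ dst,
                       const double *__restrict__ src, int n);

// compile with full optimization and the native instruction set, e.g.
//   g++ -O3 -march=native transpose.cpp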
Correction: I think there was some problem compiling. The solution you provided is 4 times faster :)Birl
@SteveJessop, I got a big speed up using your suggestion along with SSE. I posted my results here (see the end of my answer) stackoverflow.com/questions/16737298/… Thank You!Othello
Why is this faster than a naive transpose? From what I've learnt about caches (in the last 3 days, so please be nice), the destination write should be happening in the cache (in the naive algorithm), whereas the read access cannot be in the cache (since we're accessing it in steps of n, unless n < blocksize). Whether we're in the optimized case or the naive case, we're stepping through only one of the two matrices in offsets of 1 (which will use the cache due to spatial locality); the other matrix in both cases will be accessed in steps of n, which will result in a cache miss every time. Right?Terramycin
@user2635088: hard to explain this without waving my arms around, but the trick is that on whichever side you're making steps of n, you want to provoke m cache misses, where m*n fits in cache. Then, you go back to where you started, shuffle over one place, and do another m steps of n, which now will not provoke cache misses on that side, because they're all adjacent to the stuff you just cached. So the goal is to get m small enough. You can do this by manually choosing it, or you can use the cache-oblivious algorithm which ensures that most of the work is done using small m.Trappist
@Steve I think the input matrix also has to have the right dimensions; if not, then your blocks might actually end up in different cache lines.Regalia
Can you please explain how this works with the dst matrix? Because you are writing in column order, the reading of the blocks is cache friendly but writing to the dst matrix isn't... yet it is still a lot faster than the naive solution.Regalia

I had the exact same problem yesterday. I ended up with this solution:

#include <cstddef>

void transpose(double *dst, const double *src, size_t n, size_t p) noexcept {
    const size_t block = 32;
    for (size_t i = 0; i < n; i += block) {
        for (size_t j = 0; j < p; ++j) {
            // the innermost loop writes dst sequentially while reading one
            // element from each of the `block` source rows currently cached
            for (size_t b = 0; b < block && i + b < n; ++b) {
                dst[j*n + i + b] = src[(i + b)*p + j];
            }
        }
    }
}

This is 4 times faster than the obvious solution on my machine.

This solution also handles rectangular matrices whose dimensions are not a multiple of the block size.
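
For example, a call for a 3x5 row-major source (sizes illustrative):

double src[3 * 5], dst[5 * 3];
for (int k = 0; k < 15; ++k) src[k] = k;
transpose(dst, src, 3, 5);   // src has n = 3 rows and p = 5 columns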

If dst and src are the same square matrix, an in-place function should really be used instead:

void transpose(double *m, size_t n) noexcept {
    const size_t size = 8;
    size_t block = 0;
    for (block = 0; block + size - 1 < n; block += size) {
        // swap the upper triangle of the diagonal block with its mirror
        for (size_t i = block; i < block + size; ++i)
            for (size_t j = i + 1; j < block + size; ++j)
                std::swap(m[i*n + j], m[j*n + i]);
        // swap the strip below the diagonal block with the strip to its right
        for (size_t i = block + size; i < n; ++i)
            for (size_t j = block; j < block + size; ++j)
                std::swap(m[i*n + j], m[j*n + i]);
    }
    // finish the bottom-right corner left over when n is not a multiple of size
    for (size_t i = block; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            std::swap(m[i*n + j], m[j*n + i]);
}

I used C++11, but this could easily be translated into other languages.

Shayna answered 4/2, 2014 at 9:9 Comment(3)
In the first function where do you use "p"?Cestode
Should this line "dst[j*n + i + b] = src[(i + b)*n + j];" remain unchanged?Cestode
@Cestode You were absolutely correct again. I fixed the line.Shayna

Instead of transposing the matrix in memory, why not collapse the transposition operation into the next operation you're going to do on the matrix?
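
For example, if the next operation is a matrix-vector product, the transpose folds into it by swapping the indices at the read (a sketch, using the question's column-major layout):

// y = A^T x without ever materializing A^T; a is n x n, column-major
void multiply_transposed(double *y, const double *a, const double *x, int n) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += a[j + i*n] * x[j];   // a[j + i*n] is A(j,i), i.e. A^T(i,j)
        y[i] = sum;
    }
}

Note that the reads a[j + i*n] are sequential in the inner loop, so the fused version is cache friendly with no blocking at all.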

Orgy answered 5/3, 2011 at 1:17 Comment(3)
Definitely worth considering. One nice way is to make an object that presents a "view" over the original matrix, like a view in a database.Heterologous
More than simply worth considering, @j_random_hacker. A "view" object in the short run decouples representation from intent, which is generally a good design strategy. It also lets you build towards letting the computer figure out good algorithms for things, in a manner similar to what database query optimizers do.Galliot
This is good advice. A very minor performance warning: implementing views might well end up introducing calls through function pointers rather than calls to known fixed function addresses. Calling through unknown variable pointers can be very slow on some modern machines.Hellen

Steve Jessop mentioned a cache-oblivious matrix transpose algorithm. For the record, I want to share a possible implementation of a cache-oblivious matrix transpose.

public class Matrix {
    protected double data[];
    protected int rows, columns;

    public Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
    }

    public Matrix transpose() {
        Matrix C = new Matrix(columns, rows);
        cachetranspose(0, rows, 0, columns, C);
        return C;
    }

    public void cachetranspose(int rb, int re, int cb, int ce, Matrix T) {
        int r = re - rb, c = ce - cb;
        if (r <= 16 && c <= 16) {
            // base case: the tile is small enough to transpose directly
            for (int i = rb; i < re; i++) {
                for (int j = cb; j < ce; j++) {
                    T.data[j * rows + i] = data[i * columns + j];
                }
            }
        } else if (r >= c) {
            // otherwise split the larger dimension in half and recurse
            cachetranspose(rb, rb + (r / 2), cb, ce, T);
            cachetranspose(rb + (r / 2), re, cb, ce, T);
        } else {
            cachetranspose(rb, re, cb, cb + (c / 2), T);
            cachetranspose(rb, re, cb + (c / 2), ce, T);
        }
    }
}

More details on cache oblivious algorithms can be found here.

Kahaleel answered 19/1, 2015 at 14:36 Comment(3)
The link for additional information results in a 403 Forbidden error.Tankersley
Thanks, I fixed the broken link.Kahaleel
Link broken again.Tremor

Matrix multiplication comes to mind, but the cache issue there is much more pronounced, because each element is read N times.

With a matrix transpose, you read in a single linear pass and there's no way to optimize that. But you can process several rows simultaneously, so that you write to several columns at once and fill complete cache lines. You will only need three loops.

Or do it the other way around and read in columns while writing linearly.
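
One reading of that, sketched for a square column-major matrix (the group size of 8 is illustrative, and the clamp handles n not being a multiple of it):

// process `group` source columns at once: the inner loop writes consecutive
// dst elements (filling whole cache lines) while streaming down the cached
// source columns, one element from each per step of i
const int group = 8;   // e.g. one 64-byte cache line of doubles
for (int j = 0; j < n; j += group)
    for (int i = 0; i < n; ++i)
        for (int b = j; b < j + group && b < n; ++b)
            dst[b + i*n] = src[i + b*n];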

Purkey answered 4/3, 2011 at 23:48 Comment(1)
Good advice indeed. I'm not sure which way round is fastest; it depends on the machine architecture.Hellen

With a large matrix, possibly a large sparse matrix, it might be an idea to decompose it into smaller cache-friendly chunks (say, 4x4 sub-matrices). You can also flag sub-matrices as identity, which will help you create optimized code paths.
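
A sketch of what such a decomposition might look like (all names hypothetical):

#include <array>
#include <utility>

// hypothetical 4x4 tile: a dense payload plus a flag so identity tiles can
// be skipped entirely (an identity block is its own transpose)
struct Tile4 {
    bool is_identity = false;
    std::array<std::array<double, 4>, 4> v{};
};

void transpose_tile(Tile4 &t) {
    if (t.is_identity) return;
    for (int i = 0; i < 4; ++i)
        for (int j = i + 1; j < 4; ++j)
            std::swap(t.v[i][j], t.v[j][i]);
}
// transposing the whole matrix then means transposing the grid of tiles
// (swapping tile (i,j) with tile (j,i)) and each tile internally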

Emilie answered 5/3, 2011 at 11:56 Comment(0)

Here is sample C++ code for a cache-friendly parallelized matrix transposition:

#include <algorithm>
#include <array>
#include <cassert>
#include <execution>
#include <ranges>
#include <utility>

constexpr auto range(unsigned max)
    { return std::views::iota(0u, max); }
constexpr auto range(unsigned start, unsigned max)
    { return std::views::iota(start, max); }

constexpr auto SZ = 1000u;
std::array<std::array<int, SZ>, SZ> mx;

void testtranspose(void)
{
    for (unsigned i = 0; i < SZ; i++)
        for (unsigned j = 0; j < SZ; j++)
            mx[i][j] = int(j);

    // ***************
    //       cache friendly in place parallel transposition
    constexpr unsigned block_size = 10; assert(SZ % block_size == 0);
    std::for_each(std::execution::par_unseq,
        range(SZ / block_size).begin(), range(SZ / block_size).end(), 
        [&](const unsigned iib)
        {   //cache friendly transposition
            const unsigned ii = iib * block_size;
            //transpose the diagonal block
            for (unsigned i = ii; i < ii + block_size; i++)
                for (unsigned j = i + 1; j < ii + block_size; j++)
                    std::swap(mx[i][j], mx[j][i]);
            //transpose the rest of blocks
            for (unsigned jj = ii + block_size; jj < SZ; jj += block_size)
                for (unsigned i = ii; i < ii + block_size; i++)
                    for (unsigned j = jj; j < jj + block_size; j++)
                        std::swap(mx[i][j], mx[j][i]);
        });
    // ***************

    for (unsigned i = 0; i < SZ; i++)
        for (unsigned j = 0; j < SZ; j++)
            assert(mx[i][j] == int(i));
}
Vaios answered 21/7, 2023 at 16:36 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Turk
