Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

After conducting some experiments on square matrices of different sizes, a pattern emerged. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n + 1. For small values of n, the difference is minor.

Big differences occur, however, once the size passes 512 (at least for me).

Disclaimer: I know the function doesn't actually transpose the matrix, because each pair of elements ends up swapped twice, but that makes no difference to the timing.

Here is the code:

#define SAMPLES 1000
#define MATSIZE 512

#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];

void transpose()
{
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
   {
       int aux = mat[i][j];
       mat[i][j] = mat[j][i];
       mat[j][i] = aux;
   }
}

int main()
{
   //initialize matrix
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
       mat[i][j] = i+j;

   clock_t t = clock();
   for ( int i = 0 ; i < SAMPLES ; i++ )
       transpose();
   clock_t elapsed = clock() - t;

   // convert clock ticks to milliseconds before averaging
   std::cout << "Average for a matrix of " << MATSIZE << ": "
             << elapsed * 1000.0 / CLOCKS_PER_SEC / SAMPLES << " ms\n";
}
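For what it's worth, the same measurement can also be sketched with std::chrono (C++11) instead of clock(), which measures wall-clock time and sidesteps the tick-units question entirely; a minimal, self-contained version might look like this:

#include <chrono>
#include <iostream>

#define SAMPLES 1000
#define MATSIZE 512

int mat[MATSIZE][MATSIZE];

void transpose()
{
    for (int i = 0; i < MATSIZE; i++)
        for (int j = 0; j < MATSIZE; j++)
        {
            int aux = mat[i][j];
            mat[i][j] = mat[j][i];
            mat[j][i] = aux;
        }
}

int main()
{
    // initialize matrix
    for (int i = 0; i < MATSIZE; i++)
        for (int j = 0; j < MATSIZE; j++)
            mat[i][j] = i + j;

    auto t = std::chrono::steady_clock::now();
    for (int i = 0; i < SAMPLES; i++)
        transpose();
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - t;

    std::cout << "Average for a matrix of " << MATSIZE << ": "
              << elapsed.count() / SAMPLES << " ms\n";
}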

Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone, one for each size.

In my environment (MSVS 2010, full optimizations), the difference is similar:

  • size 512 - average 2.19 ms
  • size 513 - average 0.57 ms

Why is this happening?

Kirbie answered 10/7, 2012 at 13:00. Comments (6):
Your code looks cache-unfriendly to me. – Coriolanus
It's pretty much the same issue as this question: stackoverflow.com/questions/7905760/… – Ellene
Care to elaborate, @CodesInChaos? (Or anyone else.) – Cheapjack
@Coriolanus Yeah, I saw what you meant. I usually read through comments first. – Cheapjack
I've tried the code with gcc 4.6.1 on Ubuntu 11.10 with the default optimization (O0 level): the 513x513 matrix is slower than the 512x512 one (2.62 ms for size 513, 2.45 ms for size 512). With optimization O1 or above, the time for 512 is 2.08 ms and for 513 is 0.56 ms. CPU: Intel(R) Core(TM) i3 530 @ 2.93 GHz. Can anyone explain that? – Wendelina
@Wendelina It's kinda pointless to measure anything without optimizations. With optimizations disabled, the generated code will be littered with extraneous garbage that will hide other bottlenecks (such as memory). – Ellene

The explanation comes from Agner Fog in Optimizing software in C++, and it reduces to how data is accessed and stored in the cache.

For terms and detailed info, see the Wikipedia entry on caching; I'm going to narrow it down here.

A cache is organized in sets and lines. A given address maps to exactly one set, within which it can occupy any of the lines. The amount of memory one line can mirror, times the total number of lines, gives us the cache size.

For a particular memory address, we can calculate which set should mirror it with the formula:

set = ( address / lineSize ) % numberOfSets

Ideally, this sort of formula gives a uniform distribution across the sets, because every memory address is equally likely to be read (I said ideally).

It's clear that overlaps can occur. In case of a cache miss, the memory is read into the cache and an old value is replaced. Remember that each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

I'll try to somewhat follow the example from Agner:

Assume a cache with 32 sets, where each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. Then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set are already occupied, so reading it evicts an existing line from the set: the line that was initially holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).

The critical stride can also be calculated:

criticalStride = numberOfSets * lineSize

Variables spaced criticalStride or a multiple of it apart contend for the same cache lines.
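To make the example concrete, here is a minimal sketch that applies both formulas, assuming the same 32-set, 4-way, 64-byte-line geometry as above:

#include <cstdio>

int main()
{
    // cache geometry assumed in the example above:
    // 32 sets, 4 lines per set, 64 bytes per line (an 8 KB cache)
    const unsigned lineSize = 64;
    const unsigned numberOfSets = 32;

    // criticalStride = numberOfSets * lineSize
    printf("critical stride = %u bytes (0x%x)\n",
           numberOfSets * lineSize, numberOfSets * lineSize);

    const unsigned addresses[] = { 0x2710, 0x2F00, 0x3700, 0x3F00, 0x4700 };
    for (unsigned address : addresses)
    {
        // set = (address / lineSize) % numberOfSets
        printf("address 0x%04x -> set %u\n",
               address, (address / lineSize) % numberOfSets);
    }
    // prints set 28 for all five addresses: with only 4 lines per
    // set, the read of 0x4700 evicts the line loaded for 0x2710
}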

This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):

Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8 KB cache, 4 lines per set and a line size of 64 bytes; that gives the 32 sets from before. Each line can hold 8 of the elements in the matrix (64-bit ints).

The critical stride would be 2048 bytes, which corresponds to 4 rows of the matrix (which is contiguous in memory).

Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements of column 28. The first 8 elements of the row make up one cache line, but they'll go into 8 different cache lines in column 28. Remember, the critical stride is 4 rows, so elements 4 rows apart in a column contend for the same set.

When element 16 is reached in the column (4 cache lines per set & same-set elements 4 rows apart = trouble), the line that was holding element 0 will be evicted from the cache. When we reach the end of the column, every previously loaded cache line will have been lost and will need reloading on the next access (the whole line gets overwritten).

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster: we're no longer dealing with elements that are a critical stride apart vertically, so the number of cache reloads is severely reduced.
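A quick sketch makes the difference visible: it prints the set that each element of column 28 maps to, for a row width of 64 versus 65 elements, assuming the same 32-set, 64-byte-line cache:

#include <cstdio>

// print the cache set of matrix element (row, 28) for the first 16
// rows, given a row width of n 64-bit elements and the cache assumed
// above (32 sets, 64-byte lines)
static void printColumnSets(unsigned n)
{
    const unsigned lineSize = 64, numberOfSets = 32, elemSize = 8;
    printf("width %2u:", n);
    for (unsigned row = 0; row < 16; ++row)
    {
        unsigned address = (row * n + 28) * elemSize;
        printf(" %2u", (address / lineSize) % numberOfSets);
    }
    printf("\n");
}

int main()
{
    printColumnSets(64); // sets repeat every 4 rows: 3 11 19 27 3 11 ...
    printColumnSets(65); // the pattern drifts, so no set gets overloaded
}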

Another disclaimer: I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)

Kirbie answered 10/7, 2012 at 13:00. Comments (7):
Oh, and next time just ping me directly through the Lounge. I don't find every instance of my name on SO. :) I only saw this through the periodic email notifications. – Ellene
@Ellene @Luchian Grigore One of my friends tells me that his Intel Core i3 PC running Ubuntu 11.04 i386 demonstrates almost the same performance with gcc 4.6. The same goes for my Intel Core 2 Duo with MinGW gcc 4.4 running Windows 7 (32-bit). It does show a big difference when I compile this snippet on a slightly older PC, an Intel Centrino with gcc 4.6 running Ubuntu 12.04 i386. – Bille
Also note that memory accesses whose addresses differ by a multiple of 4096 have a false dependency on Intel SnB-family CPUs (i.e. the same offset within a page). This can reduce throughput when some of the operations are stores, especially a mix of loads and stores. – Devitalize
"which goes in set 24" – did you mean "in set 28" instead? And do you assume 32 sets? – Donelu
You are correct, it's 28. :) I also double-checked the linked paper; for the original explanation you can navigate to 9.2 Cache organization. – Kirbie
I feel like something is missing. What is the numberOfSets for the first example? It seems to be 0x80 = 128 from the way the numbers work out, but I don't know why it should be. – Roehm
The sentence "Assume a matrix of 64x64 ... with an 8kb cache" – the 8kb should be corrected to 8 KB. – Silvertongued

As an illustration of the explanation in Luchian Grigore's answer, here's what the matrix cache presence looks like for the two cases of 64x64 and 65x65 matrices (see the link above for details on the numbers).

Colors in the animations below mean the following:

  • white – not in cache,
  • light green – in cache,
  • bright green – cache hit,
  • orange – just read from RAM,
  • red – cache miss.

The 64x64 case:

[cache presence animation for the 64x64 matrix]

Notice how almost every access to a new row results in a cache miss. And here is how it looks for the normal case, a 65x65 matrix:

[cache presence animation for the 65x65 matrix]

Here you can see that most of the accesses after the initial warm-up are cache hits. This is how a CPU cache is intended to work in general.


The code that generated frames for the above animations can be seen here.

Donelu answered 29/12, 2017 at 22:34. Comments (9):
Why are the vertical scanning cache hits not preserved in the first case, but they are in the second? It seems like a given block is accessed exactly once for most blocks in both examples. – Roehm
I can see from @LuchianGrigore's answer that it is because all of the lines in the column belong to the same set. – Roehm
Yes, great illustration. I see that they run at the same speed. But actually they wouldn't, would they? – Duodenary
@Duodenary Yes, the animation FPS is the same. I didn't simulate the slowdown; only the colors are important here. – Donelu
It would be interesting to have two static images illustrating the different cache sets. – Roehm
@Donelu What tool did you use to draw these GIF images? – Resistless
@Resistless The simulation frames were rendered by my own program into PNG (via Qt's QImage), and then the sequence was transformed to GIF with ImageMagick's convert tool. – Donelu
Great illustration. Quick question: in your code, you evict a cache line on every memory read; shouldn't eviction only start once the cache is full? – Carte
@Carte We only get to that line if we don't get a hit and don't find a free line. Notice the early returns in the loop above. – Donelu

Luchian gives an explanation of why this behavior happens, but I thought it'd be a nice idea to show one possible solution to this problem and, at the same time, show a bit about cache-oblivious algorithms.

Your algorithm basically does:

for (int i = 0; i < N; i++) 
   for (int j = 0; j < N; j++) 
        A[j][i] = A[i][j];

which is just horrible for a modern CPU. One solution is to know the details of your cache system and tweak the algorithm to avoid those problems. That works great as long as you know those details, but it's not especially portable.
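As a sketch of that cache-aware approach, a loop-blocked (tiled) transpose copies one small tile at a time, so both the row-wise reads and the column-wise writes stay within a few cache lines; the tile size is precisely the non-portable, cache-dependent knob. This assumes mat and MATSIZE from the question, and BLOCK = 16 is a made-up value you'd have to tune:

#include <algorithm>

// tiled transpose: process one BLOCK x BLOCK tile at a time so the
// column-wise accesses reuse the cache lines loaded for the tile
void blockedTranspose()
{
    const int BLOCK = 16; // tune to the cache at hand
    for (int bi = 0; bi < MATSIZE; bi += BLOCK)
        for (int bj = 0; bj < MATSIZE; bj += BLOCK)
            for (int i = bi, ie = std::min(bi + BLOCK, MATSIZE); i < ie; i++)
                for (int j = bj, je = std::min(bj + BLOCK, MATSIZE); j < je; j++)
                    mat[j][i] = mat[i][j];
}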

Can we do better than that? Yes we can: a general approach to this problem is cache-oblivious algorithms, which, as the name says, avoid depending on specific cache sizes [1].

The solution would look like this:

void recursiveTranspose(int i0, int i1, int j0, int j1) {
    int di = i1 - i0, dj = j1 - j0;
    const int LEAFSIZE = 32; // well ok caching still affects this one here
    if (di >= dj && di > LEAFSIZE) {
        int im = (i0 + i1) / 2;
        recursiveTranspose(i0, im, j0, j1);
        recursiveTranspose(im, i1, j0, j1);
    } else if (dj > LEAFSIZE) {
        int jm = (j0 + j1) / 2;
        recursiveTranspose(i0, i1, j0, jm);
        recursiveTranspose(i0, i1, jm, j1);
    } else {
        for (int i = i0; i < i1; i++)
            for (int j = j0; j < j1; j++)
                mat[j][i] = mat[i][j];
    }
}

Slightly more complex, but a short test shows something quite interesting on my ancient E8400 with a VS2010 x64 release build. Test code for MATSIZE 8192:

#include <windows.h> // QueryPerformanceCounter / QueryPerformanceFrequency
#include <cstdio>

int main() {
    LARGE_INTEGER start, end, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    recursiveTranspose(0, MATSIZE, 0, MATSIZE);
    QueryPerformanceCounter(&end);
    printf("recursive: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));

    QueryPerformanceCounter(&start);
    transpose();
    QueryPerformanceCounter(&end);
    printf("iterative: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));
    return 0;
}

results: 
recursive: 480.58ms
iterative: 3678.46ms

Edit: About the influence of size: it is much less pronounced, although still noticeable to some degree. That's because we're using the iterative solution as a leaf node instead of recursing down to 1 (the usual optimization for recursive algorithms). If we set LEAFSIZE = 1, the cache has no influence for me (8193: 1214.06 ms; 8192: 1171.62 ms; 8191: 1351.07 ms). That's inside the margin of error; the fluctuations are in the 100 ms area, so this "benchmark" isn't something I'd be too comfortable with if we wanted completely accurate values.

[1] Sources for this stuff: well, if you can't get a lecture from someone who worked with Leiserson and co. on this, I assume their papers are a good starting point. These algorithms are still quite rarely described; CLR has a single footnote about them. Still, it's a great way to surprise people.


Edit (note: I'm not the one who posted this answer; I just wanted to add this):
Here's a complete C++ version of the above code:

template<class InIt, class OutIt>
void transpose(InIt const input, OutIt const output,
    size_t const rows, size_t const columns,
    size_t const r1 = 0, size_t const c1 = 0,
    size_t r2 = ~(size_t) 0, size_t c2 = ~(size_t) 0,
    size_t const leaf = 0x20)
{
    // ~(size_t)0 means "not specified": default to the whole matrix
    if (!~c2) { c2 = columns - c1; }
    if (!~r2) { r2 = rows - r1; }
    size_t const di = r2 - r1, dj = c2 - c1;
    if (di >= dj && di > leaf)
    {
        transpose(input, output, rows, columns, r1, c1, (r1 + r2) / 2, c2);
        transpose(input, output, rows, columns, (r1 + r2) / 2, c1, r2, c2);
    }
    else if (dj > leaf)
    {
        transpose(input, output, rows, columns, r1, c1, r2, (c1 + c2) / 2);
        transpose(input, output, rows, columns, r1, (c1 + c2) / 2, r2, c2);
    }
    else
    {
        for (ptrdiff_t i1 = (ptrdiff_t) r1, i2 = (ptrdiff_t) (i1 * columns);
            i1 < (ptrdiff_t) r2; ++i1, i2 += (ptrdiff_t) columns)
        {
            for (ptrdiff_t j1 = (ptrdiff_t) c1, j2 = (ptrdiff_t) (j1 * rows);
                j1 < (ptrdiff_t) c2; ++j1, j2 += (ptrdiff_t) rows)
            {
                output[j2 + i1] = input[i2 + j1];
            }
        }
    }
}
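
A minimal usage sketch, assuming the template above is in scope (the sizes here are made up):

#include <cstddef>
#include <vector>

int main()
{
    size_t const rows = 3, columns = 4;
    std::vector<int> in(rows * columns), out(columns * rows);
    for (size_t i = 0; i < in.size(); ++i)
        in[i] = (int)i;

    // transpose the row-major rows x columns matrix in `in` into the
    // row-major columns x rows matrix `out`
    transpose(in.begin(), out.begin(), rows, columns);
    // out now holds { 0,4,8, 1,5,9, 2,6,10, 3,7,11 }
    return 0;
}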
Outstation answered 10/7, 2012 at 13:26. Comments (9):
This would be relevant if you compared the times between matrices of different sizes, not the recursive and iterative versions. Try the recursive solution on a matrix of the sizes specified. – Kirbie
@Luchian Since you already explained why he's seeing the behavior, I thought it quite interesting to introduce one solution to this problem in general. – Outstation
Because I'm questioning why a larger matrix takes a shorter time to process, not looking for a speedier algorithm... – Kirbie
@Luchian The differences between 16383 and 16384 are... 28 vs. 27 ms for me here, or about 3.5%, not really significant. And I'd be surprised if it were. – Outstation
@Luchian Why wouldn't 16383 count? 16384 is a multiple of 2 and 512; 16383 should mess the critical stride up, shouldn't it? With the iterative version I get a much lower result for 511 vs. 512 as well, and that's way more than the slightly smaller amount of work it has to do. – Outstation
Ah, right about that. What if you run 16383 and 16384 with the iterative solution? Is the difference much greater? – Kirbie
@Luchian 4541.32 ms vs. 15010.51 ms, as expected. I mean, clearly using the LEAFSIZE constant means my algorithm isn't completely cache-oblivious (but obviously faster than recursing down to 1), so I'd expect some small effect as well. Increasing the leaf size should make that effect more dramatic, since in the end we'd arrive back at the iterative solution. I just thought an algorithm that would be not (or much less) influenced by caching would be a good fit for this question, as an addendum to your answer. – Outstation
It might be interesting to explain what recursiveTranspose does, i.e. that it does not fill up the cache as much by operating on small tiles (of LEAFSIZE x LEAFSIZE dimension). – Coverage
Also, what happens if LEAFSIZE is a critical stride? – Kirbie
