Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

After conducting some experiments on square matrices of different sizes, a pattern emerged. Invariably, transposing a matrix of size 2^n is slower than transposing one of size 2^n + 1. For small values of n, the difference is minor.

Big differences occur, however, once the size passes 512 (at least for me).

Disclaimer: I know the function doesn't actually transpose the matrix, because each pair of elements ends up swapped twice, but that makes no difference to the timing.

Here is the code:

#define SAMPLES 1000
#define MATSIZE 512

#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];

void transpose()
{
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
   {
       int aux = mat[i][j];
       mat[i][j] = mat[j][i];
       mat[j][i] = aux;
   }
}

int main()
{
   //initialize matrix
   for ( int i = 0 ; i < MATSIZE ; i++ )
   for ( int j = 0 ; j < MATSIZE ; j++ )
       mat[i][j] = i+j;

   clock_t t = clock();
   for ( int i = 0 ; i < SAMPLES ; i++ )
       transpose();
   clock_t elapsed = clock() - t;

   // convert clock ticks to milliseconds before averaging
   std::cout << "Average for a matrix of " << MATSIZE << ": "
             << elapsed * 1000.0 / CLOCKS_PER_SEC / SAMPLES << " ms\n";
}
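For what it's worth, the same measurement can also be sketched with std::chrono (C++11) instead of clock(), which measures wall-clock time and sidesteps the tick-units question entirely; a minimal, self-contained version might look like this:

#include <chrono>
#include <iostream>

#define SAMPLES 1000
#define MATSIZE 512

int mat[MATSIZE][MATSIZE];

void transpose()
{
    for (int i = 0; i < MATSIZE; i++)
        for (int j = 0; j < MATSIZE; j++)
        {
            int aux = mat[i][j];
            mat[i][j] = mat[j][i];
            mat[j][i] = aux;
        }
}

int main()
{
    // initialize matrix
    for (int i = 0; i < MATSIZE; i++)
        for (int j = 0; j < MATSIZE; j++)
            mat[i][j] = i + j;

    auto t = std::chrono::steady_clock::now();
    for (int i = 0; i < SAMPLES; i++)
        transpose();
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - t;

    std::cout << "Average for a matrix of " << MATSIZE << ": "
              << elapsed.count() / SAMPLES << " ms\n";
}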

Changing MATSIZE lets us alter the size (duh!). I posted two versions on ideone, one for each size.

In my environment (MSVS 2010, full optimizations), the difference is similar:

  • size 512 - average 2.19 ms
  • size 513 - average 0.57 ms

Why is this happening?

Kirbie answered 10/7, 2012 at 13:00. Comments (6):
Your code looks cache-unfriendly to me. – Coriolanus
It's pretty much the same issue as this question: stackoverflow.com/questions/7905760/… – Ellene
Care to elaborate, @CodesInChaos? (Or anyone else.) – Cheapjack
@Coriolanus Yeah, I saw what you meant. I usually read through comments first. – Cheapjack
I've tried the code with gcc 4.6.1 on Ubuntu 11.10 with the default optimization (O0 level): the 513x513 matrix is slower than the 512x512 one (2.62 ms for size 513, 2.45 ms for size 512). With optimization O1 or above, the time for 512 is 2.08 ms and for 513 is 0.56 ms. CPU: Intel(R) Core(TM) i3 530 @ 2.93 GHz. Can anyone explain that? – Wendelina
@Wendelina It's kinda pointless to measure anything without optimizations. With optimizations disabled, the generated code will be littered with extraneous garbage that will hide other bottlenecks (such as memory). – Ellene

The explanation comes from Agner Fog in Optimizing software in C++, and it reduces to how data is accessed and stored in the cache.

For terms and detailed info, see the Wikipedia entry on caching; I'm going to narrow it down here.

A cache is organized in sets and lines. A given address maps to exactly one set, within which it can occupy any of the lines. The amount of memory one line can mirror, times the total number of lines, gives us the cache size.

For a particular memory address, we can calculate which set should mirror it with the formula:

set = ( address / lineSize ) % numberOfSets

Ideally, this sort of formula gives a uniform distribution across the sets, because every memory address is equally likely to be read (I said ideally).

It's clear that overlaps can occur. In case of a cache miss, the memory is read into the cache and an old value is replaced. Remember that each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

I'll try to somewhat follow the example from Agner:

Assume a cache with 32 sets, where each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. Then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set are already occupied, so reading it evicts an existing line from the set: the line that was initially holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).

The critical stride can also be calculated:

criticalStride = numberOfSets * lineSize

Variables spaced criticalStride or a multiple of it apart contend for the same cache lines.
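To make the example concrete, here is a minimal sketch that applies both formulas, assuming the same 32-set, 4-way, 64-byte-line geometry as above:

#include <cstdio>

int main()
{
    // cache geometry assumed in the example above:
    // 32 sets, 4 lines per set, 64 bytes per line (an 8 KB cache)
    const unsigned lineSize = 64;
    const unsigned numberOfSets = 32;

    // criticalStride = numberOfSets * lineSize
    printf("critical stride = %u bytes (0x%x)\n",
           numberOfSets * lineSize, numberOfSets * lineSize);

    const unsigned addresses[] = { 0x2710, 0x2F00, 0x3700, 0x3F00, 0x4700 };
    for (unsigned address : addresses)
    {
        // set = (address / lineSize) % numberOfSets
        printf("address 0x%04x -> set %u\n",
               address, (address / lineSize) % numberOfSets);
    }
    // prints set 28 for all five addresses: with only 4 lines per
    // set, the read of 0x4700 evicts the line loaded for 0x2710
}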

This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):

Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8 KB cache, 4 lines per set and a line size of 64 bytes; that gives the 32 sets from before. Each line can hold 8 of the elements in the matrix (64-bit ints).

The critical stride would be 2048 bytes, which corresponds to 4 rows of the matrix (which is contiguous in memory).

Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements of column 28. The first 8 elements of the row make up one cache line, but they'll go into 8 different cache lines in column 28. Remember, the critical stride is 4 rows, so elements 4 rows apart in a column contend for the same set.

When element 16 is reached in the column (4 cache lines per set & same-set elements 4 rows apart = trouble), the line that was holding element 0 will be evicted from the cache. When we reach the end of the column, every previously loaded cache line will have been lost and will need reloading on the next access (the whole line gets overwritten).

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster: we're no longer dealing with elements that are a critical stride apart vertically, so the number of cache reloads is severely reduced.
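A quick sketch makes the difference visible: it prints the set that each element of column 28 maps to, for a row width of 64 versus 65 elements, assuming the same 32-set, 64-byte-line cache:

#include <cstdio>

// print the cache set of matrix element (row, 28) for the first 16
// rows, given a row width of n 64-bit elements and the cache assumed
// above (32 sets, 64-byte lines)
static void printColumnSets(unsigned n)
{
    const unsigned lineSize = 64, numberOfSets = 32, elemSize = 8;
    printf("width %2u:", n);
    for (unsigned row = 0; row < 16; ++row)
    {
        unsigned address = (row * n + 28) * elemSize;
        printf(" %2u", (address / lineSize) % numberOfSets);
    }
    printf("\n");
}

int main()
{
    printColumnSets(64); // sets repeat every 4 rows: 3 11 19 27 3 11 ...
    printColumnSets(65); // the pattern drifts, so no set gets overloaded
}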

Another disclaimer: I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)

Kirbie answered 10/7, 2012 at 13:00. Comments (7):
Oh, and next time just ping me directly through the Lounge. I don't find every instance of my name on SO. :) I only saw this through the periodic email notifications. – Ellene
@Ellene @Luchian Grigore One of my friends tells me that his Intel Core i3 PC running Ubuntu 11.04 i386 demonstrates almost the same performance with gcc 4.6. The same goes for my Intel Core 2 Duo with MinGW gcc 4.4 running Windows 7 (32-bit). It does show a big difference when I compile this snippet on a slightly older PC, an Intel Centrino with gcc 4.6 running Ubuntu 12.04 i386. – Bille
Also note that memory accesses whose addresses differ by a multiple of 4096 have a false dependency on Intel SnB-family CPUs (i.e. the same offset within a page). This can reduce throughput when some of the operations are stores, especially a mix of loads and stores. – Devitalize
"which goes in set 24" – did you mean "in set 28" instead? And do you assume 32 sets? – Donelu
You are correct, it's 28. :) I also double-checked the linked paper; for the original explanation you can navigate to 9.2 Cache organization. – Kirbie
I feel like something is missing. What is the numberOfSets for the first example? It seems to be 0x80 = 128 from the way the numbers work out, but I don't know why it should be. – Roehm
The sentence "Assume a matrix of 64x64 ... with an 8kb cache" – the 8kb should be corrected to 8 KB. – Silvertongued

As an illustration of the explanation in Luchian Grigore's answer, here's what the matrix cache presence looks like for the two cases of 64x64 and 65x65 matrices (see the link above for details on the numbers).

Colors in the animations below mean the following:

  • white – not in cache,
  • light green – in cache,
  • bright green – cache hit,
  • orange – just read from RAM,
  • red – cache miss.

The 64x64 case:

[cache presence animation for the 64x64 matrix]

Notice how almost every access to a new row results in a cache miss. And here is how it looks for the normal case, a 65x65 matrix:

[cache presence animation for the 65x65 matrix]

Here you can see that most of the accesses after the initial warm-up are cache hits. This is how a CPU cache is intended to work in general.


The code that generated frames for the above animations can be seen here.

Donelu answered 29/12, 2017 at 22:34. Comments (9):
Why are the vertical scanning cache hits not preserved in the first case, but they are in the second? It seems like a given block is accessed exactly once for most blocks in both examples. – Roehm
I can see from @LuchianGrigore's answer that it is because all of the lines in the column belong to the same set. – Roehm
Yes, great illustration. I see that they run at the same speed. But actually they wouldn't, would they? – Duodenary
@Duodenary Yes, the animation FPS is the same. I didn't simulate the slowdown; only the colors are important here. – Donelu
It would be interesting to have two static images illustrating the different cache sets. – Roehm
@Donelu What tool did you use to draw these GIF images? – Resistless
@Resistless The simulation frames were rendered by my own program into PNG (via Qt's QImage), and then the sequence was transformed to GIF with ImageMagick's convert tool. – Donelu
Great illustration. Quick question: in your code, you evict a cache line on every memory read; shouldn't eviction only start once the cache is full? – Carte
@Carte We only get to that line if we don't get a hit and don't find a free line. Notice the early returns in the loop above. – Donelu

Luchian gives an explanation of why this behavior happens, but I thought it'd be a nice idea to show one possible solution to this problem and, at the same time, show a bit about cache-oblivious algorithms.

Your algorithm basically does:

for (int i = 0; i < N; i++) 
   for (int j = 0; j < N; j++) 
        A[j][i] = A[i][j];

which is just horrible for a modern CPU. One solution is to know the details of your cache system and tweak the algorithm to avoid those problems. That works great as long as you know those details, but it's not especially portable.
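As a sketch of that cache-aware approach, a loop-blocked (tiled) transpose copies one small tile at a time, so both the row-wise reads and the column-wise writes stay within a few cache lines; the tile size is precisely the non-portable, cache-dependent knob. This assumes mat and MATSIZE from the question, and BLOCK = 16 is a made-up value you'd have to tune:

#include <algorithm>

// tiled transpose: process one BLOCK x BLOCK tile at a time so the
// column-wise accesses reuse the cache lines loaded for the tile
void blockedTranspose()
{
    const int BLOCK = 16; // tune to the cache at hand
    for (int bi = 0; bi < MATSIZE; bi += BLOCK)
        for (int bj = 0; bj < MATSIZE; bj += BLOCK)
            for (int i = bi, ie = std::min(bi + BLOCK, MATSIZE); i < ie; i++)
                for (int j = bj, je = std::min(bj + BLOCK, MATSIZE); j < je; j++)
                    mat[j][i] = mat[i][j];
}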

Can we do better than that? Yes we can: a general approach to this problem is cache-oblivious algorithms, which, as the name says, avoid depending on specific cache sizes [1].

The solution would look like this:

void recursiveTranspose(int i0, int i1, int j0, int j1) {
    int di = i1 - i0, dj = j1 - j0;
    const int LEAFSIZE = 32; // well ok caching still affects this one here
    if (di >= dj && di > LEAFSIZE) {
        int im = (i0 + i1) / 2;
        recursiveTranspose(i0, im, j0, j1);
        recursiveTranspose(im, i1, j0, j1);
    } else if (dj > LEAFSIZE) {
        int jm = (j0 + j1) / 2;
        recursiveTranspose(i0, i1, j0, jm);
        recursiveTranspose(i0, i1, jm, j1);
    } else {
        for (int i = i0; i < i1; i++)
            for (int j = j0; j < j1; j++)
                mat[j][i] = mat[i][j];
    }
}

Slightly more complex, but a short test shows something quite interesting on my ancient E8400 with a VS2010 x64 release build. Test code for MATSIZE 8192:

#include <windows.h> // QueryPerformanceCounter / QueryPerformanceFrequency
#include <cstdio>

int main() {
    LARGE_INTEGER start, end, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    recursiveTranspose(0, MATSIZE, 0, MATSIZE);
    QueryPerformanceCounter(&end);
    printf("recursive: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));

    QueryPerformanceCounter(&start);
    transpose();
    QueryPerformanceCounter(&end);
    printf("iterative: %.2fms\n", (end.QuadPart - start.QuadPart) / (double(freq.QuadPart) / 1000));
    return 0;
}

results: 
recursive: 480.58ms
iterative: 3678.46ms

Edit: About the influence of size: it is much less pronounced, although still noticeable to some degree. That's because we're using the iterative solution as a leaf node instead of recursing down to 1 (the usual optimization for recursive algorithms). If we set LEAFSIZE = 1, the cache has no influence for me (8193: 1214.06 ms; 8192: 1171.62 ms; 8191: 1351.07 ms). That's inside the margin of error; the fluctuations are in the 100 ms area, so this "benchmark" isn't something I'd be too comfortable with if we wanted completely accurate values.

[1] Sources for this stuff: well, if you can't get a lecture from someone who worked with Leiserson and co. on this, I assume their papers are a good starting point. These algorithms are still quite rarely described; CLR has a single footnote about them. Still, it's a great way to surprise people.


Edit (note: I'm not the one who posted this answer; I just wanted to add this):
Here's a complete C++ version of the above code:

template<class InIt, class OutIt>
void transpose(InIt const input, OutIt const output,
    size_t const rows, size_t const columns,
    size_t const r1 = 0, size_t const c1 = 0,
    size_t r2 = ~(size_t) 0, size_t c2 = ~(size_t) 0,
    size_t const leaf = 0x20)
{
    // ~(size_t)0 means "not specified": default to the whole matrix
    if (!~c2) { c2 = columns - c1; }
    if (!~r2) { r2 = rows - r1; }
    size_t const di = r2 - r1, dj = c2 - c1;
    if (di >= dj && di > leaf)
    {
        transpose(input, output, rows, columns, r1, c1, (r1 + r2) / 2, c2);
        transpose(input, output, rows, columns, (r1 + r2) / 2, c1, r2, c2);
    }
    else if (dj > leaf)
    {
        transpose(input, output, rows, columns, r1, c1, r2, (c1 + c2) / 2);
        transpose(input, output, rows, columns, r1, (c1 + c2) / 2, r2, c2);
    }
    else
    {
        for (ptrdiff_t i1 = (ptrdiff_t) r1, i2 = (ptrdiff_t) (i1 * columns);
            i1 < (ptrdiff_t) r2; ++i1, i2 += (ptrdiff_t) columns)
        {
            for (ptrdiff_t j1 = (ptrdiff_t) c1, j2 = (ptrdiff_t) (j1 * rows);
                j1 < (ptrdiff_t) c2; ++j1, j2 += (ptrdiff_t) rows)
            {
                output[j2 + i1] = input[i2 + j1];
            }
        }
    }
}
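
A minimal usage sketch, assuming the template above is in scope (the sizes here are made up):

#include <cstddef>
#include <vector>

int main()
{
    size_t const rows = 3, columns = 4;
    std::vector<int> in(rows * columns), out(columns * rows);
    for (size_t i = 0; i < in.size(); ++i)
        in[i] = (int)i;

    // transpose the row-major rows x columns matrix in `in` into the
    // row-major columns x rows matrix `out`
    transpose(in.begin(), out.begin(), rows, columns);
    // out now holds { 0,4,8, 1,5,9, 2,6,10, 3,7,11 }
    return 0;
}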
Outstation answered 10/7, 2012 at 13:26. Comments (9):
This would be relevant if you compared the times between matrices of different sizes, not the recursive and iterative versions. Try the recursive solution on a matrix of the sizes specified. – Kirbie
@Luchian Since you already explained why he's seeing the behavior, I thought it quite interesting to introduce one solution to this problem in general. – Outstation
Because I'm questioning why a larger matrix takes a shorter time to process, not looking for a speedier algorithm... – Kirbie
@Luchian The differences between 16383 and 16384 are... 28 vs. 27 ms for me here, or about 3.5%, not really significant. And I'd be surprised if it were. – Outstation
@Luchian Why wouldn't 16383 count? 16384 is a multiple of 2 and 512; 16383 should mess the critical stride up, shouldn't it? With the iterative version I get a much lower result for 511 vs. 512 as well, and that's way more than the slightly smaller amount of work it has to do. – Outstation
Ah, right about that. What if you run 16383 and 16384 with the iterative solution? Is the difference much greater? – Kirbie
@Luchian 4541.32 ms vs. 15010.51 ms, as expected. I mean, clearly using the LEAFSIZE constant means my algorithm isn't completely cache-oblivious (but obviously faster than recursing down to 1), so I'd expect some small effect as well. Increasing the leaf size should make that effect more dramatic, since in the end we'd arrive back at the iterative solution. I just thought an algorithm that would be not (or much less) influenced by caching would be a good fit for this question, as an addendum to your answer. – Outstation
It might be interesting to explain what recursiveTranspose does, i.e. that it does not fill up the cache as much by operating on small tiles (of LEAFSIZE x LEAFSIZE dimension). – Coverage
Also, what happens if LEAFSIZE is a critical stride? – Kirbie
