Cache Memory Optimization Array Transpose: C

typedef int array[2][2];

void transpose(array dst, array src) {
    int i, j;
    for (j = 0; j < 2; j++) {
        for (i = 0; i < 2; i++) {
            dst[i][j] = src[j][i];
        }
    }
}

src array starts at address 0 and the dst array starts at address 0x10.

L1 data cache, direct map, write-allocate, 8 byte block size.

cache total size is 16 data bytes.

What is the hit or miss on each entry of src and dst array?

The answer is:

src: 
[0][0] -> miss,
[0][1] -> miss,
[1][0] -> miss,
[1][1] -> hit

dst:
[0][0] -> miss,
[0][1] -> miss,
[1][0] -> miss,
[1][1] -> miss

If the cache total size is 32 data bytes, the answer is:

src: 
[0][0] -> miss,
[0][1] -> hit,
[1][0] -> miss,
[1][1] -> hit

dst:
[0][0] -> miss,
[0][1] -> hit,
[1][0] -> miss,
[1][1] -> hit

I am unsure of both outcomes. I don't really understand the concept with arrays and caching.

So, in the first instance you have two cache lines of 8 bytes each for a total of 16 bytes. I'll assume an int data size of 4 bytes. Given the placement of arrays in C and the offsets you've provided these are the memory lines which can be cached:

Cacheable lines:
#A: &src[0][0] = 0x00, &src[0][1] = 0x04
#B: &src[1][0] = 0x08, &src[1][1] = 0x0C
#C: &dst[0][0] = 0x10, &dst[0][1] = 0x14
#D: &dst[1][0] = 0x18, &dst[1][1] = 0x1C

Then we need to know the access order that each memory address is visited by the program. I'm assuming no optimizations which might cause reorderings by the compiler.

Access order and cache behavior (assuming initially empty):
#1: load src[0][0] --> Miss line A = cache slot 1
#2: save dst[0][0] --> Miss line C = cache slot 2
#3: load src[0][1] --> Hit  line A = cache slot 1
#4: save dst[0][1] --> Hit  line C = cache slot 2
#5: load src[1][0] --> Miss line B = cache slot 1 (LRU, replaces line A)
#6: save dst[1][0] --> Miss line D = cache slot 2 (LRU, replaces line C)
#7: load src[1][1] --> Hit  line B = cache slot 1
#8: save dst[1][1] --> Hit  line D = cache slot 2

Which, I think, matches your second answer. Then with a cache size of 32 bytes (4 lines), assuming all other factors are constant:

Access order and cache behavior (assuming initially empty):
#1: load src[0][0] --> Miss line A = cache slot 1
#2: save dst[0][0] --> Miss line C = cache slot 2
#3: load src[0][1] --> Hit  line A = cache slot 1
#4: save dst[0][1] --> Hit  line C = cache slot 2
#5: load src[1][0] --> Miss line B = cache slot 3
#6: save dst[1][0] --> Miss line D = cache slot 4
#7: load src[1][1] --> Hit  line B = cache slot 3
#8: save dst[1][1] --> Hit  line D = cache slot 4

They are identical. The only difference would be if you reran transpose again. In case 1 you would get the exact same behavior (well, you'ld start with the cache full of all the wrong things, so it might as well be empty). In the larger cache case, though, everything you need for the second call is already cached, so there will be no cache misses.

The difference between my answers and yours is most likely due to our assumptions about the behavior of the compiler for your loop count registers (i and j). I would assume they are both stored in registers (and so would not affect the data cache). You may need to assume they are in memory somewhere (and therefore interact with the cache) to get the expected results.

Recommended topics

Hot tags