Write a program to get CPU cache sizes and levels
I want to write a program to measure my CPU's cache sizes (L1, L2, L3). I know the general idea:

  1. Allocate a big array.
  2. Time accesses to a differently sized portion of it each time.

So I wrote a little program. Here's my code:

#include <cstdio>
#include <time.h>
#include <sys/mman.h>

const int KB = 1024;
const int MB = 1024 * KB;
const int data_size = 32 * MB;
const int repeats = 64 * MB;
const int times = 8;

long long clock_time() {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll;
}

int main() {
    // allocate the working set
    void* map = mmap(NULL, (size_t)data_size, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (map == MAP_FAILED) {
        return 1;
    }
    int* data = (int*)map;

    // touch every element up front so demand paging doesn't distort the timings
    for (int i = 0; i < (int)(data_size / sizeof(int)); i++) {
        data[i]++;
    }

    int steps[] = { 1 * KB, 4 * KB, 8 * KB, 16 * KB, 24 * KB, 32 * KB, 64 * KB,
                    128 * KB, 256 * KB, 384 * KB, 512 * KB, 1 * MB, 2 * MB,
                    3 * MB, 4 * MB, 5 * MB, 6 * MB, 7 * MB, 8 * MB, 9 * MB };
    for (int i = 0; i < (int)(sizeof(steps) / sizeof(steps[0])); i++) {
        double totalTime = 0;
        for (int k = 0; k < times; k++) {
            int size_mask = steps[i] / sizeof(int) - 1;
            long long start = clock_time();
            for (int j = 0; j < repeats; j++) {
                ++data[(j * 16) & size_mask];
            }
            long long end = clock_time();
            totalTime += (end - start) / 1000000000.0;
        }
        printf("%d time: %lf\n", steps[i] / KB, totalTime);
    }
    munmap(map, (size_t)data_size);
    return 0;
}

However, the results are confusing:

1 time: 1.989998
4 time: 1.992945
8 time: 1.997071
16 time: 1.993442
24 time: 1.994212
32 time: 2.002103
64 time: 1.959601
128 time: 1.957994
256 time: 1.975517
384 time: 1.975143
512 time: 2.209696
1024 time: 2.437783
2048 time: 7.006168
3072 time: 5.306975
4096 time: 5.943510
5120 time: 2.396078
6144 time: 4.404022
7168 time: 4.900366
8192 time: 8.998624
9216 time: 6.574195

My CPU is an Intel(R) Core(TM) i3-2350M: 32K L1 data cache, 256K L2, 3072K L3. The timings don't seem to follow any pattern, and I can't read the cache sizes or levels from them. Could anybody help? Thanks in advance.

Update: Following @Leeor's advice, I used j * 64 instead of j * 16. New results:

1 time: 1.996282
4 time: 2.002579
8 time: 2.002240
16 time: 1.993198
24 time: 1.995733
32 time: 2.000463
64 time: 1.968637
128 time: 1.956138
256 time: 1.978266
384 time: 1.991912
512 time: 2.192371
1024 time: 2.262387
2048 time: 3.019435
3072 time: 2.359423
4096 time: 5.874426
5120 time: 2.324901
6144 time: 4.135550
7168 time: 3.851972
8192 time: 7.417762
9216 time: 2.272929
10240 time: 3.441985
11264 time: 3.094753

Two peaks, at 4096K and 8192K. Still weird.

Rushton answered 2/10, 2013 at 12:25 Comment(0)

I'm not sure if this is the only problem here, but it's definitely the biggest one: your code quickly triggers the HW stream prefetchers, so you almost always see L1 or L2 latencies.

More details can be found here - http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

For your benchmark, you should either disable the prefetchers (through the BIOS or any other means), or at least make your stride longer: replace j * 16 (16 ints * 4 bytes = 64B, one cache line, a classic unit stride for the stream detector) with j * 64 (4 cache lines). The reason: the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still stay a bit ahead when you jump over 2 lines, but becomes mostly useless with longer jumps (a 3-line stride won't work because of your wrap-around; you need a stride that divides step_size).

Update the questions with the new results and we can figure out if there's anything else here.


EDIT1: OK, I ran the fixed code and got:

1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817

It makes much more sense if you ignore a few rows: the time jumps after 32k (L1 size), but instead of jumping after 256k (L2 size) we get a too-good result for 384k and a jump only at 512k. The last jump is at 8M (my LLC size), but 9M is off again.

This lets us spot the next error: ANDing with the size mask only makes sense when the size is a power of 2. Otherwise you don't wrap around, but instead repeat some of the last addresses again (which gives optimistic results, since they're fresh in the cache).

Try replacing the ... & size_mask with ... % (steps[i] / sizeof(int)); the modulo is more expensive, but if you want to have these sizes you need it (or alternatively, a running index that gets zeroed whenever it exceeds the current size).

Ryon answered 2/10, 2013 at 16:32 Comment(2)
Thank you! I used j * 64 instead, but the results remain confusing. Please see my update.Rushton
@Ryon [Leeor](stackoverflow.com/users/2016408/leeor) Both the original question and your suggestion assume that no matter how much data we access, it is brought into the cache one cache line at a time. So if the cache size is 32 KB or 64 KB and each access only brings in 64 bytes (assuming a 64-byte line), wouldn't the way to determine the cache size be to access a previous block again and check whether it has been evicted or stayed put? Is that mechanism implemented in the C code? I may have missed or misunderstood it.Resolute

I think you'd be better off looking at the CPUID instruction. It's not trivial, but there should be information on the web.

Also, if you're on Windows, you can use the GetLogicalProcessorInformation function. Mind you, it's only present in Windows XP SP3 and above. I know nothing about Linux/Unix.

Oilstone answered 2/10, 2013 at 12:40 Comment(2)
That's a good way. But I truly want to understand how cache really works in detail and why my code behaves like this. Thanks anyway.Rushton
@KanLiu - Have you seen this article? It has all the details, including caches 'n stuff.Oilstone

If you're using GNU/Linux you can just read the contents of /proc/cpuinfo and, for further details, /sys/devices/system/cpu/*. It is common under UNIX not to define an API where a plain file can do the job anyway.

I would also take a look at the source of util-linux; it contains a program named lscpu. This should give you an example of how to retrieve the required information.

// update
http://git.kernel.org/cgit/utils/util-linux/util-linux.git/tree/sys-utils/lscpu.c

I've just taken a look at the source. It basically reads from the files mentioned above, that's all. It is therefore perfectly valid to read from those files yourself; they are provided by the kernel.

Supper answered 2/10, 2013 at 16:51 Comment(1)
Thanks. As I said the sizes and levels of cache doesn't match my code's behavior, and I want to know the reason.Rushton

It's possible using straight C as follows:

cache.c

#include <stdio.h>
#include <cpuid.h>

int main() {
    unsigned int eax, ebx, ecx, edx;
    int cacheType, cacheLevel;
    unsigned int cacheSize, ways, partitions, lineSize, sets;

    for (int i = 0; ; i++) {
        /* The value 0x04 for EAX is a specific function number that instructs 
           the cpuid instruction to get cache parameters for Intel CPUs. */
        __cpuid_count(0x04, i, eax, ebx, ecx, edx);

        cacheType = eax & 0x1F;
        if (cacheType == 0) {
            break; // No more caches
        }

        cacheLevel = (eax >> 5) & 0x7;
        ways = ((ebx >> 22) & 0x3FF) + 1;
        partitions = ((ebx >> 12) & 0x3FF) + 1;
        lineSize = (ebx & 0xFFF) + 1;
        sets = ecx + 1;

        cacheSize = ways * partitions * lineSize * sets;

        printf("Cache Type: %d, Cache Level: L%d, Cache Size: %u bytes\n", cacheType, cacheLevel, cacheSize);
    }

    return 0;
}

For an Intel CPU, the output lists each cache in turn:

The first L1 is the data cache, the second L1 is the instruction cache, followed by L2 and L3 which are unified caches containing data/instructions.

Environmentalist answered 17/8, 2023 at 16:50 Comment(1)
Nice example of the basics. I forget if there have been any CPUs with errata for these CPUID levels. Also, this doesn't tell you which levels are shared or per-core private. Usually L1d/i and L2 are per-core private, and L3 is shared either by all cores on that socket (Intel since i7) or by a cluster of 4 or 8 cores (AMD Zen). A multi-socket system, or a big AMD, can have multiple instances of L3, but the amount reported by CPUID is hopefully the amount that the current core can actually benefit from. Intel E-cores in Alder Lake share an L2 cache between groups of 4 cores.Shimberg
