Using time stamp counter and clock_gettime for cache miss

As a follow-up to this topic, in order to calculate the memory miss latency, I have written the following code using _mm_clflush, __rdtsc and _mm_lfence (which is based on the code from this question/answer).

As you can see in the code, I first load the array into the cache. Then I flush one element, so its cache line is evicted from all cache levels. I put _mm_lfence in order to preserve the ordering when compiling with -O3.

Next, I used the time stamp counter to calculate the latency of reading array[0]. As you can see, between the two time stamps there are three instructions: two lfence and one read. So, I have to subtract the lfence overhead. The last section of the code calculates that overhead.

At the end of the code, the overhead and miss latency are printed. However, the result is not valid!

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>
int main()
{
    int array[ 100 ];
    for ( int i = 0; i < 100; i++ )
            array[ i ] = i;
    uint64_t t1, t2, ov, diff;

    _mm_lfence();
    _mm_clflush( &array[ 0 ] );
    _mm_lfence();

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    int tmp = array[ 0 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();

    diff = t2 - t1;
    printf( "diff is %lu\n", diff );

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;
    printf( "lfence overhead is %lu\n", ov );
    printf( "miss cycles is %lu\n", diff-ov );

    return 0;
}

However, the output is not valid

$ gcc -O3 -o flush1 flush1.c
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 147
miss cycles is 14
$ taskset -c 0 ./flush1
diff is 161
lfence overhead is 154
miss cycles is 7
$ taskset -c 0 ./flush1
diff is 147
lfence overhead is 154
miss cycles is 18446744073709551609

Any thoughts?

Next, I tried the clock_gettime function in order to calculate the miss latency, as below:

    _mm_lfence();
    _mm_clflush( &array[ 0 ] );
    _mm_lfence();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    _mm_lfence();
    int tmp = array[ 0 ];
    _mm_lfence();
    clock_gettime(CLOCK_MONOTONIC, &end);
    diff = 1000000000 * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
    printf("miss elapsed time = %lu nanoseconds\n", diff);

The output is miss elapsed time = 578 nanoseconds. Is that reliable?

UPDATE1:

Thanks to Peter and Hadi, to summarize the responses so far, I found out that:

1- Unused variables are eliminated during optimization, and that was the reason for the weird values I saw in the output. Thanks to Peter's reply, there are some ways to fix that.

2- clock_gettime does not have enough resolution for this; it is better suited to measuring larger delays.

As a workaround, I tried to bring the array into the cache and then flush all elements to be sure that every element is evicted from all cache levels. Then I measured the latency of array[0] and then array[20]. Since each element is 4 bytes, the distance is 80 bytes, so I expect to get two cache misses. However, the latency of array[20] is similar to a cache hit. A safe guess is that a cache line is not 80 bytes long, so maybe array[20] is prefetched by hardware. I also see some odd results again, though not always:

    for ( int i = 0; i < 100; i++ ) {
            _mm_lfence();
            _mm_clflush( &array[ i ] );
            _mm_lfence();
    }

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    int tmp = array[ 0 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    diff1 = t2 - t1;
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );

    _mm_lfence();
    t1 = __rdtsc();
    tmp = array[ 20 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    diff2 = t2 - t1;
    printf( "tmp is %d\ndiff2 is %lu\n", tmp, diff2 );

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;
    printf( "lfence overhead is %lu\n", ov );
    printf( "TSC1 is %lu\n", diff1-ov );
    printf( "TSC2 is %lu\n", diff2-ov );

Output is

$ ./flush1
tmp is 0
diff1 is 371
tmp is 20
diff2 is 280
lfence overhead is 147
TSC1 is 224
TSC2 is 133
$ ./flush1
tmp is 0
diff1 is 399
tmp is 20
diff2 is 280
lfence overhead is 154
TSC1 is 245
TSC2 is 126
$ ./flush1
tmp is 0
diff1 is 392
tmp is 20
diff2 is 840
lfence overhead is 147
TSC1 is 245
TSC2 is 693
$ ./flush1
tmp is 0
diff1 is 364
tmp is 20
diff2 is 140
lfence overhead is 154
TSC1 is 210
TSC2 is 18446744073709551602

The statement that "HW prefetcher brings other blocks" is about 80% correct then. What is going on? Is there a more accurate explanation?

Flamenco asked 22/8, 2018 at 9:32 (13 comments)
between two time stamps, there are three instructions. No, there are three C statements. You didn't show the asm the compiler emitted. There will be at least two extra mov instructions after the first rdtsc but before the 2nd, to save the results in registers that won't be overwritten. Or a shift + OR + mov, or shift + LEA. Get CPU cycle count?. Since you compiled with optimization enabled, that's probably all (and will run in parallel with the load), but don't make the mistake of talking about C like it's actual asm.Rasla
The compiler can optimize away int tmp = array[ 0 ] because tmp is not used. Check the generated assembly code. Also, make sure you fix the core frequency.Trinia
clock_gettime cannot be used to measure the perf of such tiny pieces of code because its own overhead can be relatively massive. Also, it only has nanosecond granularity.Trinia
@HadiBrais: But clock_gettime is said to be the most precise method. What I understand from your comment is that measuring a cache miss/hit from the application layer is not possible. All we can find is the TSC value, which is also not convertible to time due to many issues.Flamenco
Measuring a single cache miss is really hard. Much better to create a dependency chain of cache misses that HW prefetching can't handle, e.g. a very long linked list with the elements randomly scattered in different cache lines (at least 2 cache lines apart to defeat the L2 spatial prefetcher). Then you time a whole loop that traverses it, or see how many iterations you do in 1 second (with a circular list for example); see the pointer-chasing sketch after these comments. Measuring cache-miss latency is a fairly well-studied problem, google for what other people have done.Rasla
@PeterCordes: You mentioned prefetching, which I was planning to discuss later. The question is, how can we be sure that the HW prefetcher doesn't bring array[0] back in after _mm_clflush and prior to int tmp = array[0]?Flamenco
Because there's nothing that could trigger it before the demand-load runs. It's not magic, see Intel's optimization manual for descriptions of what triggers the L1d / L2 prefetchers. But that's a moot point if you construct a giant linked list and traverse it instead of trying to measure a single access on its own.Rasla
CLOCK_MONOTONIC is generally good for measuring the perf of code that takes more than a microsecond. But you're measuring something that may take less than 100 nanoseconds. Check this answer on how to convert TSC cycle counts to SI time units.Trinia
Re: update: The L2 spatial prefetcher may be generating a request for the adjacent cache line while the first cache line is still in flight.Rasla
So, how exactly do cache attacks, e.g. Meltdown and Spectre, overcome such an issue? Basically they have to disable the HW prefetcher, since they try to measure adjacent addresses in order to find out whether they hit or miss. I am not going to attack, because there are some kernel updates to patch them.Flamenco
@mahmood: didn't see your reply because you didn't @ notify me. The cache-read side-channel as part of a Meltdown or Spectre attack typically uses a stride large enough that HW prefetching can't detect the access pattern, e.g. on separate pages instead of contiguous lines. One of the first google hits for meltdown cache read prefetch stride was medium.com/@mattklein123/…, which uses a stride of 4096. It could be tougher for Spectre, because your stride is at the mercy of the "gadgets" you can find in the target process.Rasla
@peter-cordes: Thanks for the help. I really appreciate that.Flamenco
You can certainly measure cache misses with (most implementations of) clock_gettime. Such methods usually take about 20-30 ns, with nanosecond resolution, and cache misses take between 50-100 ns. Even if the clock calls took longer than a single cache miss you could determine the difference between a cache hit and a miss if their timing was stable. It's harder to measure such short events, but it can be done. Don't handicap yourself unnecessarily though, use rdtsc.Zedoary
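
To make the pointer-chasing suggestion from the comments concrete, here is a minimal standalone sketch (the node count, spacing, and iteration count are arbitrary choices, not something from the discussion). The nodes are linked into one shuffled cycle, spaced at least two cache lines apart, so every load depends on the previous one and the hardware prefetchers get no useful pattern:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define NODES   (1 << 20)      /* working set of ~128 MiB, far larger than the caches */
    #define SPACING 128            /* nodes 2 cache lines apart, to defeat the L2 spatial prefetcher */

    int main( void )
    {
        char *buf = malloc( (size_t)NODES * SPACING );
        uint32_t *order = malloc( NODES * sizeof *order );

        for ( uint32_t i = 0; i < NODES; i++ )
                order[ i ] = i;
        for ( uint32_t i = NODES - 1; i > 0; i-- ) {    /* Fisher-Yates shuffle */
                uint32_t j = rand() % ( i + 1 );
                uint32_t t = order[ i ]; order[ i ] = order[ j ]; order[ j ] = t;
        }
        for ( uint32_t i = 0; i < NODES; i++ )          /* link the shuffled nodes into one cycle */
                *(void **)( buf + (size_t)order[ i ] * SPACING ) =
                        buf + (size_t)order[ ( i + 1 ) % NODES ] * SPACING;

        void *p = buf + (size_t)order[ 0 ] * SPACING;
        uint64_t iters = 10000000;
        uint64_t t1 = __rdtsc();
        for ( uint64_t i = 0; i < iters; i++ )
                p = *(void **)p;                        /* each load depends on the previous one */
        uint64_t t2 = __rdtsc();

        printf( "p is %p, average cycles per load: %lu\n", p, ( t2 - t1 ) / iters );
        return 0;
    }

Dividing (t2 - t1) by iters gives an average in reference cycles per dependent load, and printing p keeps the chain from being optimized away.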

You broke Hadi's code by removing the read of tmp at the end, so it gets optimized away by gcc. There is no load in your timed region. C statements are not asm instructions.

Look at the compiler-generated asm, e.g. on the Godbolt compiler explorer. You should always be doing this when you're trying to microbenchmark really low-level stuff like this, especially if your timing results are unexpected.

    lfence
    clflush [rcx]
    lfence

    lfence
    rdtsc                     # start of first timed region
    lfence
       # nothing because tmp=array[0] optimized away.
    lfence
    mov     rcx, rax
    sal     rdx, 32
    or      rcx, rdx
    rdtsc                     # end of first timed region
    mov     edi, OFFSET FLAT:.LC2
    lfence

    sal     rdx, 32
    or      rax, rdx
    sub     rax, rcx
    mov     rsi, rax
    mov     rbx, rax
    xor     eax, eax
    call    printf

You get a compiler warning about an unused variable from -Wall, but you can silence that warning in ways that still let the code optimize away. e.g. your tmp++ doesn't make tmp available to anything outside the function, so it still optimizes away. Silencing the warning is not sufficient: print the value, return the value, or assign it to a volatile variable outside the timed region. (Or use inline asm volatile to require the compiler to have it in a register at some point. Chandler Carruth's CppCon2015 talk about using perf mentions some tricks: https://www.youtube.com/watch?v=nXaxk27zwlk)
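
As a concrete illustration of that last option, here is a minimal sketch of the empty-inline-asm trick in GNU C (the helper name use_value is made up for this example, it's not from the talk or from Hadi's answer):

    /* empty asm that claims to read v from a register: the compiler has to
       materialize the value, but no extra instruction is emitted */
    static inline void use_value( int v )
    {
        asm volatile( "" : : "r"(v) );
    }

Calling use_value(tmp) somewhere after the load counts as a use, so the compiler can't drop the load, and nothing has to be printed inside or near the timed region.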


In GNU C (at least with gcc and clang -O3), you can force a read by casting to (volatile int*), like this:

// int tmp = array[0];           // replace this
(void) *(volatile int*)array;    // with this

The (void) is to avoid a warning for evaluating an expression in a void context, like writing x;.

This kind of looks like strict-aliasing UB, but my understanding is that gcc defines this behaviour. The Linux kernel casts a pointer to add a volatile qualifier in its ACCESS_ONCE macro, so it's used in one of the codebases that gcc definitely cares about supporting. You could always make the whole array volatile; it doesn't matter if initialization of it can't auto-vectorize.

Anyway, this compiles to

    # gcc8.2 -O3
    lfence
    rdtsc
    lfence
    mov     rcx, rax
    sal     rdx, 32
    mov     eax, DWORD PTR [rsp]    # the load which wasn't there before.
    lfence
    or      rcx, rdx
    rdtsc
    mov     edi, OFFSET FLAT:.LC2
    lfence

Then you don't have to mess around with making sure tmp is used, or with worrying about dead-store elimination, CSE, or constant-propagation. In practice the _mm_mfence() or something else in Hadi's original answer included enough memory-barriering to make gcc actually redo the load for the cache-miss + cache-hit case, but it easily could have optimized away one of the reloads.


Note that this can result in asm that loads into a register but never reads it. Current CPUs do still wait for the result (especially if there's an lfence), but overwriting the result could let a hypothetical CPU discard the load and not wait for it. (It's up to the compiler whether it happens to do something else with the register before the next lfence, like mov part of the rdtsc result there.)

This is tricky / unlikely for hardware to do, because the CPU has to be ready for exceptions; see discussion in comments here. RDRAND reportedly does work that way (What is the latency and throughput of the RDRAND instruction on Ivy Bridge?), but that's probably a special case.

I tested this myself on Skylake by adding an xor eax,eax to the compiler's asm output, right after the mov eax, DWORD PTR [rsp], to kill the result of the cache-miss load. That didn't affect the timing.

Still, this is a potential gotcha with discarding the results of a volatile load; future CPUs might behave differently. It might be better to sum the load results (outside the timed region) and assign them at the end to a volatile int sink, in case future CPUs start discarding uops that produce unread results. But still use volatile for the loads to make sure they happen where you want them.
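
A rough sketch of that pattern, reusing array, t1/t2 and the flush code from the question (sink and times are illustrative names, and the assumption is that both elements were flushed beforehand):

    volatile int sink;                 /* lives outside the timed regions */
    uint64_t times[ 2 ];
    int sum = 0;

    for ( int i = 0; i < 2; i++ ) {
            _mm_lfence();
            t1 = __rdtsc();
            _mm_lfence();
            int tmp = *(volatile int*)&array[ i * 20 ];   /* volatile load: can't be optimized away */
            _mm_lfence();
            t2 = __rdtsc();
            _mm_lfence();
            times[ i ] = t2 - t1;
            sum += tmp;                /* consume the result, but outside the timed region */
    }
    sink = sum;                        /* publish the sum once, after the last timed region */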


Also don't forget to do some kind of warm-up loop to get the CPU up to max speed, unless you want to measure the cache-miss execution time at idle clock speed. It looks like your empty timed region is taking a lot of reference cycles, so your CPU was probably clocked down pretty slow.
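
For example, a crude warm-up (a hypothetical snippet; the iteration count is a guess and should be tuned for the machine):

    /* spin on trivial work long enough for the core to ramp up from its
       idle clock before any timed region runs */
    volatile uint64_t warm = 0;
    for ( uint64_t i = 0; i < 200000000; i++ )
            warm += i;

The exact work doesn't matter, as long as it runs long enough (tens of milliseconds) for the frequency to ramp up.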


So, how exactly do cache attacks, e.g. Meltdown and Spectre, overcome such an issue? Basically they have to disable the HW prefetcher, since they try to measure adjacent addresses in order to find out whether they hit or miss.

The cache-read side-channel as part of a Meltdown or Spectre attack typically uses a stride large enough that HW prefetching can't detect the access pattern. e.g. on separate pages instead of contiguous lines. One of the first google hits for meltdown cache read prefetch stride was https://medium.com/@mattklein123/meltdown-spectre-explained-6bc8634cc0c2, which uses a stride of 4096. It could be tougher for Spectre, because your stride is at the mercy of the "gadgets" you can find in the target process.
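
To make that concrete, here is a hedged sketch of what such a large-stride timing probe can look like (the 4096-byte stride follows the linked write-up; SLOTS, the probe array, and the helper name are illustrative, not code from any real exploit):

    #include <stdint.h>
    #include <x86intrin.h>

    #define SLOTS  256
    #define STRIDE 4096                /* one slot per page: too sparse for the HW prefetchers */

    static uint8_t probe[ SLOTS * STRIDE ];

    /* time a single load with the same lfence/rdtsc pattern used above */
    static inline uint64_t time_load( volatile uint8_t *p )
    {
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        (void) *p;                     /* the probed access */
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        _mm_lfence();
        return t1 - t0;
    }

Timing time_load( &probe[ i * STRIDE ] ) for each slot and treating a short time as "this line was already cached" gives the hit/miss classification without ever creating a prefetch-friendly access pattern.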

Rasla answered 23/8, 2018 at 1:22 (10 comments)
Even in Godbolt, when I write int tmp=array[0]; tmp++;, the assembly code in the middle of the page shows clflush lfence lfence rdtsc lfence lfence and then some mov instructions. You can check that.Flamenco
@mahmood: yeah, because the tmp++ result is still unused. I updated my answer to explain more clearly what will work and what won't. Why don't you want to use *(volatile int*) array, which my answer shows does work without having to print any extra stuff?Rasla
(void) *(volatile int*)array; works but I was trying to understand the logic step by step. Please see UPDATE 1 in the first post.Flamenco
I think there is still a problem. The compiler seems to insist on emitting two stores inside the first timed section. This would definitely impact the measurement. See the code.Trinia
@HadiBrais: huh, that's weird. clang doesn't do that. I guess the _mm_lfence() is making gcc want to spill a temporary or something? IDK why it stores the same thing twice. At least the lfence doesn't wait for the store to commit, only for it to be written into the store buffer. There's no mfence in the timed region. The main concern is more accesses to stack memory where the array lives, not so much the store instruction itself. There's already a shift and or in the timed region.Rasla
@PeterCordes Yeah, what is gcc doing creating stores out of thin air like that? Makes no sense.Trinia
Looks like a "missed-optimization" bug that should get reported. Do you have time to file it, or should I? It doesn't happen with a simpler version only doing 1 miss and 1 hit: godbolt.org/z/KAbwWO. (The Godbolt link from my answer, with the bug fixed using (void) *(volatile int*)(array+30). i.e. the same thing I showed asm output for in the 2nd section. I still get that without extra stores.) So something is weird with a larger function.Rasla
Regarding the other three non-memory instructions mov/sal/or, aren't these a bug too? IDK whether these instructions are useful or redundant, but if they are needed, then they are probably used for the printf call. I think the compiler is not supposed to schedule instructions across lfence.Trinia
I used clang to compile the code and got the same measurements. These two stores don't seem to have any impact that matters.Trinia
@HadiBrais: I don't think gcc "knows" about lfence's instruction-serialization effect. The reorder-blocking is probably limited to loads/stores to non-local variables, like for reordering across a non-inline function call or asm("":::"memory"). It would be legal for the compiler to just mov to 2 other registers, and do the shift/or outside the timed region. (Or if the final time delta was only 32 bit, to optimize away doing anything with the high half at all, saving instructions). I notice clang chooses to do the shift/or right after rdtsc.Rasla
