clflush to invalidate cache line via C function
I am trying to use clflush to manually evict a cache line in order to determine the cache and line sizes. I didn't find any guide on how to use that instruction; all I can find is code that uses higher-level functions for that purpose.

There is a kernel function, void clflush_cache_range(void *vaddr, unsigned int size), but I still don't know what to include in my code or how to use it. In particular, I don't know what size means in that function.

More than that, how can I be sure that the line is evicted in order to verify the correctness of my code?

UPDATE:

Here is some initial code for what I am trying to do.

#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  /* FLUSH A LINE */
  /* each element is 4 bytes */
  /* assuming that cache line size is 64 bytes */
  /* array[0] till array[15] is flushed */
  /* even if line size is less than 64 bytes */
  /* we are sure that array[0] has been flushed */
  _mm_clflush( &array[ 0 ] );



  int tm = 0;
  register uint64_t time1, time2, time3;


  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );

  time3 = __rdtscp( &array[ 0 ] ) - time2; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time3 );
  return 0;
}

Before running the code, I would like to manually verify that it is correct. Am I on the right path? Did I use _mm_clflush correctly?

UPDATE:

Thanks to Peter's comment, I fixed the code as follows

  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );
  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time2 );

By running the code multiple times, I get the following output

$ ./flush
miss latency = 238
hit latency = 168
$ ./flush
miss latency = 154
hit latency = 140
$ ./flush
miss latency = 252
hit latency = 140
$ ./flush
miss latency = 266
hit latency = 252

The first run seems reasonable, but the second run looks odd. Every time I run the code from the command line, the array is initialized with the values and then I explicitly evict the first line.

UPDATE4:

I tried Hadi Brais's code and here are the outputs

naderan@webshub:~$ ./flush3
address = 0x7ffec7a92220
array[ 0 ] = 0
miss section latency = 378
array[ 0 ] = 0
hit section latency = 175
overhead latency = 161
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 217 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffedbe0af40
array[ 0 ] = 0
miss section latency = 392
array[ 0 ] = 0
hit section latency = 231
overhead latency = 168
Measured L1 hit latency = 63 TSC cycles
Measured main memory latency = 224 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffead7fdc90
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 252 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffe51a77310
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 182
overhead latency = 161
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 203 TSC cycles

Slightly different latencies are acceptable. However, a hit latency of 63 stands out compared to 21 and 14.

UPDATE5:

As far as I can check on Ubuntu, no power-saving feature is enabled. Maybe frequency scaling is disabled in the BIOS, or there is a misconfiguration

$ cat /proc/cpuinfo  | grep -E "(model|MHz)"
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz         : 2097.571
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz  
cpu MHz         : 2097.571
$ lscpu | grep MHz
CPU MHz:             2097.571

Anyway, that means the frequency is set to its maximum value, which is what I care about. Running multiple times, I see somewhat different values. Are these normal?

$ taskset -c 0 ./flush3
address = 0x7ffe30c57dd0
array[ 0 ] = 0
miss section latency = 602
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 455 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffd16932fd0
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 168
overhead latency = 147
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 252 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffeafb96580
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 161
overhead latency = 140
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffe58291de0
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 168
overhead latency = 140
Measured L1 hit latency = 28 TSC cycles
Measured main memory latency = 217 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7fffa76d20b0
array[ 0 ] = 0
miss section latency = 371
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffdec791580
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 189
overhead latency = 147
Measured L1 hit latency = 42 TSC cycles
Measured main memory latency = 210 TSC cycles
Ishii answered 13/8, 2018 at 8:58 Comment(9)
Your GDB output from disas /m has giant gaps, like from 0x69e to 0x6cd (or about 50 bytes of machine code). According to help disas: Only the main source file is displayed, not those of, e.g., any inlined functions. This modifier hasn't proved useful in practice and is deprecated in favor of /s. _mm_clflush is an inline function. Also you forgot to compile with optimization enabled, so your function is full of wasted instructions. And you're still using the useless _rdtscp( &array[ 0 ] ) thing that does a store to the array after reading the clock.Hamon
@PeterCordes: I wrote UPDATE4. Regarding _rdtscp( &array[ 0 ] ), you say it is not good for my purpose. I read the manual and accept that. However, I didn't find any alternative. Do you mean that __rdtsc, which Hadi Brais used in his code, is the right choice? That is what I understand from your comment.Ishii
Hadi's answer explains why and how he's using a read inside the timed region, with temp = array[0]. It compiles to asm that does what we want (if you use gcc -O3.)Hamon
When you ran Hadi's code, you probably didn't control for CPU frequency scaling. RDTSC counts at a fixed frequency, regardless of the core clock speed. So it's perfectly reasonable to see variations up to a factor of 5 on a 4GHz CPU (rated frequency = reference frequency) that idles at 0.8GHz (the actual frequency when the program first starts). That's why I ran an infinite loop in the background to get my CPU to ramp up to max before running Hadi's code; see my comments under his answer. If you have a Skylake, maybe sometimes your CPU ramped up fast enough to see a lower time.Hamon
What Peter has said is critically important and you should understand it very well. TSC cycles have fixed periods, and so they measure wall clock time. In contrast, core cycles do NOT measure wall clock time under frequency scaling because different cycles have different periods. If the whole program fully runs within the core frequency domain, the core cycle count will be the same each run irrespective of frequency changes. However, the TSC cycle count will be different depending on frequency, because it directly translates into execution time.Yeti
L1 hit latency is on average 4 cycles of core frequency, independent of the frequency. But these 4 cycles can correspond to different numbers of TSC cycles, depending on the frequency. So if you want to get fairly reproducible measurements, you need to fix the frequency. Read the comment at the top of my code very carefully.Yeti
Please see UPDATE5Ishii
@PeterCordes: I appreciate if you see this stackoverflow.com/questions/52083481Ishii
@HadiBrais: I appreciate if you see this stackoverflow.com/questions/52083481Ishii

You have multiple errors in the code that may lead to the nonsensical measurements you're seeing. I've fixed the errors; you can find the explanations in the comments below.

/* compile with gcc at optimization level -O3 */
/* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */ 
/* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
/* you can optionally use a p-state scaling driver other than intel_pstate to get more reproducible results */
/* This code still needs improvement to obtain more accurate measurements,
   and a lot of effort is required to do that—argh! */
/* Specifically, there is no single constant latency for the L1 because of
   the way it's designed, and more so for main memory. */
/* Things such as virtual addresses, physical addresses, TLB contents,
   code addresses, and interrupts may have an impact that needs to be
   investigated */
/* The instructions that GCC puts unnecessarily in the timed section are annoying AF */
/* This code is written to run on Intel processors! */

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];

  /* this is optional */
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  printf( "address = %p \n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */

  _mm_mfence();                      /* prevent clflush from being reordered by the CPU or the compiler in this direction */

  /* flush the line containing the element */
  _mm_clflush( &array[ 0 ] );

  //unsigned int aux;
  uint64_t time1, time2, msl, hsl, osl; /* initial values don't matter */

  /* You can generally use rdtsc or rdtscp.
     See: https://mcmap.net/q/14611/-is-there-any-difference-in-between-rdtsc-lfence-rdtsc-and-rdtsc-rdtscp-in-measuring-execution-time
     I AM NOT SURE THOUGH THAT THE SERIALIZATION PROPERTIES OF
     RDTSCP ARE APPLICABLE AT THE COMPILER LEVEL WHEN USING THE
     __RDTSCP INTRINSIC. THIS IS TRUE FOR PURE FENCES SUCH AS LFENCE. */

  _mm_mfence();                      /* this properly orders both clflush and rdtsc*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
  int temp = array[ 0 ];             /* array[0] is a cache miss */
  /* measuring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load*/
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  msl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );             /* prevent the compiler from optimizing the load */
  printf( "miss section latency = %lu \n", msl );   /* the latency of everything in between the two rdtsc */

  _mm_mfence();                      /* this properly orders both clflush and rdtsc*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
  temp = array[ 0 ];                 /* array[0] is a cache hit as long as the OS, a hardware prefetcher, or a speculative accesses to the L1D or lower level inclusive caches don't evict it */
  /* measuring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  hsl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );            /* prevent the compiler from optimizing the load */
  printf( "hit section latency = %lu \n", hsl );   /* the latency of everything in between the two rdtsc */


  _mm_mfence();                      /* this properly orders both clflush and rdtsc */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  osl = time2 - time1;

  printf( "overhead latency = %lu \n", osl ); /* the latency of everything in between the two rdtsc */


  printf( "Measured L1 hit latency = %lu TSC cycles\n", hsl - osl ); /* hsl is always larger than osl */
  printf( "Measured main memory latency = %lu TSC cycles\n", msl - osl ); /* msl is always larger than osl and hsl */

  return 0;
}

Highly recommended: Memory latency measurement with time stamp counter.

Related: How can I create a spectre gadget in practice?.

Yeti answered 13/8, 2018 at 21:41 Comment(27)
rdtscp doesn't need a preceding lfence, that's why the OP was using it instead of rdtsc. All previous instructions have to execute before it samples the time. (But it doesn't necessarily make later instructions wait for that to happen.)Hamon
I think you want volatile int array [100]; to measure read-miss latency. If the compiler inlines _mm_clflush, the address never escapes the function, so it's not necessarily ordered by a full compiler memory barrier like _mm_mfence or asm("":::"memory");. So it might CSE away the 2nd load, or move the first out of the timing interval. Also, array[0] might be in the same cache line as some other stack locals that compiler-generated code touches. So it might be made hot again before the read. array[32] is probably a better bet with sizeof(int)==4; plenty far from the ends.Hamon
@PeterCordes I could not get rdtscp to work reliably due to the store that it performs and other instructions that the compiler puts after it.Yeti
@PeterCordes Good point. Would the compiler consider clflush on array[0] as an access to array if it was marked volatile? Can it recognize that? It's probably safer to use mfence just before the flush.Yeti
Of course you can't use _mm_rdtscp()'s output operand as the array access (because the asm instruction has a register destination, so the store is outside the timed interval), but you can use a dummy output and use time2 = _mm_rdtscp(&dummy) instead of lfence; time2=rdtsc.Hamon
You're asking if the compiler would do a dummy load after _mm_clflush(&array[32])? I don't think so; but you might get a warning about discarding volatile in the conversion from volatile int* to void *. There's definitely no dereference of a volatile int* in clflush.Hamon
@PeterCordes so making array volatile would not prevent _mm_clflush from being reordered in the direction away from rdtsc, potentially with other accesses to array, right? In that case, why would making array volatile useful? Instead, a fence should be placed just before the flush instruction. Even the CPU might reorder the flush in that direction.Yeti
@PeterCordes Regarding rdtscp, I was talking about the Aux output of rdtscp. The dummy variable would be allocated from memory and so there would be a store in the time interval and the compiler is putting there other instructions too, disturbing the measurement.Yeti
rdtscp isn't useful for the start of the interval anyway, because it can (in theory) reorder with later loads/stores. (And unfortunately gcc/clang fail to optimize away the unused output, and actually store ecx to the stack after rdtscp). But it's fine for the end of the interval: it can't sample the clock until earlier loads are globally visible (i.e. have taken their value from L1d). It doesn't matter what the compiler does with ECX after rdtscp, because that's outside the timed interval.Hamon
See godbolt.org/g/KL2fAY (I also tidied up some comments, e.g. the 2nd lfence mentions mfence). We get rdtsc; mov/shift/or; lfence; mov esi, dword ptr [rsp + 144]; rdtscp, so only LFENCE + extra ALU ops in the timed intervals. With the CPU core clock frequency ~= reference frequency, SKL gives 44c hit, 43c overhead, and 215c miss. So ` L1 hit latency = 1 TSC cycles` and main memory latency = 172 TSC cycles. On multiple runs, measured L1 latency varies from 7 to -1 cycles...Hamon
Your version with lfence+rdtsc at the end of timing intervals bounces around from 5 to 16 reference cycles for L1 hit latency (again with a busy loop running on another core to peg the frequency). So there is a difference. Maybe yours is actually better, but IDK. The difference between cache hit or miss is about the same, as expected, but very noisy.Hamon
@PeterCordes It does not make sense for the L1 hit latency to be non-positive. That means there is too much noise. This is what I (and the OP) have observed when using rdtscp. On my Haswell, when using rdtsc, the L1 hit latency ranges between 3-6 reference cycles on multiple runs. That makes sense.Yeti
Yup, it looks like rdtscp is noisier than lfence+rdtsc. From the ISA reference manual, I expected it to be about the same, so I'm not sure what's different. I don't think it's the output operand that's the problem, though. rdtscp is 2 more uops than rdtsc, but lfence itself is only 2 uops. Anyway, on SKL I'm getting much more variability than you describe for L1d.Hamon
I would like to ask: what is the point of using -O3? Using that may change the order of the instructions generated by the compiler, so there it makes sense that fence instructions are used. However, I didn't use such options; I compiled with gcc -o flush flush.c. So, I don't think the fences are required at all! Am I right? Any idea?Ishii
@Ishii Using -O3 helps reduce the amount of noise inside the timed section of the code by removing expensive instructions. You can emit the binary using -O3 and -O0 and compare the assembly code and see the difference. Fences are required not just for the compiler (when optimizations are used), but also for the CPU itself. You cannot turn off the optimizations that the CPU itself performs. So the fences are critical to obtain a reliable measurement. You can do slightly better if you write the whole code in assembly instead of C, because there you have absolute control over the timed section.Yeti
Each fence has a purpose as explained in the comments in the code.Yeti
@HadiBrais: Please see UPDATE4 in the first post.Ishii
1) int temp = array[ 0 ]; should be a read miss since the value of array[0] is read. Why did you consider that as write miss? 2) Why didn't you use _mm_mfence for time2? Is that redundant if I use that? Or it has side effects? 3) temp = array[ 0 ]; should be hit but your wrote "cache miss" in the comments.Ishii
@Ishii 1) Yes. The comment about the write miss applies to the code in the question, not my code. 2) _mm_mfence is redundant in that place for this particular code. It may add more overhead. 3) Fixed.Yeti
@HadiBrais: What happens if I remove mfence or replace that with lfence? According to the manual Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction So, I guess you can put an lfence before the clflush instead of using mfence. Isn't that correct?Ishii
@Ishii Both mfence and lfence order clflush on Intel processors. However, mfence flushes the store buffer as well. But here we have no writes to array[0], so I don't think that would make a difference. However, on most Intel processors, mfence is cheaper than lfence. But mfence is used only once before clflush, so that wouldn't matter that much. I think it's OK to replace the mfence before clflush with lfence in this code.Yeti
@HadiBrais: Why did you include two lfence in the osl section? Because there are two lfence before and after temp = array[ 0 ];?Ishii
@PeterCordes and HadiBrais: Please see the new topic at stackoverflow.com/questions/51963834 I appreciate your comments.Ishii
@HadiBrais Could you explain why we need to pin the CPU frequency to the maximum? Is it because the TSC runs at the nominal frequency of the core? Is it because, while the TSC increases by one at every tick of a clock running at the nominal CPU frequency, say 3.4GHz, the actual frequency of the cores can vary depending on processor states, and can yield different time measurements?Twila
@JaehyukLee I meant setting the core frequency to be equal to the TSC frequency (which is, usually, approximately equal to the nominal frequency). This makes it easier to understand TSC measurements.Yeti
@PeterCordes What does "CSE" mean in your comment? I tried googling but failed.Sil
@zgc: en.wikipedia.org/wiki/Common_subexpression_eliminationHamon

You know you can query the line size with cpuid, right? Do that if you actually want to find it programmatically. (Otherwise, assume it's 64 bytes, because it is on everything after PIII.)

But sure, if you want to use clflush or clflushopt from C for whatever reason, use void _mm_clflush(void const *p) or void _mm_clflushopt(void const *p), from #include <immintrin.h>. (See Intel's insn set ref manual entry for clflush or clflushopt).

GCC, clang, ICC, and MSVC all support Intel's <immintrin.h> intrinsics.


You could also have found this by searching Intel's intrinsics guide for clflush to find definitions for the intrinsics for that instruction.

See also https://stackoverflow.com/tags/x86/info for more links to guides, docs, and reference manuals.


More than that, how can I be sure that the line is evicted in order to verify the correctness of my code?

Look at the compiler's asm output, or single-step it in a debugger. If/when clflush executes, that cache line is evicted at that point in your program.

Hamon answered 13/8, 2018 at 9:6 Comment(14)
Are these valid functions in gcc? Or are they specific to the Intel compiler?Ishii
@mahmood. All 4 mainstream x86 compilers support Intel's intrinsics in <immintrin.h>. gcc, clang, ICC, and MSVC.Hamon
I think I made some progress. Please see the updated post.Ishii
I ran the code and got miss latency = 252 \n hit latency = 1329294496013. Any idea?Ishii
@mahmood: your 2nd interval includes a printf! Also, rdtscp samples the clock before storing to the output operand. software.intel.com/sites/landingpage/IntrinsicsGuide/…. So the cache miss is part of the 2..3 interval.Hamon
Thanks, I got it. However, as you can see in the updated post, multiple runs of the program yield good/odd output. I wonder whether that behavior is related to rdtscp.Ishii
@mahmood: Read my whole comment: the store miss happens outside the timing interval (compiler-generated mov store after the rdtscp instruction), and the store buffer hides latency of store misses from out-of-order execution.Hamon
Is this valid for an AMD processor? (Zen2)Bremser
@onlycparra: "this" being that line size = 64 bytes, and that you can query it via CPUID? Yes, both true on Zen2 and every other modern x86 CPU since maybe Core2 or so. (Older Intel used to use 32-byte lines, but 64 bytes is a good fit for DDR SDRAM max burst size, and for common software usage patterns and so on.)Hamon
Thanks for the quick response, @PeterCordes :) I meant flushing cache lines. Are clflush/clflushopt exclusive to Intel? I want to accomplish this, but on a Zen2Bremser
@onlycparra: clflush has existed since about SSE2, but has its own CPUID feature flag. So does clflushopt. en.wikichip.org/wiki/amd/microarchitectures/zen_2 confirms that it has the CLFLUSHOPT feature, or you could look at CPUID dumps on instlatx64.atw.hu for any particular Zen2 CPU.Hamon
@PeterCordes Yes! I see it with lscpu | grep clflush. Thanks.Bremser
@PeterCordes Lastly, what should I use to flush many addresses instead of just one? (for example, a block of x elements starting at *ptr)Bremser
@onlycparra: clflushopt in a loop. (With one SFENCE after, if you care about it being ordered wrt. later stores). (e.g. the Linux kernel function clflush_cache_range. See also Is there a way to flush the entire CPU cache related to a program?)Hamon
