I am trying to use clflush
to manually evicts a cache line in order to determine cache and line sizes. I didn't find any guide on how to use that instruction. All I see, are some codes that use higher level functions for that purpose.
There is a kernel function void clflush_cache_range(void *vaddr, unsigned int size)
, but still I don't know what to include in my code and how to use that. I don't know what is the size
in that function.
More than that, how can I be sure that the line is evicted in order to verify the correctness of my code?
UPDATE:
Here is a initial code for what I am trying to do.
#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
int array[ 100 ];
/* will bring array in the cache */
for ( int i = 0; i < 100; i++ )
array[ i ] = i;
/* FLUSH A LINE */
/* each element is 4 bytes */
/* assuming that cache line size is 64 bytes */
/* array[0] till array[15] is flushed */
/* even if line size is less than 64 bytes */
/* we are sure that array[0] has been flushed */
_mm_clflush( &array[ 0 ] );
int tm = 0;
register uint64_t time1, time2, time3;
time1 = __rdtscp( &tm ); /* set timer */
time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
printf( "miss latency = %lu \n", time2 );
time3 = __rdtscp( &array[ 0 ] ) - time2; /* array[0] is a cache hit */
printf( "hit latency = %lu \n", time3 );
return 0;
}
Before running the code, I would like to manually verify that it is a correct code. Am I in the correct path? Did I use _mm_clflush
correctly?
UPDATE:
Thanks to Peter's comment, I fixed the the code as follows
time1 = __rdtscp( &tm ); /* set timer */
time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
printf( "miss latency = %lu \n", time2 );
time1 = __rdtscp( &tm ); /* set timer */
time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache hit */
printf( "hit latency = %lu \n", time1 );
By running the code multiple times, I get the following output
$ ./flush
miss latency = 238
hit latency = 168
$ ./flush
miss latency = 154
hit latency = 140
$ ./flush
miss latency = 252
hit latency = 140
$ ./flush
miss latency = 266
hit latency = 252
The first run seems to be reasonable. But the second run looks odd. By running the code from the command line, every time the array is initialized with the values and then I explicitly evict the first line.
UPDATE4:
I tried Hadi-Brais code and here are the outputs
naderan@webshub:~$ ./flush3
address = 0x7ffec7a92220
array[ 0 ] = 0
miss section latency = 378
array[ 0 ] = 0
hit section latency = 175
overhead latency = 161
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 217 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffedbe0af40
array[ 0 ] = 0
miss section latency = 392
array[ 0 ] = 0
hit section latency = 231
overhead latency = 168
Measured L1 hit latency = 63 TSC cycles
Measured main memory latency = 224 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffead7fdc90
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 252 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffe51a77310
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 182
overhead latency = 161
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 203 TSC cycles
Slightly different latencies are acceptable. However hit latency of 63 compared to 21 and 14 is also observable.
UPDATE5:
As I checked the Ubuntu, there is no power saving feature enabled. Maybe the frequency change is disabled in the bios, or there is a miss configuration
$ cat /proc/cpuinfo | grep -E "(model|MHz)"
model : 79
model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz : 2097.571
model : 79
model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz : 2097.571
$ lscpu | grep MHz
CPU MHz: 2097.571
Anyway, that means the frequency is set to its maximum value which is what I have to care. By running multiple times, I see some different values. Are these normal?
$ taskset -c 0 ./flush3
address = 0x7ffe30c57dd0
array[ 0 ] = 0
miss section latency = 602
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 455 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffd16932fd0
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 168
overhead latency = 147
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 252 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffeafb96580
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 161
overhead latency = 140
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffe58291de0
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 168
overhead latency = 140
Measured L1 hit latency = 28 TSC cycles
Measured main memory latency = 217 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7fffa76d20b0
array[ 0 ] = 0
miss section latency = 371
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffdec791580
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 189
overhead latency = 147
Measured L1 hit latency = 42 TSC cycles
Measured main memory latency = 210 TSC cycles
disas /m
has giant gaps, like from0x69e
to0x6cd
(or about 50 bytes of machine code). According tohelp disas
: Only the main source file is displayed, not those of, e.g., any inlined functions. This modifier hasn't proved useful in practice and is deprecated in favor of /s._mm_clflush
is an inline function. Also you forgot to compile with optimization enabled, so your function is full of wasted instructions. And you're still using the useless_rdtscp( &array[ 0 ] )
thing that does a store to the array after reading the clock. – Hamon_rdtscp( &array[ 0 ] )
, you say that it is not good for my purpose. I read the manual and accept that. However, I didn't find any alternative for that. Do you mean that__rdtsc
which Hadi-Brais used in his code is the right choice? I understand that from your comment about that. – Ishiitemp = array[0]
. It compiles to asm that does what we want (if you usegcc -O3
.) – Hamon