I am testing the behavior of some intrinsics. I was surprised to notice that _mm_mfence() appears to issue load instructions in user space, yet those loads are not counted as L1 data cache hits, misses, or fill buffer hits. I am using PAPI native events such as MEM_INST_RETIRED and MEM_LOAD_RETIRED to read the performance counters. This piece of code:
    for (int i = 0; i < 1000000; i++) {
        _mm_mfence();
    }
counts ALL_LOADS: 737030, L1_HIT: 99, L1_MISS: 10, FB_HIT: 25, whereas without the mfence the overhead of reading the counters alone looks something like this: ALL_LOADS: 125, L1_HIT: 94, L1_MISS: 11, FB_HIT: 24.
I checked, and sfence and lfence do not have this effect. I am compiling with -O3. From the compiled output I guess that it calls the __builtin_ia32_mfence function, but I could not find much information about it.
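To make the comparison between the three fences concrete, here is a minimal sketch of the kind of loop being measured. It is not the exact test program: `fence_kind` and `run_fence_loop` are hypothetical names introduced only for illustration, and the counter reads (PAPI start/stop in the setup described above) are assumed to wrap each call from the outside.

```c
#include <assert.h>
#include <immintrin.h>  /* _mm_mfence, _mm_sfence, _mm_lfence */

/* Hypothetical selector for which fence the loop executes. */
enum fence_kind { FENCE_NONE, FENCE_MFENCE, FENCE_SFENCE, FENCE_LFENCE };

/*
 * Executes `iterations` fences of the chosen kind and returns how many
 * fences were issued. The caller is assumed to read the performance
 * counters immediately before and after this call.
 */
long run_fence_loop(enum fence_kind kind, long iterations)
{
    assert(iterations >= 0);

    long issued = 0;
    for (long i = 0; i < iterations; i++) {
        switch (kind) {
        case FENCE_MFENCE: _mm_mfence(); break;
        case FENCE_SFENCE: _mm_sfence(); break;
        case FENCE_LFENCE: _mm_lfence(); break;
        case FENCE_NONE:   continue;  /* baseline: no fence, nothing counted */
        }
        issued++;
    }
    return issued;
}
```

Calling `run_fence_loop(FENCE_NONE, 1000000)` gives the baseline for the counter-reading overhead, and the other three variants can then be compared against it one at a time. With -O3 the baseline loop may be optimized away entirely, which matches the "without mfence" measurement above.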
I understand in general what _mm_mfence() does and why we use it, but my question here is about how it works internally. It would be great if anyone could explain this behavior or point me to a related article.