I am trying to find a configuration or memory access pattern for which Intel's clwb instruction does not invalidate the cache line. I am testing on an Intel Xeon Gold 5218 processor with NVDIMMs, running Linux 5.4.0-3-amd64. I tried Device-DAX mode, mapping the character device directly into the address space. I also tried adding the non-volatile memory as a new NUMA node and binding memory to it with numactl --membind.
In both cases, when I issue clwb on a cached address, the line is evicted. I observe the eviction with PAPI hardware counters, with the prefetchers disabled.
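This is the simple loop I am testing. Both array and tmp are declared volatile, so the loads are actually executed.

    for (int i = 0; i < arr_size; i++) {
        tmp = array[i];                /* first load: brings the line into the cache */
        _mm_clwb((void *)&array[i]);   /* write the line back; the cast drops volatile */
        _mm_mfence();                  /* wait for the write-back before reloading */
        tmp = array[i];                /* second load: should hit if the line was kept */
    }
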
Both loads miss in the cache.
Has anyone else found a configuration or memory access pattern that leaves the line in the cache after clwb?
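As a cross-check that does not depend on PAPI, the reload can also be timed directly. The following is only a minimal sketch, assuming GCC or Clang on x86-64; has_clwb and reload_ticks_after_clwb are hypothetical helper names, and timing with rdtsc is a swapped-in technique, not the counter-based method described above. A tick count close to the DRAM/NVDIMM latency would suggest the line was evicted, while a small count would suggest it stayed cached.

```c
#include <cpuid.h>       /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>   /* _mm_clwb, _mm_mfence, _mm_lfence */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

/* CPUID.(EAX=7,ECX=0):EBX bit 24 reports CLWB support. */
static int has_clwb(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 24) & 1;
}

/* Hypothetical helper: TSC ticks spent on the reload that follows
   clwb + mfence. Returns 0 when the CPU does not report CLWB. */
__attribute__((target("clwb")))
static uint64_t reload_ticks_after_clwb(volatile int *p)
{
    if (!has_clwb())
        return 0;

    int tmp = *p;              /* first load: bring the line into the cache */
    _mm_clwb((void *)p);       /* write back; the cast drops volatile */
    _mm_mfence();

    _mm_lfence();              /* keep rdtsc from running ahead of earlier work */
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    tmp = *p;                  /* second load: the one being timed */
    _mm_lfence();
    uint64_t t1 = __rdtsc();

    (void)tmp;
    return t1 - t0;
}
```

A single sample is noisy, so in practice the helper would be called over many distinct lines and the distribution compared against a run without the clwb.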
clwb, your test measures near-zero cache misses? That would rule out a testing error. – Hypothalamus
clwb in the first CPUs to support persistent memory so future libraries can use it without having to do dynamic dispatch based on CPUID, instead of waiting to introduce the instruction with CPUs that support it properly (no eviction). It'll make it much nicer in the long term once there are CPUs that support it. Thanks for posting about this SKX behaviour; like you I'd been assuming CLWB would do what it's designed for. Hopefully it's implemented soon, like Ice Lake. (If that even counts as soon for non-laptops...) – Hypothalamus
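For reference, the kind of CPUID-based dispatch mentioned in the comment above could look roughly like this. It is only a sketch, assuming GCC or Clang on x86-64; flush_fn, pick_flush and the three wrappers are hypothetical names, not taken from any existing library. It selects clwb when CPUID leaf 7 reports it (EBX bit 24), falls back to clflushopt (EBX bit 23), and otherwise uses clflush.

```c
#include <cpuid.h>       /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>   /* _mm_clwb, _mm_clflushopt, _mm_clflush */

typedef void (*flush_fn)(void *);

/* Each wrapper enables its instruction only for its own body, so the
   file compiles without -mclwb or -mclflushopt. */
__attribute__((target("clwb")))
static void flush_clwb(void *p) { _mm_clwb(p); }

__attribute__((target("clflushopt")))
static void flush_clflushopt(void *p) { _mm_clflushopt(p); }

static void flush_clflush(void *p) { _mm_clflush(p); }

/* Pick the best write-back/flush instruction the CPU reports. */
static flush_fn pick_flush(void)
{
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        if (ebx & (1u << 24))
            return flush_clwb;        /* CLWB */
        if (ebx & (1u << 23))
            return flush_clflushopt;  /* CLFLUSHOPT */
    }
    return flush_clflush;             /* baseline on x86-64 */
}
```

The wrappers do not fence. clwb and clflushopt are weakly ordered, so a caller that needs the write-back ordered before later stores would follow the call with _mm_sfence().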