Counting L1 cache misses with PAPI_read_counters gives unexpected results

Asked 11/2, 2019 at 20:19 Answered 12/2, 2019 at 7:52

I am trying to use the PAPI library to count cache misses. A cache-hit performance counter is not available on my hardware, so I am trying to infer cache hits from the absence of cache misses. I have tried a few things. The first version of my code is this:

int numEvents = 2;
long long values[2];
int events[2] = {PAPI_L1_DCM, PAPI_L2_TCM};

if (PAPI_start_counters(events, numEvents) != PAPI_OK)
    printf("PAPI error: %d\n", 1);

for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
}

_mm_mfence();

if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
    exit(1);
}
miss1 = values[0];

_mm_mfence();

for (int i = 0; i < arr_size; i++) {
    array[i].value = array[i].value + 9; // (int) sum
}

_mm_mfence();

if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}
miss2 = values[0];

printf("before flush miss_1 %lli, miss_2 %lli \n", miss1, miss2);

The problem is that the second loop in this code should hit in the cache, so the L1 miss count should be extremely low, but I get unexpectedly high results for miss_2. With an array size of 200, miss_2 is nearly 100. With that many cache misses, the result gives no valid evidence that the accesses really were hits.

I also tried to rewrite it like this:

if (PAPI_start_counters(events, numEvents) != PAPI_OK)
    printf("PAPI error: %d\n", 1);

for (int i = 0; i < arr_size; i++) {
    array[i].value = array[i].value + 9; // (int) sum
}

if (PAPI_stop_counters(values, numEvents) != PAPI_OK)
    printf("PAPI error: 2\n");

printf("before flush miss %lli\n", values[0]);

But this gives an even worse result: miss_2 is more than 200. Is there anything I am doing wrong? It was supposed to give a more precise result, but it does much worse. Or am I missing something?
I have also tried without the fences; I am fairly sure that at least they do no harm. I would really appreciate any suggestions.

The disadvantage of PAPI_read_counters is its overhead and poor performance, but right now I don't care about performance; I want to determine cache hits correctly.

I was also thinking of using RDPMC, but I have not found an example that uses it without overwriting a function with _asm. Is this really the only way to use rdpmc? Is there no already-defined function that I would not have to overwrite?

EDIT: adding the disassembly of the compiled program, including the calls to PAPI_read_counters

    ./prog6:     file format elf64-x86-64


Disassembly of section .init:

00000000000009c0 <_init>:
 9c0:   48 83 ec 08             sub    $0x8,%rsp
 9c4:   48 8b 05 1d 16 20 00    mov    0x20161d(%rip),%rax        # 201fe8 <__gmon_start__>
 9cb:   48 85 c0                test   %rax,%rax
 9ce:   74 02                   je     9d2 <_init+0x12>
 9d0:   ff d0                   callq  *%rax
 9d2:   48 83 c4 08             add    $0x8,%rsp
 9d6:   c3                      retq   

Disassembly of section .plt:

00000000000009e0 <.plt>:
 9e0:   ff 35 6a 15 20 00       pushq  0x20156a(%rip)        # 201f50 <_GLOBAL_OFFSET_TABLE_+0x8>
 9e6:   ff 25 6c 15 20 00       jmpq   *0x20156c(%rip)        # 201f58 <_GLOBAL_OFFSET_TABLE_+0x10>
 9ec:   0f 1f 40 00             nopl   0x0(%rax)

00000000000009f0 <puts@plt>:
 9f0:   ff 25 6a 15 20 00       jmpq   *0x20156a(%rip)        # 201f60 <puts@GLIBC_2.2.5>
 9f6:   68 00 00 00 00          pushq  $0x0
 9fb:   e9 e0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a00 <clock_gettime@plt>:
 a00:   ff 25 62 15 20 00       jmpq   *0x201562(%rip)        # 201f68 <clock_gettime@GLIBC_2.17>
 a06:   68 01 00 00 00          pushq  $0x1
 a0b:   e9 d0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a10 <getpid@plt>:
 a10:   ff 25 5a 15 20 00       jmpq   *0x20155a(%rip)        # 201f70 <getpid@GLIBC_2.2.5>
 a16:   68 02 00 00 00          pushq  $0x2
 a1b:   e9 c0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a20 <__stack_chk_fail@plt>:
 a20:   ff 25 52 15 20 00       jmpq   *0x201552(%rip)        # 201f78 <__stack_chk_fail@GLIBC_2.4>
 a26:   68 03 00 00 00          pushq  $0x3
 a2b:   e9 b0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a30 <PAPI_read_counters@plt>:
 a30:   ff 25 4a 15 20 00       jmpq   *0x20154a(%rip)        # 201f80 <PAPI_read_counters>
 a36:   68 04 00 00 00          pushq  $0x4
 a3b:   e9 a0 ff ff ff          jmpq   9e0 <.plt>

0000000000000a40 <sched_setaffinity@plt>:
 a40:   ff 25 42 15 20 00       jmpq   *0x201542(%rip)        # 201f88 <sched_setaffinity@GLIBC_2.3.4>
 a46:   68 05 00 00 00          pushq  $0x5
 a4b:   e9 90 ff ff ff          jmpq   9e0 <.plt>

0000000000000a50 <PAPI_start_counters@plt>:
 a50:   ff 25 3a 15 20 00       jmpq   *0x20153a(%rip)        # 201f90 <PAPI_start_counters>
 a56:   68 06 00 00 00          pushq  $0x6
 a5b:   e9 80 ff ff ff          jmpq   9e0 <.plt>

0000000000000a60 <PAPI_stop_counters@plt>:
 a60:   ff 25 32 15 20 00       jmpq   *0x201532(%rip)        # 201f98 <PAPI_stop_counters>
 a66:   68 07 00 00 00          pushq  $0x7
 a6b:   e9 70 ff ff ff          jmpq   9e0 <.plt>

0000000000000a70 <malloc@plt>:
 a70:   ff 25 2a 15 20 00       jmpq   *0x20152a(%rip)        # 201fa0 <malloc@GLIBC_2.2.5>
 a76:   68 08 00 00 00          pushq  $0x8
 a7b:   e9 60 ff ff ff          jmpq   9e0 <.plt>

0000000000000a80 <PAPI_strerror@plt>:
 a80:   ff 25 22 15 20 00       jmpq   *0x201522(%rip)        # 201fa8 <PAPI_strerror>
 a86:   68 09 00 00 00          pushq  $0x9
 a8b:   e9 50 ff ff ff          jmpq   9e0 <.plt>

0000000000000a90 <__printf_chk@plt>:
 a90:   ff 25 1a 15 20 00       jmpq   *0x20151a(%rip)        # 201fb0 <__printf_chk@GLIBC_2.3.4>
 a96:   68 0a 00 00 00          pushq  $0xa
 a9b:   e9 40 ff ff ff          jmpq   9e0 <.plt>

0000000000000aa0 <getrusage@plt>:
 aa0:   ff 25 12 15 20 00       jmpq   *0x201512(%rip)        # 201fb8 <getrusage@GLIBC_2.2.5>
 aa6:   68 0b 00 00 00          pushq  $0xb
 aab:   e9 30 ff ff ff          jmpq   9e0 <.plt>

0000000000000ab0 <exit@plt>:
 ab0:   ff 25 0a 15 20 00       jmpq   *0x20150a(%rip)        # 201fc0 <exit@GLIBC_2.2.5>
 ab6:   68 0c 00 00 00          pushq  $0xc
 abb:   e9 20 ff ff ff          jmpq   9e0 <.plt>

0000000000000ac0 <fwrite@plt>:
 ac0:   ff 25 02 15 20 00       jmpq   *0x201502(%rip)        # 201fc8 <fwrite@GLIBC_2.2.5>
 ac6:   68 0d 00 00 00          pushq  $0xd
 acb:   e9 10 ff ff ff          jmpq   9e0 <.plt>

0000000000000ad0 <__fprintf_chk@plt>:
 ad0:   ff 25 fa 14 20 00       jmpq   *0x2014fa(%rip)        # 201fd0 <__fprintf_chk@GLIBC_2.3.4>
 ad6:   68 0e 00 00 00          pushq  $0xe
 adb:   e9 00 ff ff ff          jmpq   9e0 <.plt>

Disassembly of section .plt.got:

0000000000000ae0 <__cxa_finalize@plt>:
 ae0:   ff 25 12 15 20 00       jmpq   *0x201512(%rip)        # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
 ae6:   66 90                   xchg   %ax,%ax

Disassembly of section .text:

0000000000000af0 <main>:
     af0:   41 57                   push   %r15
     af2:   b9 0f 00 00 00          mov    $0xf,%ecx
     af7:   41 56                   push   %r14
     af9:   41 55                   push   %r13
     afb:   41 54                   push   %r12
     afd:   55                      push   %rbp
     afe:   53                      push   %rbx
     aff:   48 81 ec 78 01 00 00    sub    $0x178,%rsp
     b06:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
     b0d:   00 00 
     b0f:   48 89 84 24 68 01 00    mov    %rax,0x168(%rsp)
     b16:   00 
     b17:   31 c0                   xor    %eax,%eax
     b19:   48 8d 9c 24 e0 00 00    lea    0xe0(%rsp),%rbx
     b20:   00 
     b21:   48 b8 00 00 00 80 07    movabs $0x8000000780000000,%rax
     b28:   00 00 80 
     b2b:   48 c7 84 24 e0 00 00    movq   $0x1,0xe0(%rsp)
     b32:   00 01 00 00 00 
     b37:   48 8d 53 08             lea    0x8(%rbx),%rdx
     b3b:   48 89 84 24 c8 00 00    mov    %rax,0xc8(%rsp)
     b42:   00 
     b43:   31 c0                   xor    %eax,%eax
     b45:   48 89 d7                mov    %rdx,%rdi
     b48:   f3 48 ab                rep stos %rax,%es:(%rdi)
     b4b:   e8 c0 fe ff ff          callq  a10 <getpid@plt>
     b50:   48 89 da                mov    %rbx,%rdx
     b53:   be 80 00 00 00          mov    $0x80,%esi
     b58:   89 c7                   mov    %eax,%edi
     b5a:   e8 e1 fe ff ff          callq  a40 <sched_setaffinity@plt>
     b5f:   85 c0                   test   %eax,%eax
     b61:   0f 85 17 03 00 00       jne    e7e <main+0x38e>
     b67:   0f ae f0                mfence 
     b6a:   48 8d 74 24 10          lea    0x10(%rsp),%rsi
     b6f:   bf 02 00 00 00          mov    $0x2,%edi
     b74:   0f ae f0                mfence 
     b77:   e8 84 fe ff ff          callq  a00 <clock_gettime@plt>
     b7c:   0f 31                   rdtsc  
     b7e:   bf 00 fa 00 00          mov    $0xfa00,%edi
     b83:   0f ae f0                mfence 
     b86:   48 c1 e2 20             shl    $0x20,%rdx
     b8a:   49 89 c6                mov    %rax,%r14
     b8d:   49 09 d6                or     %rdx,%r14
     b90:   e8 db fe ff ff          callq  a70 <malloc@plt>
     b95:   48 8d bc 24 c8 00 00    lea    0xc8(%rsp),%rdi
     b9c:   00 
     b9d:   be 02 00 00 00          mov    $0x2,%esi
     ba2:   49 89 c4                mov    %rax,%r12
     ba5:   e8 a6 fe ff ff          callq  a50 <PAPI_start_counters@plt>
     baa:   85 c0                   test   %eax,%eax
     bac:   0f 85 88 02 00 00       jne    e3a <main+0x34a>
     bb2:   4d 89 e7                mov    %r12,%r15
     bb5:   49 8d 84 24 00 fa 00    lea    0xfa00(%r12),%rax
     bbc:   00 
     bbd:   4c 89 e5                mov    %r12,%rbp
     bc0:   c7 45 00 01 00 00 00    movl   $0x1,0x0(%rbp)
     bc7:   48 83 c5 40             add    $0x40,%rbp
     bcb:   48 39 e8                cmp    %rbp,%rax
     bce:   75 f0                   jne    bc0 <main+0xd0>
     bd0:   4c 8d ac 24 d0 00 00    lea    0xd0(%rsp),%r13
     bd7:   00 
     bd8:   be 02 00 00 00          mov    $0x2,%esi
     bdd:   4c 89 ef                mov    %r13,%rdi
     be0:   e8 4b fe ff ff          callq  a30 <PAPI_read_counters@plt>
     be5:   85 c0                   test   %eax,%eax
     be7:   0f 85 b8 02 00 00       jne    ea5 <main+0x3b5>
     bed:   48 8b 84 24 d0 00 00    mov    0xd0(%rsp),%rax
     bf4:   00 
     bf5:   4c 89 e3                mov    %r12,%rbx
     bf8:   48 89 44 24 08          mov    %rax,0x8(%rsp)
     bfd:   0f 1f 00                nopl   (%rax)
     c00:   83 03 09                addl   $0x9,(%rbx)
     c03:   48 83 c3 40             add    $0x40,%rbx
     c07:   48 39 dd                cmp    %rbx,%rbp
     c0a:   75 f4                   jne    c00 <main+0x110>
     c0c:   31 d2                   xor    %edx,%edx
     c0e:   48 8d 35 88 04 00 00    lea    0x488(%rip),%rsi        # 109d <_IO_stdin_used+0x2d>
     c15:   bf 01 00 00 00          mov    $0x1,%edi
     c1a:   31 c0                   xor    %eax,%eax
     c1c:   e8 6f fe ff ff          callq  a90 <__printf_chk@plt>
     c21:   be 02 00 00 00          mov    $0x2,%esi
     c26:   4c 89 ef                mov    %r13,%rdi
     c29:   e8 02 fe ff ff          callq  a30 <PAPI_read_counters@plt>
     c2e:   85 c0                   test   %eax,%eax
     c30:   0f 85 6f 02 00 00       jne    ea5 <main+0x3b5>
     c36:   48 8b 8c 24 d0 00 00    mov    0xd0(%rsp),%rcx
     c3d:   00 
     c3e:   48 8b 54 24 08          mov    0x8(%rsp),%rdx
     c43:   48 8d 35 e6 04 00 00    lea    0x4e6(%rip),%rsi        # 1130 <_IO_stdin_used+0xc0>
     c4a:   31 c0                   xor    %eax,%eax
     c4c:   bf 01 00 00 00          mov    $0x1,%edi
     c51:   e8 3a fe ff ff          callq  a90 <__printf_chk@plt>
     c56:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     c5d:   00 00 00 
     c60:   41 0f ae 3c 24          clflush (%r12)
     c65:   49 83 c4 40             add    $0x40,%r12
     c69:   49 39 dc                cmp    %rbx,%r12
     c6c:   75 f2                   jne    c60 <main+0x170>
     c6e:   be 02 00 00 00          mov    $0x2,%esi
     c73:   4c 89 ef                mov    %r13,%rdi
     c76:   e8 b5 fd ff ff          callq  a30 <PAPI_read_counters@plt>
     c7b:   85 c0                   test   %eax,%eax
     c7d:   0f 85 22 02 00 00       jne    ea5 <main+0x3b5>
     c83:   48 8b ac 24 d0 00 00    mov    0xd0(%rsp),%rbp
     c8a:   00 
     c8b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
     c90:   41 83 07 09             addl   $0x9,(%r15)
     c94:   49 83 c7 40             add    $0x40,%r15
     c98:   49 39 df                cmp    %rbx,%r15
     c9b:   75 f3                   jne    c90 <main+0x1a0>
     c9d:   be 02 00 00 00          mov    $0x2,%esi
     ca2:   4c 89 ef                mov    %r13,%rdi
     ca5:   e8 86 fd ff ff          callq  a30 <PAPI_read_counters@plt>
     caa:   85 c0                   test   %eax,%eax
     cac:   0f 85 f3 01 00 00       jne    ea5 <main+0x3b5>
     cb2:   48 8b 8c 24 d0 00 00    mov    0xd0(%rsp),%rcx
     cb9:   00 
     cba:   48 8d 35 97 04 00 00    lea    0x497(%rip),%rsi        # 1158 <_IO_stdin_used+0xe8>
     cc1:   bf 01 00 00 00          mov    $0x1,%edi
     cc6:   31 c0                   xor    %eax,%eax
     cc8:   48 89 ea                mov    %rbp,%rdx
     ccb:   e8 c0 fd ff ff          callq  a90 <__printf_chk@plt>
     cd0:   be 02 00 00 00          mov    $0x2,%esi
     cd5:   4c 89 ef                mov    %r13,%rdi
     cd8:   e8 83 fd ff ff          callq  a60 <PAPI_stop_counters@plt>
     cdd:   85 c0                   test   %eax,%eax
     cdf:   0f 85 72 01 00 00       jne    e57 <main+0x367>
     ce5:   0f ae f0                mfence 
     ce8:   0f 31                   rdtsc  
     cea:   bf 02 00 00 00          mov    $0x2,%edi
     cef:   48 c1 e2 20             shl    $0x20,%rdx
     cf3:   48 89 c3                mov    %rax,%rbx
     cf6:   48 8d 74 24 20          lea    0x20(%rsp),%rsi
     cfb:   48 09 d3                or     %rdx,%rbx
     cfe:   e8 fd fc ff ff          callq  a00 <clock_gettime@plt>
     d03:   bf 01 00 00 00          mov    $0x1,%edi
     d08:   48 be db 34 b6 d7 82    movabs $0x431bde82d7b634db,%rsi
     d0f:   de 1b 43 
     d12:   0f ae f0                mfence 
     d15:   48 8b 4c 24 20          mov    0x20(%rsp),%rcx
     d1a:   48 2b 4c 24 10          sub    0x10(%rsp),%rcx
     d1f:   48 69 c9 00 ca 9a 3b    imul   $0x3b9aca00,%rcx,%rcx
     d26:   48 03 4c 24 28          add    0x28(%rsp),%rcx
     d2b:   48 2b 4c 24 18          sub    0x18(%rsp),%rcx
     d30:   48 89 c8                mov    %rcx,%rax
     d33:   48 c1 f9 3f             sar    $0x3f,%rcx
     d37:   48 f7 ee                imul   %rsi
     d3a:   48 8d 35 3f 04 00 00    lea    0x43f(%rip),%rsi        # 1180 <_IO_stdin_used+0x110>
     d41:   31 c0                   xor    %eax,%eax
     d43:   48 c1 fa 12             sar    $0x12,%rdx
     d47:   48 29 ca                sub    %rcx,%rdx
     d4a:   e8 41 fd ff ff          callq  a90 <__printf_chk@plt>
     d4f:   48 89 da                mov    %rbx,%rdx
     d52:   bf 01 00 00 00          mov    $0x1,%edi
     d57:   31 c0                   xor    %eax,%eax
     d59:   4c 29 f2                sub    %r14,%rdx
     d5c:   48 8d 35 53 03 00 00    lea    0x353(%rip),%rsi        # 10b6 <_IO_stdin_used+0x46>
     d63:   e8 28 fd ff ff          callq  a90 <__printf_chk@plt>
     d68:   31 d2                   xor    %edx,%edx
     d6a:   48 8d 35 56 03 00 00    lea    0x356(%rip),%rsi        # 10c7 <_IO_stdin_used+0x57>
     d71:   31 c0                   xor    %eax,%eax
     d73:   bf 01 00 00 00          mov    $0x1,%edi
     d78:   e8 13 fd ff ff          callq  a90 <__printf_chk@plt>
     d7d:   31 ff                   xor    %edi,%edi
     d7f:   48 8d 74 24 30          lea    0x30(%rsp),%rsi
     d84:   e8 17 fd ff ff          callq  aa0 <getrusage@plt>
     d89:   83 f8 ff                cmp    $0xffffffff,%eax
     d8c:   0f 84 d6 00 00 00       je     e68 <main+0x378>
     d92:   48 8b 8c 24 b8 00 00    mov    0xb8(%rsp),%rcx
     d99:   00 
     d9a:   48 8b 94 24 b0 00 00    mov    0xb0(%rsp),%rdx
     da1:   00 
     da2:   48 8d 35 3e 03 00 00    lea    0x33e(%rip),%rsi        # 10e7 <_IO_stdin_used+0x77>
     da9:   31 c0                   xor    %eax,%eax
     dab:   bf 01 00 00 00          mov    $0x1,%edi
     db0:   e8 db fc ff ff          callq  a90 <__printf_chk@plt>
     db5:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
     db9:   bf 01 00 00 00          mov    $0x1,%edi
     dbe:   c5 fb 10 0d 12 04 00    vmovsd 0x412(%rip),%xmm1        # 11d8 <_IO_stdin_used+0x168>
     dc5:   00 
     dc6:   48 69 44 24 30 40 42    imul   $0xf4240,0x30(%rsp),%rax
     dcd:   0f 00 
     dcf:   48 03 44 24 38          add    0x38(%rsp),%rax
     dd4:   48 8d 35 d5 03 00 00    lea    0x3d5(%rip),%rsi        # 11b0 <_IO_stdin_used+0x140>
     ddb:   c4 e1 fb 2a c0          vcvtsi2sd %rax,%xmm0,%xmm0
     de0:   48 69 54 24 40 40 42    imul   $0xf4240,0x40(%rsp),%rdx
     de7:   0f 00 
     de9:   48 03 54 24 48          add    0x48(%rsp),%rdx
     dee:   c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
     df2:   c4 e1 fb 2c c0          vcvttsd2si %xmm0,%rax
     df7:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
     dfb:   c4 e1 fb 2a c2          vcvtsi2sd %rdx,%xmm0,%xmm0
     e00:   c5 fb 59 c1             vmulsd %xmm1,%xmm0,%xmm0
     e04:   c4 e1 fb 2c d0          vcvttsd2si %xmm0,%rdx
     e09:   48 01 c2                add    %rax,%rdx
     e0c:   31 c0                   xor    %eax,%eax
     e0e:   e8 7d fc ff ff          callq  a90 <__printf_chk@plt>
     e13:   31 c0                   xor    %eax,%eax
     e15:   48 8b 8c 24 68 01 00    mov    0x168(%rsp),%rcx
     e1c:   00 
     e1d:   64 48 33 0c 25 28 00    xor    %fs:0x28,%rcx
     e24:   00 00 
     e26:   75 51                   jne    e79 <main+0x389>
     e28:   48 81 c4 78 01 00 00    add    $0x178,%rsp
     e2f:   5b                      pop    %rbx
     e30:   5d                      pop    %rbp
     e31:   41 5c                   pop    %r12
     e33:   41 5d                   pop    %r13
     e35:   41 5e                   pop    %r14
     e37:   41 5f                   pop    %r15
     e39:   c3                      retq   
     e3a:   ba 01 00 00 00          mov    $0x1,%edx
     e3f:   48 8d 35 47 02 00 00    lea    0x247(%rip),%rsi        # 108d <_IO_stdin_used+0x1d>
     e46:   bf 01 00 00 00          mov    $0x1,%edi
     e4b:   31 c0                   xor    %eax,%eax
     e4d:   e8 3e fc ff ff          callq  a90 <__printf_chk@plt>
     e52:   e9 5b fd ff ff          jmpq   bb2 <main+0xc2>
     e57:   48 8d 3d 4a 02 00 00    lea    0x24a(%rip),%rdi        # 10a8 <_IO_stdin_used+0x38>
     e5e:   e8 8d fb ff ff          callq  9f0 <puts@plt>
     e63:   e9 7d fe ff ff          jmpq   ce5 <main+0x1f5>
     e68:   48 8d 3d 62 02 00 00    lea    0x262(%rip),%rdi        # 10d1 <_IO_stdin_used+0x61>
     e6f:   e8 7c fb ff ff          callq  9f0 <puts@plt>
     e74:   e9 19 ff ff ff          jmpq   d92 <main+0x2a2>
     e79:   e8 a2 fb ff ff          callq  a20 <__stack_chk_fail@plt>
     e7e:   48 8b 0d 9b 11 20 00    mov    0x20119b(%rip),%rcx        # 202020 <stderr@@GLIBC_2.2.5>
     e85:   ba 18 00 00 00          mov    $0x18,%edx
     e8a:   be 01 00 00 00          mov    $0x1,%esi
     e8f:   48 8d 3d de 01 00 00    lea    0x1de(%rip),%rdi        # 1074 <_IO_stdin_used+0x4>
     e96:   e8 25 fc ff ff          callq  ac0 <fwrite@plt>
     e9b:   bf 01 00 00 00          mov    $0x1,%edi
     ea0:   e8 0b fc ff ff          callq  ab0 <exit@plt>
     ea5:   89 c7                   mov    %eax,%edi
     ea7:   e8 d4 fb ff ff          callq  a80 <PAPI_strerror@plt>
     eac:   48 8b 3d 6d 11 20 00    mov    0x20116d(%rip),%rdi        # 202020 <stderr@@GLIBC_2.2.5>
     eb3:   be 01 00 00 00          mov    $0x1,%esi
     eb8:   48 8d 15 49 02 00 00    lea    0x249(%rip),%rdx        # 1108 <_IO_stdin_used+0x98>
     ebf:   48 89 c1                mov    %rax,%rcx
     ec2:   31 c0                   xor    %eax,%eax
     ec4:   e8 07 fc ff ff          callq  ad0 <__fprintf_chk@plt>
     ec9:   bf 01 00 00 00          mov    $0x1,%edi
     ece:   e8 dd fb ff ff          callq  ab0 <exit@plt>
     ed3:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     eda:   00 00 00 
     edd:   0f 1f 00                nopl   (%rax)

0000000000000ee0 <_start>:
     ee0:   31 ed                   xor    %ebp,%ebp
     ee2:   49 89 d1                mov    %rdx,%r9
     ee5:   5e                      pop    %rsi
     ee6:   48 89 e2                mov    %rsp,%rdx
     ee9:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
     eed:   50                      push   %rax
     eee:   54                      push   %rsp
     eef:   4c 8d 05 6a 01 00 00    lea    0x16a(%rip),%r8        # 1060 <__libc_csu_fini>
     ef6:   48 8d 0d f3 00 00 00    lea    0xf3(%rip),%rcx        # ff0 <__libc_csu_init>
     efd:   48 8d 3d ec fb ff ff    lea    -0x414(%rip),%rdi        # af0 <main>
     f04:   ff 15 d6 10 20 00       callq  *0x2010d6(%rip)        # 201fe0 <__libc_start_main@GLIBC_2.2.5>
     f0a:   f4                      hlt    
     f0b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000000f10 <deregister_tm_clones>:
     f10:   48 8d 3d f9 10 20 00    lea    0x2010f9(%rip),%rdi        # 202010 <__TMC_END__>
     f17:   55                      push   %rbp
     f18:   48 8d 05 f1 10 20 00    lea    0x2010f1(%rip),%rax        # 202010 <__TMC_END__>
     f1f:   48 39 f8                cmp    %rdi,%rax
     f22:   48 89 e5                mov    %rsp,%rbp
     f25:   74 19                   je     f40 <deregister_tm_clones+0x30>
     f27:   48 8b 05 aa 10 20 00    mov    0x2010aa(%rip),%rax        # 201fd8 <_ITM_deregisterTMCloneTable>
     f2e:   48 85 c0                test   %rax,%rax
     f31:   74 0d                   je     f40 <deregister_tm_clones+0x30>
     f33:   5d                      pop    %rbp
     f34:   ff e0                   jmpq   *%rax
     f36:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     f3d:   00 00 00 
     f40:   5d                      pop    %rbp
     f41:   c3                      retq   
     f42:   0f 1f 40 00             nopl   0x0(%rax)
     f46:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     f4d:   00 00 00 

0000000000000f50 <register_tm_clones>:
     f50:   48 8d 3d b9 10 20 00    lea    0x2010b9(%rip),%rdi        # 202010 <__TMC_END__>
     f57:   48 8d 35 b2 10 20 00    lea    0x2010b2(%rip),%rsi        # 202010 <__TMC_END__>
     f5e:   55                      push   %rbp
     f5f:   48 29 fe                sub    %rdi,%rsi
     f62:   48 89 e5                mov    %rsp,%rbp
     f65:   48 c1 fe 03             sar    $0x3,%rsi
     f69:   48 89 f0                mov    %rsi,%rax
     f6c:   48 c1 e8 3f             shr    $0x3f,%rax
     f70:   48 01 c6                add    %rax,%rsi
     f73:   48 d1 fe                sar    %rsi
     f76:   74 18                   je     f90 <register_tm_clones+0x40>
     f78:   48 8b 05 71 10 20 00    mov    0x201071(%rip),%rax        # 201ff0 <_ITM_registerTMCloneTable>
     f7f:   48 85 c0                test   %rax,%rax
     f82:   74 0c                   je     f90 <register_tm_clones+0x40>
     f84:   5d                      pop    %rbp
     f85:   ff e0                   jmpq   *%rax
     f87:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
     f8e:   00 00 
     f90:   5d                      pop    %rbp
     f91:   c3                      retq   
     f92:   0f 1f 40 00             nopl   0x0(%rax)
     f96:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
     f9d:   00 00 00 

0000000000000fa0 <__do_global_dtors_aux>:
     fa0:   80 3d 81 10 20 00 00    cmpb   $0x0,0x201081(%rip)        # 202028 <completed.7696>
     fa7:   75 2f                   jne    fd8 <__do_global_dtors_aux+0x38>
     fa9:   48 83 3d 47 10 20 00    cmpq   $0x0,0x201047(%rip)        # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
     fb0:   00 
     fb1:   55                      push   %rbp
     fb2:   48 89 e5                mov    %rsp,%rbp
     fb5:   74 0c                   je     fc3 <__do_global_dtors_aux+0x23>
     fb7:   48 8b 3d 4a 10 20 00    mov    0x20104a(%rip),%rdi        # 202008 <__dso_handle>
     fbe:   e8 1d fb ff ff          callq  ae0 <__cxa_finalize@plt>
     fc3:   e8 48 ff ff ff          callq  f10 <deregister_tm_clones>
     fc8:   c6 05 59 10 20 00 01    movb   $0x1,0x201059(%rip)        # 202028 <completed.7696>
     fcf:   5d                      pop    %rbp
     fd0:   c3                      retq   
     fd1:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
     fd8:   f3 c3                   repz retq 
     fda:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000fe0 <frame_dummy>:
     fe0:   55                      push   %rbp
     fe1:   48 89 e5                mov    %rsp,%rbp
     fe4:   5d                      pop    %rbp
     fe5:   e9 66 ff ff ff          jmpq   f50 <register_tm_clones>
     fea:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000ff0 <__libc_csu_init>:
     ff0:   41 57                   push   %r15
     ff2:   41 56                   push   %r14
     ff4:   49 89 d7                mov    %rdx,%r15
     ff7:   41 55                   push   %r13
     ff9:   41 54                   push   %r12
     ffb:   4c 8d 25 36 0d 20 00    lea    0x200d36(%rip),%r12        # 201d38 <__frame_dummy_init_array_entry>
    1002:   55                      push   %rbp
    1003:   48 8d 2d 36 0d 20 00    lea    0x200d36(%rip),%rbp        # 201d40 <__init_array_end>
    100a:   53                      push   %rbx
    100b:   41 89 fd                mov    %edi,%r13d
    100e:   49 89 f6                mov    %rsi,%r14
    1011:   4c 29 e5                sub    %r12,%rbp
    1014:   48 83 ec 08             sub    $0x8,%rsp
    1018:   48 c1 fd 03             sar    $0x3,%rbp
    101c:   e8 9f f9 ff ff          callq  9c0 <_init>
    1021:   48 85 ed                test   %rbp,%rbp
    1024:   74 20                   je     1046 <__libc_csu_init+0x56>
    1026:   31 db                   xor    %ebx,%ebx
    1028:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
    102f:   00 
    1030:   4c 89 fa                mov    %r15,%rdx
    1033:   4c 89 f6                mov    %r14,%rsi
    1036:   44 89 ef                mov    %r13d,%edi
    1039:   41 ff 14 dc             callq  *(%r12,%rbx,8)
    103d:   48 83 c3 01             add    $0x1,%rbx
    1041:   48 39 dd                cmp    %rbx,%rbp
    1044:   75 ea                   jne    1030 <__libc_csu_init+0x40>
    1046:   48 83 c4 08             add    $0x8,%rsp
    104a:   5b                      pop    %rbx
    104b:   5d                      pop    %rbp
    104c:   41 5c                   pop    %r12
    104e:   41 5d                   pop    %r13
    1050:   41 5e                   pop    %r14
    1052:   41 5f                   pop    %r15
    1054:   c3                      retq   
    1055:   90                      nop
    1056:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    105d:   00 00 00 

0000000000001060 <__libc_csu_fini>:
    1060:   f3 c3                   repz retq 

Disassembly of section .fini:

0000000000001064 <_fini>:
    1064:   48 83 ec 08             sub    $0x8,%rsp
    1068:   48 83 c4 08             add    $0x8,%rsp
    106c:   c3                      retq

Each object is 64 bytes in size, and I added initialization as well:

typedef struct _object{
  int value;
  int pad_0;
  int * pad_2;
  int * pad_3;
  int * pad_4;
  int * pad_5;
  int * pad_6;
  int * pad_7;
  int * pad_8;
} object;  

object * array;
int arr_size = 1000;
array = (object *) malloc(arr_size * sizeof(object));
for(int i=0; i < arr_size; i++){
      array[i].value = 1;
    }
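A quick way to check the 64-byte claim at compile time, and to make each object start on its own cache line, is sketched below. It assumes C11, an LP64 platform (4-byte int, 8-byte pointers) and a 64-byte cache line. alloc_objects is a hypothetical helper, and aligned_alloc is swapped in for the malloc call above, which does not by itself guarantee 64-byte alignment.

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

typedef struct object {
    int value;
    int pad_0;
    int *pad_2, *pad_3, *pad_4, *pad_5, *pad_6, *pad_7, *pad_8;
} object;

/* Fails to compile if the struct does not fill exactly one assumed line. */
_Static_assert(sizeof(object) == CACHE_LINE,
               "object is expected to be exactly one cache line");

/* Hypothetical helper: allocates n line-aligned objects and sets value = 1.
 * Returns NULL on allocation failure; release the result with free(). */
static object *alloc_objects(size_t n)
{
    /* n * sizeof(object) is a multiple of CACHE_LINE, as C11 requires. */
    object *a = aligned_alloc(CACHE_LINE, n * sizeof(object));
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        a[i].value = 1;
    return a;
}
```

With this layout, element i lies entirely within line i of the array, so a miss count can be compared directly with the number of elements touched.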

Charity answered 11/2, 2019 at 20:19 Comment(14)

You're only looping over 100 elements once after making a system call, and some measurement overhead is probably part of those 100 misses. How does the count scale with array size? (That tells you overhead and whether your measurement is even sane). But anyway, looping multiple times over the same array could give you many more L1d hits, enough to drown out overhead and the effects of library + system call maybe evicting your data from L1d. – Stooge 12/2, 2019 at 8:19
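One way to act on the scaling suggestion above is to fit a straight line to the miss counts measured at several array sizes: the slope estimates misses per element and the intercept estimates the fixed measurement overhead. A minimal sketch in C follows. fit_overhead and its parameter names are hypothetical, and the numbers in the usage note are made up for illustration, not measurements.

```c
#include <stddef.h>

/* Least-squares fit of counts[i] ~= fixed + per_elem * sizes[i].
 * Returns 0 on success, or -1 if the fit is undefined
 * (fewer than two points, or all sizes identical). */
static int fit_overhead(const double *sizes, const double *counts, size_t n,
                        double *per_elem, double *fixed)
{
    if (n < 2)
        return -1;

    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        sx  += sizes[i];
        sy  += counts[i];
        sxx += sizes[i] * sizes[i];
        sxy += sizes[i] * counts[i];
    }

    double denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;

    *per_elem = ((double)n * sxy - sx * sy) / denom; /* slope */
    *fixed = (sy - *per_elem * sx) / (double)n;      /* intercept */
    return 0;
}
```

For example, hypothetical counts of 150, 250, 450 and 850 at array sizes 100, 200, 400 and 800 would fit to about one miss per element plus a fixed overhead of about 50.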

What are you talking about with overwriting a function with asm()? You don't need self-modifying code or anything weird, just normal GNU C inline asm. Or __builtin_ia32_rdpmc, but software.intel.com/en-us/forums/… reports that GCC 5 might not be treating it as volatile and optimizing two calls to both use the same value. Apparently that's still a bug in gcc8.2: godbolt.org/z/p8c_ND. – Stooge 12/2, 2019 at 8:25

Update, __rdpmc(int) is properly volatile in GCC 6.5, 7.4+ and 8.3+ (and upcoming 9.0 trunk) gcc.gnu.org/bugzilla/show_bug.cgi?id=87550. Turns out it had already been reported and fixed, but only GCC trunk on Godbolt was new enough to have it. – Stooge 12/2, 2019 at 9:25

@PeterCordes miss_2 counts increase as the array size increases. For rdpmc, all I have found was this: unsigned int a, d; __asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx)); return ((uint64_t)a) | (((uint64_t)d) << 32); but I did not understand what parameters to give it. Is __rdpmc(0) with parameter 0 supposed to count cache misses? I have not found anything useful about rdpmc before, so I am still not sure how to use it correctly – Charity 12/2, 2019 at 22:24

There are a few programmable counters and a couple fixed counters. ECX selects which one. It has to already be programmed ahead of time to be counting a specific event. I think PAPI is supposed to have a way to enable user-space RDPMC, and to find out which counter index is programmed for which event. – Stooge 13/2, 2019 at 0:18
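As a sketch of the inline-asm approach discussed in these comments, the wrapper below only executes the instruction. It assumes an x86 CPU and GNU C, that the selected counter has already been programmed to count the event of interest (for example by PAPI or perf), and that the kernel permits user-space RDPMC; otherwise the instruction faults. The names read_pmc and pmc_fixed_index are hypothetical.

```c
#include <stdint.h>

/* On Intel CPUs, setting bit 30 of ECX selects a fixed-function counter;
 * with the bit clear, ECX is the index of a programmable counter. */
#define PMC_FIXED_FLAG (1u << 30)

/* Builds the ECX selector for fixed-function counter n. */
static inline uint32_t pmc_fixed_index(uint32_t n)
{
    return PMC_FIXED_FLAG | n;
}

#if defined(__x86_64__) || defined(__i386__)
/* Reads the counter selected by `selector`. RDPMC returns the low 32 bits
 * in EAX and the high bits in EDX. Which event is counted depends entirely
 * on how the counter was programmed beforehand; the selector alone does not
 * choose an event such as cache misses. */
static inline uint64_t read_pmc(uint32_t selector)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(selector));
    return (uint64_t)lo | ((uint64_t)hi << 32);
}
#endif
```

A measurement would then read the same counter before and after the loop and subtract, for example `uint64_t before = read_pmc(idx); /* loop */ uint64_t delta = read_pmc(idx) - before;`, where idx is whichever counter index the setup code reports for the event.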

Your disassembly shows you compiled with optimization disabled, so all modified variables are getting stored to the stack after every statement. (And everything reloaded in the next). So those cache lines will probably stay hot, but it's extra memory stores/loads that you wouldn't expect, and that will affect counts for MEM_LOAD_RETIRED.L1_HIT if you're looking at just that, instead of the all - hit difference. – Stooge 13/2, 2019 at 14:57

gcc -o prog6 ./program_6.c -O3 -march=native -lpapi is what I am compiling with. I have not managed to try MEM_LOAD_RETIRED yet. -O3 is supposed to enable optimization, isn't it? – Charity 13/2, 2019 at 15:11

The disassembly you show appears to be from an unlinked .o (note the rel32 = 0 offset in E8 00 00 00 00 call PAPI_read_counters@PLT). -O3 is full optimization, but whatever you disassembled was not the output of that gcc command. That's just obvious from instructions like movl $0, %eax instead of xor %eax,%eax, and movl %eax, -468(%rbp) / cmpl $0, -468(%rbp) instead of test %eax,%eax. – Stooge 13/2, 2019 at 15:15

You can use objdump -d on the executable binary to emit the assembly code. It may be helpful to see the whole assembly code, so you can just copy it from your terminal and paste it here. We need to also see how the array is exactly defined and allocated. The miss2 numbers you mentioned in the comments suggest that the number of L1 replacements is about equal to the number of array elements, which means that the array is not in the cache for some reason. – Cambium 13/2, 2019 at 15:35

@PeterCordes Yes, you are right: when compiling to produce the assembly code I was missing -O3. I changed it – Charity 13/2, 2019 at 16:31

@HadiBrais I edited it as you said. – Charity 13/2, 2019 at 16:37

Just to be clear, the numbers from your comment under my answer (that basically say that miss2 is about equal to the number of array elements) are from the optimized version with initialization or some other version? – Cambium 13/2, 2019 at 17:1

It's this version, the assembly that I last modified is from that compiled version. gcc -o prog6 ./program_6.c -O3 -march=native -lpapi compiled with this – Charity 13/2, 2019 at 17:6

I've updated my answer. – Cambium 13/2, 2019 at 19:6

I've done some experiments using LIKWID, which is similar to PAPI, on Haswell. I found out that the calls to the functions that initialize and read the performance counters can cause more than 600 replacements in the L1 cache. Since the L1 cache has only 512 lines, this means that these functions may evict many of the lines that you would otherwise expect to be in the L1. By looking at the relatively large source code of PAPI_start_counters and _internal_hl_read_cnts, it seems to me that these functions may evict many lines from the L1, so the array elements don't survive in the L1 across these calls. I've verified this by using loads instead of stores and counting hits and misses using MEM_LOAD_RETIRED.*. I think the solution would be to use the RDPMC instruction. I have not used this instruction directly before. The code snippets here look useful.
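The capacity argument above can be checked with a few lines of arithmetic. This is only a sketch: the helper names are invented for illustration, and the figures in the note below (32 KiB L1D, 64-byte lines, 200 array elements of 64 bytes each) are the ones mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Number of lines a cache of `cache_bytes` holds with `line_bytes` per line. */
static inline size_t cache_line_count(size_t cache_bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return cache_bytes / line_bytes;
}

/* Lines occupied by `n_elems` elements of `elem_bytes` each, rounded up.
 * Assumption: the array starts on a cache-line boundary. */
static inline size_t array_line_count(size_t n_elems, size_t elem_bytes,
                                      size_t line_bytes)
{
    assert(line_bytes > 0);
    return (n_elems * elem_bytes + line_bytes - 1) / line_bytes;
}

/* True when `replacements` line fills are enough, in the worst case,
 * to have displaced every line the cache holds. */
static inline int could_displace_whole_cache(size_t replacements,
                                             size_t total_lines)
{
    return replacements >= total_lines;
}
```

With those figures, `cache_line_count(32 * 1024, 64)` is 512 and the array occupies 200 lines, so it fits. However, `could_displace_whole_cache(600, 512)` is true, which is why the array cannot be assumed to stay resident across the counter calls.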

Alternatively, you can put two copies of the loop after PAPI_start_counters/PAPI_read_counters and then subtract from the results the counts for one copy of the loop. This method works well.
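The bookkeeping for this two-copy method is a single subtraction. A minimal sketch follows; the function name is hypothetical, and the count attributed to one copy is either measured separately or assumed (for example, one miss per cache line touched, which is an idealization).

```c
#include <assert.h>

/* Hypothetical helper.
 * two_copy_count: counter delta measured around two back-to-back copies
 *                 of the loop.
 * one_copy_count: count attributed to a single copy of the loop.
 * The remainder estimates what the second copy cost once the first copy
 * has already brought the array into the cache. */
static inline long long second_copy_estimate(long long two_copy_count,
                                             long long one_copy_count)
{
    assert(two_copy_count >= 0 && one_copy_count >= 0);
    return two_copy_count - one_copy_count;
}
```

For a loop over 100 line-sized elements, an illustrative (not measured) delta of 104 would leave an estimate of 4 for the second copy. A negative result means the first copy cost less than assumed, for instance because some lines were already cached.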

By the way, the L1D.REPLACEMENT counter seems to be fairly accurate on Haswell when the number of cache lines accessed is larger than about 10. Perhaps the count would be exact with RDPMC.

From your previous question, it seems that you're on Skylake. According to the PAPI event mapping, PAPI_L1_DCM and PAPI_L2_TCM are mapped to L1D.REPLACEMENT and LONGEST_LAT_CACHE.REFERENCE performance monitoring events on Intel processors. These are defined in the Intel manual as follows:

L1D.REPLACEMENT: Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.

LONGEST_LAT_CACHE.REFERENCE: This event counts core-originated cacheable demand requests that refer to the last level cache (LLC). Demand requests include loads, RFOs, and hardware prefetches from L1D, and instruction fetches from IFU.

Without getting into the details of when these events exactly occur, there are three important points here that are relevant to your question:

Both events are counted at the cache-line granularity, not x86 instruction or load uop granularities.
These events may occur due to the L1D hardware prefetchers. This can impact miss2.
There is no way to count L1D hits at the cache line granularity for a specific physical or logical core using these events (or any other set of events on SnB-based microarchitectures).

On Skylake, there are other native events that you can use to count L1D misses and hits per load instruction. You can use MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that hit in the L1D. You can use MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that miss in the L1D. There don't seem to be PAPI events for them. According to the documentation, you can pass native event codes to PAPIF_start_counters.

Another issue is that it's not clear to me whether PAPIF_start_counters by default will count only user events or both kernel and user events. It seems that you can use PAPI_create_eventset to control the counting domain.

The calls to PAPI APIs can also impact the event counts. You can try to measure this using an empty block as follows:

if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
   fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
   exit(1);
}

// Nothing.

if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}

This measurement will give you an estimate of the error that may occur due to PAPI itself.

Also, I don't think you need to use _mm_mfence.

Cambium answered 12/2, 2019 at 7:52 Comment(24)

perf list shows events for mem_load_retired.l1_hit, in case that helps. (For a while you needed the ocperf.py wrapper, but in the last year or so perf itself gained support for a lot of uarch-specific named events. But that might just be in the perf front-end, not the PAPI library itself if perf even uses that.) – Stooge 12/2, 2019 at 8:10

Hello, thank you for helping. Well my assumption was that there should not be need of prefetching as I am reading entire array before I start counters, and it should be all in cache when I start second loop and defining misses. I tried just calculation of overhead and it varies mostly up to 30 in average. For PAPIF_start_counters to add event I found this: if (PAPI_event_name_to_code("MEM_LOAD_RETIRED.L1_HIT",&native) != PAPI_OK). Well it compiles and does not give errors, I will try to find usage of it as well. Thank you again very much. I will write what I will manage to find and try. – Charity 12/2, 2019 at 23:3

@AnaKhorguani The L1D on Skylake has two hardware prefetchers and I don't know exactly how they work on that microarchitecture. It'd be helpful if you showed us the assembly code between the PAPI_read_counters calls, so we can see what actually gets executed. Also showing miss1 and miss2 for different array sizes (such as 100, 1K, 10K, 100K, 1M, 10M, 100M) would be nice. Also if I'm not mistaken, the size of an array element is 4 bytes, right? – Cambium 13/2, 2019 at 5:0

miss2 = values[0]; .loc 1 107 0 movq -160(%rbp), %rax movq %rax, -408(%rbp) program_6.c printf("before flush miss_1 %lli, miss_2 %lli \n", miss1, miss2); .loc 1 109 0 movq -408(%rbp), %rdx movq -432(%rbp), %rax movq %rax, %rsi leaq .LC4(%rip), %rdi movl $0, %eax .LBB42: program_6.c for(int i=0; i < arr_size; i++){ .loc 1 113 0 movl $0, -496(%rbp) jmp .L15 program_6.c _mm_clflush(&array[i]); .loc 1 114 0 discriminator 3 GAS LISTING /tmp/ccofwWT8.s page 40 ovl -496(%rbp), %eax cltq salq $6, %rax movq %rax, %rdx movq -440(%rbp), %rax addq %rdx, %rax – Charity 13/2, 2019 at 14:20

movq %rax, -360(%rbp) .LBB43: .LBB44: /usr/lib/gcc/x86_64-linux-gnu/7/include/emmintrin.h **** } .loc 2 1485 0 discriminator 3 movq -360(%rbp), %rax clflush (%rax) .LBE44: .LBE43: 113:program_6.c _mm_clflush(&array[i]); .loc 1 113 0 discriminator 3 addl $1, -496(%rbp) .L15: program_6.c _mm_clflush(&array[i]); .loc 1 113 0 is_stmt 0 discriminator 1 movl -496(%rbp), %eax cmpl -484(%rbp), %eax jl .L16 .LBE42: program_6.c **** if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) { 346 .loc 1 118 0 is_stmt 1 – Charity 13/2, 2019 at 14:21

this is compiled version but I am not sure it's readable. The size of each array element is 64 bytes, size of cache line. now it's 200 and average of miss_2 is 150. with 300 it gets 250, with 500 it's average than 500. with 1000 miss_2 is near 1000 as well. Well increasing array size does not exactly match the idea, because I was planning to make array which would fit in L1 cache, it's size is 32 Kb with me. that is why I was trying with the smaller size. – Charity 13/2, 2019 at 14:27

@AnaKhorguani It's difficult to read this in the comments. You can edit your question and add all of this information by clicking on the "edit" button right below the tags. – Cambium 13/2, 2019 at 14:39

@HadiBrais Yes I updated the post, I am not sure what to read from this to be honest, so if more code is needed let me know – Charity 13/2, 2019 at 14:47

Thank you very much. Well I just started playing with MEM_INST_RETIRED.ALL_LOADS, MEM_LOAD_RETIRED.L1_HIT, MEM_LOAD_RETIRED.L1_MISS (you have mentioned calculating misses with first two, but I found this event and decided to try). instead of PAPI_start_counters and PAPI_read_counters I use PAPI_start and PAPI_read. though I am not sure exactly what I am counting yet. The results are as usual unexpected so I will try to see some pattern. – Charity 13/2, 2019 at 20:44

But I checked the git source code which you mentioned for PAPI_read_counters and apparently it's using PAPI_read, so I guess PAPI_read will cause same cache replacements. Though I am trying now to read one element, so I know it's in cache, then read counter, then read this element again and after, read counter again. well my naive hope was that count of misses would be 0, so I would know for sure my read of element caused miss or not. But since this is not realistic I am not sure if such fine grain check of counters will give useful result. – Charity 13/2, 2019 at 20:50

I will try 2 loops after PAPI_start_counters/PAPI_read_counters. let's say I loop over 100 elements, I will have 2 loops and I will subtract 100 from the read counter result. right? Using RDPMC is one of my goal, I hoped I could use __rdpmc() as a ready function, similar to PAPI_read_counters for example, as _asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx)); return ((long long)a) | (((long long)d) << 32); here it's a bit confusing still what exactly it's counting and how should I use it for caches, but I will read it more carefully and try to see if I manage to use it. Thank you again – Charity 13/2, 2019 at 21:1

@AnaKhorguani Right. Note that MEM_INST_RETIRED.ALL_LOADS is approximately equal to MEM_LOAD_RETIRED.L1_MISS + MEM_LOAD_RETIRED.L1_HIT + MEM_LOAD_RETIRED.FB_HIT. That is, the number of load instructions that miss in the L1 is MEM_LOAD_RETIRED.L1_MISS +MEM_LOAD_RETIRED.FB_HIT. – Cambium 13/2, 2019 at 21:10

@HadiBrais Hello, I have one question about the prefetches counting as misses that you mentioned. Will MEM_LOAD_RETIRED.L1_MISS see them as misses as well? I think I have found an easy way to disable and enable the prefetchers. I see a time difference of around 100 msec between these two states. Counting misses also shows some change, I think, but what I found is that even before disabling prefetching, the miss count was close to what I would expect without prefetchers. So now I think this event sees them as misses as well. – Charity 21/2, 2019 at 10:17

@AnaKhorguani MEM_LOAD_RETIRED.L1_MISS only counts demand loads and software prefetches, not hardware prefetches. However, MEM_LOAD_RETIRED.FB_HIT may be incremented when a load miss hit an FB that was allocated for a hardware prefetch. Also the hardware prefetchers may evict lines that you need, potentially turning accesses into misses that otherwise would be hits. Which solution have you followed? Using RDPMC or two copies of the loop and subtracting the impact of the first one? – Cambium 21/2, 2019 at 10:54

@HadiBrais Right now I am testing whether I correctly managed to disable the hardware prefetchers. I create an array of 100,000 million integer elements, I fill it, then evict all elements with clflush (even if an element is not in the cache it does not hurt, I believe). Then I start PAPI and read it (at this point there is nothing to evict), then I loop over the array, reading and modifying each element, and afterwards I read the PAPI events again. So even if the PAPI calls evict some cache lines, this will not affect the result any more. When the prefetchers are enabled, MEM_LOAD_RETIRED.L1_MISS gives nearly 62,000 misses. – Charity 21/2, 2019 at 11:11

So this is almost as the size of the array divided over 16 (as each cache line holds 16 elements now). Which means I get misses for accessing each 16th element. But prefetchers are the ones that should take care that even before new element is asked, it's already loaded to the cache. maybe I am expecting too high difference with and without prefetchers. – Charity 21/2, 2019 at 11:12

@AnaKhorguani It's not really that simple. Getting a miss per cache line is certainly possible even if all prefetchers are enabled. I presume then that MEM_LOAD_RETIRED.FB_HIT counts 15 per line and MEM_LOAD_RETIRED.L1_HIT is basically close to zero. – Cambium 21/2, 2019 at 11:21

@HadiBrais L1 hit is not zero, it's several hundreds. But FB_HIT is 0 when I read them with papi at the end of the loop. – Charity 21/2, 2019 at 11:29

@AnaKhorguani Hmm, you can post a new question (with the code and how you are compiling it). Just to be clear, the size of the array is 1 million 4-byte integers, not 100,000 million integers (that is 100 billion), right? – Cambium 21/2, 2019 at 11:34

@HadiBrais the number of elements in array is 1 million, and each has size of 4 bytes. Ok I will post it as a new question, that I am reading MEM_LOAD_RETIRED.L1_MISS and there is no effect seen by prefetching. Or is it about FB_HIT? I have not searched exactly what it is for. I saw this: Retired load uops which missed L1 but hit line fill buffer (LFB), but not sure what LFB is for. – Charity 21/2, 2019 at 12:11

@HadiBrais I am sorry the result was not correct, I was getting papi error which I have not noticed. the result I get is: ALL_LOADS: 125134, L1_HIT: 796, L1_MISS: 60946, FB_HIT: 63360 after iterating this whole 1 000 000 integers. – Charity 21/2, 2019 at 19:1

@AnaKhorguani It seems that your compiler has used AVX2 to emit a single load instruction for each 8 integers, resulting in about 125134 load instructions, each 32-byte wide. The first load from a line is counted as L1_MISS and the second one is counted as FB_HIT. We have L1_MISS+FB_HIT+L1_HIT ~= ALL_LOADS, as expected. – Cambium 21/2, 2019 at 19:57
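The accounting in this comment can be reproduced with a short calculation. This is a sketch with invented helper names; the counts and sizes used in the note below are the ones reported in the comments above.

```c
#include <assert.h>

/* Loads accounted for by the three outcome events:
 * L1_HIT + L1_MISS + FB_HIT, to be compared with ALL_LOADS. */
static inline long long loads_accounted(long long l1_hit, long long l1_miss,
                                        long long fb_hit)
{
    return l1_hit + l1_miss + fb_hit;
}

/* Number of accesses of `access_bytes` each needed to cover `n_elems`
 * elements of `elem_bytes` each (rounded up). With access_bytes = 32 this
 * is the number of 32-byte vector loads; with 64 it is the number of
 * cache lines. */
static inline long long expected_accesses(long long n_elems,
                                          long long elem_bytes,
                                          long long access_bytes)
{
    assert(access_bytes > 0);
    return (n_elems * elem_bytes + access_bytes - 1) / access_bytes;
}
```

With the reported counts, `loads_accounted(796, 60946, 63360)` is 125102, which is 32 short of the ALL_LOADS value of 125134. For 1,000,000 4-byte integers, `expected_accesses(1000000, 4, 32)` is 125000 loads and `expected_accesses(1000000, 4, 64)` is 62500 cache lines, close to the reported L1_MISS and FB_HIT counts of roughly one each per line.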

@HadiBrais Thank you very much. I played a lot with papis events and prefetchers together. Yes I was compiling with -O3 option which apparently optimized loads to issue less requests. Finally the results I got make sense. I have decided not to try too fine-grained approach, read counter after every load of element. Any case it's too hard to say exactly where one concrete miss comes from, so I will observe entire loop results for example. Thank you again, your comments were very helpful and got me necessary hints. – Charity 22/2, 2019 at 14:12

Also I am sticking with PAPI_read so far. As I will read once before start of loop and after finishing loop, even if it evicts some cache lines before or after it won't affect my data and it's overhead is not that important for my observations, so I think it should be ok. – Charity 22/2, 2019 at 14:21
