Using rdmsr/rdpmc for branch prediction accuracy
I am trying to understand how a branch prediction unit works in a CPU.

I have used PAPI and also Linux's perf_events, but neither of them gives accurate results for my case.

This is my code:

void func(int* arr, int sequence_len){
  for(int i = 0; i < sequence_len; i++){
      // region starts
      if(arr[i]){
          do_sth();
      }
      // region ends
  }
}

My array consists of 0s and 1s and follows a fixed pattern of length sequence_len. For example, if the size is 8, the pattern might be 0 1 0 1 0 0 1 1.
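
For concreteness, a small driver along the lines of the sketch below could build such an input and run func() over it repeatedly; the fixed seed, the array size, and the iteration count are placeholder choices of mine, not taken from the asker's actual setup.

#include <stdlib.h>

void func(int* arr, int sequence_len);   // the function shown above

int main(void){
    const int sequence_len = 200;
    int* arr = malloc(sequence_len * sizeof *arr);

    srand(42);                            // fixed seed -> same 0/1 pattern every run
    for(int i = 0; i < sequence_len; i++)
        arr[i] = rand() & 1;              // pattern of 0s and 1s

    for(int iter = 0; iter < 10; iter++)  // repeat so the predictor can learn it
        func(arr, sequence_len);

    free(arr);
    return 0;
}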

Trial 1:

I am trying to understand how the CPU predicts those branches. So I have used PAPI and set up a performance counter for mispredicted branch instructions (I know that it also counts indirect branches).

int func(){
  papi_read(r1);
  for(){
    //... same as above
  }
  papi_read(r2);
  return r2-r1;
}

int main(){
   init_papi();
   for(int i = 0; i < 10; i++)
     res[i] = func();

   for(int i = 0; i < 10; i++)
     print(res[i]);
}
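
For reference, a concrete PAPI version of this measurement could look roughly like the following sketch. It uses the low-level event-set API with the PAPI_BR_MSP preset (mispredicted conditional branches); the exact event choice and the omitted error handling are my assumptions, not the asker's real code.

#include <papi.h>

void func(int* arr, int sequence_len);            // the loop shown earlier

// Sketch: count mispredicted conditional branches across one pass over arr[].
// Assumes PAPI_library_init(PAPI_VER_CURRENT) has already been called once.
long long count_mispredictions(int* arr, int sequence_len){
    int event_set = PAPI_NULL;
    long long before = 0, after = 0, dummy = 0;

    PAPI_create_eventset(&event_set);             // error checks omitted
    PAPI_add_event(event_set, PAPI_BR_MSP);       // mispredicted conditional branches
    PAPI_start(event_set);

    PAPI_read(event_set, &before);
    func(arr, sequence_len);
    PAPI_read(event_set, &after);

    PAPI_stop(event_set, &dummy);
    PAPI_cleanup_eventset(event_set);
    PAPI_destroy_eventset(&event_set);
    return after - before;
}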

This is what I see as output (for a sequence length of 200):

100 #iter1
40  #iter2
10  #iter3
3
0
0
#...

So at first the CPU predicts blindly, succeeding only half of the time. In the next iterations it predicts better and better, and after some number of iterations it predicts the sequence perfectly.

Trial 2

I would like to see at which array index the CPU mispredicts.

int* func(){
  int* res;
  for(){
    papi_read(r1);
    if(arr[i])
        do_sth();
    papi_read(r2);
    res[i] = r2-r1;
  }
  return res;
}

int main(){
   init_papi();
   int* results[10];
   for(int i = 0; i < 10; i++)
     results[i] = func();

   for(int i = 0; i < 10; i++)
     print(results[i]);   // per-index results of one iteration
}

Expected result:

#1st iteration, 0 means no mispred, 1 means mispred
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
0 0 0 0 1 0 0 0 1 0 0... # total of 200 results
Mispred: 40/200 # it learned from previous iteration
#3rd iteration
0 0 0 0 0 0 0 0 1 0 0... # total of 200 results
Mispred: 10/200 # continues to learn
#...

Received result:

#1st iteration
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
1 0 0 0 1 1 0 1 0 0 0... # total of 200 results
Mispred: 100/200 # it DID NOT learn from previous iteration
#3rd iteration
0 1 0 1 0 1 0 1 1 0 0... # total of 200 results
Mispred: 100/200 # NO LEARNING
#...

My observation

When I measure the mispredictions outside of the for loop, I can see that the CPU learns from its mispredictions. However, when I try to measure a single branch instruction's misprediction, then either the CPU cannot learn, or I am measuring it wrongly.

My explanation

I am giving 200 as the sequence length. The CPU has one small local branch predictor, like the 2-3 bit saturating counters in Intel CPUs, and one big global branch predictor. When I measure outside of the loop, I introduce less noise into the measurement. By less noise, I mean fewer PAPI calls.

Think about it this way. For the outside-of-the-loop measurement, the global history is:

papi_start, branch_outcome1, branch_outcome2, branch_outcome3, ..., papi_end, papi_start (2nd iteration of the main loop), branch_outcome1, ...

So, the branch predictor can somehow find the pattern of that single branch.

However, if I try to measure each branch instruction individually, then the global history is: papi_start, branch_outcome1, papi_end, papi_start, branch_outcome2, papi_end, ...

So I am introducing more and more branches into the global history. I assume the global history cannot hold many branch entries, and therefore it cannot find any correlation/pattern for the if statement (branch) I care about.

As a result

I need to measure the prediction outcome of a single branch. I know that the CPU can learn the pattern of 200 as long as I don't introduce too many PAPI calls. I have looked at the PAPI call path and seen lots of for loops and if conditions.

That is why I need a better measurement. I have tried Linux perf_events, but it makes ioctl calls, which are system calls, so I pollute the global history with system-call branches; therefore it is not a good measurement either.

I have read about the rdpmc and rdmsr instructions, and I assume that since they are single instructions, they will not pollute the global history, so I can measure one branch instruction at a time.

However, I have no clue how I can do that. I have an AMD Ryzen 3600 CPU. These are the links that I found online, but I couldn't figure out how to use them. In addition to that, am I missing something?

Intel rdpmc

AMD Performance manual

Mickeymicki answered 17/2, 2020 at 14:51 Comment(4)
Why not try it on bare metal, on an ARM microcontroller for instance? The behavior would be more predictable and easier to debug since there is no OS. – Internationalism
There is a nice article about measuring branch prediction on ARM Cortex here: community.arm.com/developer/ip-products/processors/b/… – Internationalism
Well, I want to measure the AMD processor. I don't think your link answers my question, but I'll look into it just to learn new things. @Internationalism – Mickeymicki
@The_Average_Engineer: x86 CPUs boot up in real mode, and there's always firmware built into the motherboard which either loads a UEFI application or a legacy BIOS boot sector. It's not like an ARM board where you're basically writing the firmware into flash. I don't think bare metal (or even running under UEFI) is a very useful suggestion. At least a UEFI application wouldn't have to do a bunch of osdev crap (like setting up a GDT and page tables) just to run normal 64-bit code, and could use UEFI functions to save results to a file. But you wouldn't have a debugger or anything. – Endothermic

You have assumed that the PAPI and/or perf_events code has a relatively light footprint. This is incorrect. If you change the performance counter event to something like "instructions retired" or "CPU cycles not halted", you will be able to see how much overhead this operation contains in your software environment. The details will depend on your OS version, but I expect the overhead to be in the hundreds of instructions/thousands of cycles because of the kernel crossing required to read the counters in perf_events (which is used by PAPI). The code path will certainly include its own branches.
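
One way to put a number on this is to read an "instructions retired" counter twice back to back: the difference is roughly the instruction cost of a single read on your system. A minimal sketch using PAPI (the event choice and the missing error handling are my own assumptions):

#include <papi.h>
#include <stdio.h>

// Sketch: estimate the per-read cost of the counter-read path by reading
// an instructions-completed counter twice with nothing in between.
int main(void){
    int event_set = PAPI_NULL;
    long long first = 0, second = 0, dummy = 0;

    PAPI_library_init(PAPI_VER_CURRENT);          // error checks omitted
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_INS);      // instructions completed
    PAPI_start(event_set);

    PAPI_read(event_set, &first);
    PAPI_read(event_set, &second);                // back-to-back read

    PAPI_stop(event_set, &dummy);
    printf("instructions per PAPI_read: %lld\n", second - first);
    return 0;
}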

If your kernel supports "User-Mode RDPMC" (CR4.PCE=1), you can read a performance counter with a single instruction. Examples are available in https://github.com/jdmccalpin/low-overhead-timers.
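
For illustration, with user-mode RDPMC enabled the read itself can be as small as the sketch below. It assumes the counter selected by index has already been programmed (for example by the tools in that repository, or via perf_event as in the other answer); picking a valid index is outside the scope of this snippet.

#include <stdint.h>

// Sketch: one user-space RDPMC. Faults if CR4.PCE is clear or if 'index'
// does not name a programmed counter.
static inline uint64_t rdpmc_read(uint32_t index){
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}

A measurement is then just two rdpmc_read() calls around the region of interest, subject to the pipeline caveats below.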

Even when limiting the measurement code to the native RDPMC instruction (and the surrounding code to save the results), measurements are disruptive to the processor pipeline. RDPMC is a microcoded instruction. On the Ryzen core, the instruction executes 20 micro-ops and has a throughput of one instruction per 20 cycles. (Ref: https://www.agner.org/optimize/instruction_tables.pdf)

Any measurements at fine granularities are challenging because the out-of-order capabilities of modern processors interact with the user code in ways that are poorly documented and difficult to anticipate. More notes on this topic (also relevant to AMD processors) are at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/

Pester answered 17/2, 2020 at 16:58 Comment(1)
Further information on how to perform low-overhead performance measurements can also be found in the following paper: arxiv.org/abs/1911.03282 – Into

The perf_event_open() documentation describes how to correctly use rdpmc with events created via that interface. The approach described in @JohnDMcCalpin's answer also works, but it's based on programming the event control registers directly. Given a set of hardware events, figuring out how to schedule these events on the available hardware performance counters can be difficult. The perf_event subsystem handles this problem for you, which is a major advantage.

The perf_event subsystem supports rdpmc since Linux 3.4.

Starting with <linux/perf_event.h> (plus <string.h>, <sys/mman.h>, <unistd.h>, and <sys/syscall.h> for the helpers below), the following works:

  1. Call perf_event_open() to prepare to read a counter with type = PERF_TYPE_HARDWARE and config = PERF_COUNT_HW_BRANCH_MISSES:

    struct perf_event_attr attr ;
    int perf_fd ;

    memset(&attr, 0, sizeof(attr)) ;

    attr.type   = PERF_TYPE_HARDWARE ;
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.size = sizeof(attr) ;        // for completeness
    attr.exclude_kernel = 1 ;         // count user-land events only

    perf_fd = (int)sys_perf_event_open(&attr, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) ;
                                      // this pid, any cpu, no group_fd
    

    where:

    static long
    sys_perf_event_open(struct perf_event_attr* attr,
                                  pid_t pid, int cpu, int group_fd, unsigned long flags)
    {
      return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags) ;
    }
    
  2. Associate perf_fd with an mmap'ed page:

    struct perf_event_mmap_page* perf_mm ;
    
    perf_mm = mmap(NULL, page_size, PROT_READ, MAP_SHARED, perf_fd, 0) ;
    

    page_size can be 4096, for example. The first page is the metadata page (struct perf_event_mmap_page) that this rdpmc approach needs; any additional pages would form the ring buffer used for storing samples. See the "Overflow handling" section of the documentation.

  3. To read the counter, you need to combine some information from perf_mm with what you read using the RDPMC instruction:

    uint64_t  offset, count ;
    uint32_t  lock, check, a, d, idx ;

    lock = perf_mm->lock ;
    do
      {
        check = lock ;
        __asm__ volatile("":::"memory") ;
        idx = perf_mm->index ;      // 0 means the event is not currently active
        // Check that you're allowed to execute rdpmc. You can do this check once.
        // Check also that the event is currently active (idx != 0).
        // Starting with Linux 3.12, use cap_user_rdpmc.
        if (perf_mm->cap_user_rdpmc && idx) {
           // cap_user_rdpmc cannot change at this point because no code
           // that executes here changes it. So it's safe.
           // RDPMC takes the hardware counter number, which is index - 1.
           __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (idx - 1)) ;
        }
        // In case of signed event counts, you also have to use pmc_width.
        // See the docs.
        offset = perf_mm->offset ;
        __asm__ volatile("":::"memory") ;
        lock = perf_mm->lock ;
      }
    while (lock != check) ;

    count = ((uint64_t)d << 32) + a ;
    if (perf_mm->pmc_width != 64)
      {
        // need to sign-extend the low perf_mm->pmc_width bits of count.
      } ;
    count += offset ;
    

    If the thread is not interrupted between the "start" and "end" reads, then I think we can assume that the perf_mm stuff will not change. But if it is interrupted, then the kernel can update perf_mm stuff to account for any changes that affect this timing.

  4. Note: The overhead around the RDPMC instructions is not huge, but I am experimenting with stripping all this back and seeing whether I can use the RDPMC results directly, provided that perf_mm->lock does not change.
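
Putting the steps above together, per-branch usage in the spirit of the asker's "Trial 2" could look like the sketch below. The helper wraps the read sequence from step 3 (sign extension omitted), and the helper name and the way deltas are stored are my own choices rather than part of this answer.

#include <stdint.h>
#include <linux/perf_event.h>

// Sketch: read the branch-miss counter prepared in steps 1 and 2, using the
// lock/index/offset protocol from step 3 (sign extension omitted for brevity).
static uint64_t read_branch_misses(volatile struct perf_event_mmap_page* pm){
    uint64_t offset, count;
    uint32_t lock, check, a = 0, d = 0, idx;

    lock = pm->lock;
    do {
        check = lock;
        __asm__ volatile("":::"memory");
        idx = pm->index;                            // 0 => event not active
        if (pm->cap_user_rdpmc && idx)
            __asm__ volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(idx - 1));
        offset = pm->offset;
        __asm__ volatile("":::"memory");
        lock = pm->lock;
    } while (lock != check);

    count = ((uint64_t)d << 32) + a;
    return count + offset;
}

// Per-branch measurement of the asker's region of interest:
//   for (int i = 0; i < sequence_len; i++) {
//       uint64_t before = read_branch_misses(perf_mm);
//       if (arr[i]) do_sth();
//       uint64_t after = read_branch_misses(perf_mm);
//       res[i] = (int)(after - before);            // usually 0 or 1
//   }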

Dewy answered 17/2, 2020 at 17:19 Comment(6)
There is a __rdpmc intrinsic, but apparently it was buggy until GCC 6.5 / 7.4 / 8.3; before that it wasn't properly volatile. If you have a newer GCC you can use it, but I guess inline asm is fine. You left out C vars for the output of rdpmc. Normally you want "=a"(low_half_result) or something. It's a syntax error to omit the (var_name) part. – Endothermic
Thanks. Fixed to "=a" (a), "=d" (d). – Dewy
@Hadi: thanks for the edits. Is it necessary to check if (pc->cap_user_rdpmc && idx) in the read loop? I mentioned time_offset etc. because the code sample in the documentation that shows how to use rdpmc uses it, but it's not necessary to do so for these purposes. You changed the page_size to say "4096 for example": do you mean it can be 4096 for this purpose -- namely, reading PERF_TYPE_HARDWARE counters using rdpmc? You also pointed at "Overflow handling" in the "documentation": how is that relevant in this case? Finally: how do I tell when I have a "signed event count"? – Dewy
@ChrisHall idx is invalid if the event is not currently active (e.g., due to multiplexing). If you try to rdpmc from an invalid idx, you'll either read the counter of a different event or an exception will occur. It may be sufficient to check cap_user_rdpmc only once at the beginning of the program if you know for sure that no one will disable user-mode rdpmc later for some reason. That buffer is used for holding event samples. When the buffer gets full, the kernel invokes the function you registered to process the buffer. The documentation discusses how the buffer is used. – Bulley
@HadiBrais: I have failed to discover whether the hardware configuration of counters is per thread, per process, or system wide. If per thread/process, then I guess I can assume that idx will change only if/when I call perf_event_open()? I see cap_user_rdpmc can be changed from outside the process, but I don't care if a timing test then fails -- I care about the timing overhead! The "documentation" (or FM) says "Overflows are generated only for sampling events"... I think I am configuring "counter events", so I only require 1 page for the metadata because no ring buffer is required? – Dewy
@ChrisHall They're per thread, but a single thread can schedule more hardware events than there are hardware counters, which triggers multiplexing. That's how some events can be enabled but not active. Sure, you can remove the cap_user_rdpmc check if you can guarantee that user-mode rdpmc is enabled at the time it's executed. Otherwise, the code will crash. – Bulley