Performance of "conditional call" on amd64

When considering a conditional function call in a critical section of code, I found that both gcc and clang will branch around the call. For example, for the following (admittedly trivial) code:

#include <cstdint>

int32_t __attribute__((noinline)) negate(int32_t num) {
    return -num;
}

int32_t f(int32_t num) {
    int32_t x = num < 0 ? negate(num) : num;
    return 2*x + 1;
}

Both GCC and clang compile this to essentially the following:

.global _f
_f:
    cmp     edi, 0
    jg      after_call
    call    _negate
after_call:
    lea     rax, [rax*2+1]
    ret

This got me thinking: what if x86 had a conditional call instruction like ARM does? Imagine there were such an instruction "ccallcc" with semantics analogous to cmovcc. Then you could do something like:

.global _f
_f:
    cmp     edi, 0
    ccalll  _negate
    lea     rax, [rax*2+1]
    ret

Although we can't avoid the branch prediction, we do eliminate a branch: in the actual GCC/clang output, we are forced to branch regardless of whether num < 0, and if num < 0 we have to branch twice. This seems wasteful.

No such instruction exists in amd64, but I devised a way to simulate one. I did this by breaking call func into its component parts: push rip (well, technically [rip+label_after_call_instruction]) and then jmp func. We can make the jmp conditional, but there is no conditional push. We can simulate this by computing [rip+label_after_call_instruction] and writing it to the appropriate location on the stack, then conditionally updating rsp if we plan to call the function (which effectively "pushes" [rip+label_after_call_instruction]). It looks something like this:

.global _f
_f:
    cmp     edi, 0

    # ccalll _negate
    lea     rax, [rip+after_ccall]  # Compute return address
    mov     [rsp-8], rax            # Prepare to "push" return address
    lea     rax, [rsp-8]            # Compute rsp (after push)
    cmovl   rsp, rax                # Conditionally push (by actually changing rsp)
    jl      _negate                 # "Conditional call"
after_ccall:

    lea     rax, [rax*2+1]
    ret

There are a few potential downsides to this approach:

  • It introduces several instructions (but they total fewer cycles than the branch-mispredict penalty)
  • It requires a write to memory (but the stack probably is cached?)
  • It always executes the two leas and the mov even if the call isn't made (but my understanding is that this doesn't matter, since cmovcc, for example, takes the same number of cycles as mov)

To examine the properties of each of these approaches, I ran the critical sections through iaca. If you have it installed (and you clone my benchmark gist below), you can run make iaca to see for yourself. Pass IACAFLAGS='-arch=...' to specify a different arch.

The output for the branch over approach:

Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  ./branch_over_call_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 0.82 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  36
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  0.5     0.0  |  0.0  |  0.3     0.0  |  0.3     0.0  |  1.0  |  0.0  |  0.5  |  0.3  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      | 0.5         |      |             |             |      |      | 0.5  |      | jnle 0x6
|   4^#    |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.3  | call 0x5
Total Num Of Uops: 5

And the output for the conditional call approach:

Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  ./conditional_call_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.94 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  35
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  1.0  |  0.5     0.0  |  0.5     0.0  |  1.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             | 1.0  |             |             |      |      |      |      | lea rax, ptr [rip]
|   2^     |             |      | 0.5         | 0.5         | 1.0  |      |      |      | mov qword ptr [rsp-0x8], rax
|   1      |             |      |             |             |      | 1.0  |      |      | lea rax, ptr [rsp-0x8]
|   1      | 1.0         |      |             |             |      |      |      |      | cmovl rsp, rax
|   1      |             |      |             |             |      |      | 1.0  |      | jl 0x6
Total Num Of Uops: 6

It looks like the conditional call approach uses more of the hardware. But I found it interesting that the conditional approach has only 1 more uop (the branch over approach had 5 uops). I guess this makes sense given that under the hood the call turns into a push and a jmp (and the push turns into rsp math and a memory mov). This would suggest to me that the conditional call approach is approximately equivalent (although maybe my simplistic analysis is flawed here?).

At the least, my overarching suspicion was that by introducing several instructions between the cmp and the jl, I'd make it possible for the result of the cmp to be available before the jl is executed (thus preventing the need for branch prediction at all). Although maybe the pipeline is longer than this? This treads into areas with which (despite having read and retained a medium-depth understanding of Agner Fog's optimization manuals) I am not very familiar.

My hypothesis is that for a uniform distribution of (negative and positive) nums (where branch prediction won't be able to predict the branch around the call), my "conditional call" approach will outperform branching around the call.

I wrote a harness to benchmark the performance of these two approaches. You can git clone https://gist.github.com/baileyparker/8a13c22d0e26396921f501fe87f166a9 and run make to build and run the benchmarks on your machine.

Here's the runtime of 100 iterations of each approach on an array of 1,048,576 numbers (uniformly distributed between int32_t min and max).

|                    CPU                    | Conditional Call | Branch Over |
|-------------------------------------------|-----------------:|------------:|
| Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz |       10.9872 ms |   8.4602 ms |
| Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz |        8.8132 ms |   7.0704 ms |

These results are consistent across runs, and although the gap is magnified by increasing the array size (or the number of iterations), branching over always wins.

I also tried reordering the conditional call steps (computing and conditionally updating rsp first, then writing to the stack), but this performed similarly.

What hardware detail am I missing (or misunderstanding) that explains this? From my calculations the extra instructions add somewhere around 6-7 cycles, while a branch mispredict costs about 15. So, on average, half the numbers are predicted wrong, meaning each iteration costs 15/2 cycles for the branch over approach but always 6-7 cycles for the conditional call. The uop counts from iaca suggest the approaches are even closer in this regard. So shouldn't the performance be closer? Is my example code too contrived/short? Is my benchmarking technique not appropriate for this kind of low-level critical-section testing? Is there a way to reorder/change the conditional call to make it more performant (better than or comparable to the branch over approach, maybe)?

tl;dr: Why does my conditional call code (4th snippet) perform worse in my benchmark than what gcc/clang produces (a conditional jump over the call, 2nd snippet) for the code in the 1st snippet?

Magisterial answered 25/2, 2019 at 14:37 Comment(9)
By performing a function call through a push and a jump, you do not make an entry in the return predictor stack, trashing the return prediction. This causes a huge latency spike on the return from your conditionally called function and on all subsequent returns. The branch predictor works pretty well and an extra jump is cheap compared to the cost of the function you call, so I don't quite see the point of what you're trying to do.Hedveh
Read this article for some information on return prediction.Hedveh
@Hedveh Oh wow, that's almost certainly it. The numbers in Table 1 from that link tell that exact story. Doing some rough math, 23 extra cycles (for jmp + ret vs call + ret) @ 3.1 GHz for 1,048,576 calls is +7.7 ms. Obviously that's much more than observed, but perhaps the branch predictor gets better since the return is always to the same location.Magisterial
Cool! Write an answer detailling your findings so you can get all the upvotes.Hedveh
I'm trying to compile your code, but the build fails using both g++ 5.4 and g++ 7.3. With g++ 5.4, I think it fails because it does not support class template argument deduction, which is required for the expression uniform_int_distribution in random_nums. With g++ 7.3, the error says "expected constructor, destructor, or type conversion before '(' token" at TEST_CASE in the benchmark.cpp file.Longoria
For g++ 5.4, the error can be resolved by adding angle brackets <> to uniform_int_distribution, but I'm not sure how to resolve the error emitted by g++ 7.3 because I'm not familiar with TEST_CASE.Longoria
The random_nums function creates a vector that is twice as large as the value specified in SIZE. The first half of the vector is all zeros and only the second half is randomized. Is this intentional or is it a bug?Longoria
@HadiBrais It might be that g++5.4 is too old? 7.3 should work though, not sure what's going on there. FWIW my second benchmark was compiled with gcc 8, so maybe it only works with 8. TEST_CASE is from Catch2. And no, that's a bug. I've fixed it in the gist, but it didn't change the results. If anything it actually gave the branch method more of an edge (perhaps that indicates that the conditional call approach benefits from good branch prediction--which is strange given that the intent was to avoid it).Magisterial
I'm doing microarchitectural analysis for both methods. I was not aware of the bug and the results made zero sense to me, until I found the bug. I got away with TEST_CASE by replacing it with the main function and my own measurement code.Longoria

You can determine exactly why the conditional_call approach is slower than branch_over_call. You've done your experiments on two KBL (Kaby Lake) processors, but the blog post you were referred to doesn't discuss how the RAS works on KBL. So the first step of the analysis is to determine whether the ret in the negate function is being mispredicted or not (as would happen on earlier microarchitectures). The second step is to determine the cost of mispredicting that ret instruction on total execution time. The closest thing I have to KBL is CFL (Coffee Lake), and my numbers turned out to be close to yours. The only relevant difference between the two is that the LSD (loop stream detector) is enabled in CFL but disabled in KBL. However, the LSD is irrelevant in this case because the call instruction in the loop prevents the LSD from detecting any loop. You can also easily repeat the same analysis on KBL.

There are several ways to analyze the behavior of branch instructions. But in this particular case, the code is simple enough for the event counting method to reveal all the information that we need about every static branch instruction.

The BR_INST_RETIRED_* performance events can be used to count the total number of dynamic branch instructions retired and the numbers of specific types of retired branch instructions, including conditional branches, calls, and returns. The BR_MISP_RETIRED_* events can be used to count total mispredictions, total conditional mispredictions, and total call mispredictions.

The complete control-flow graph of conditional_call looks like this:

           total   misp
call         1      0
    jl       1     0.5
       ret  0.5     1
    ret      1      0
jne          1      0

The first call instruction calls the conditional_call function, which contains jl and ret. The jl instruction conditionally jumps to the negate function, which contains ret. The jne instruction is used for the loop. The numbers in the first column are normalized by the total number of iterations; the numbers in the second column are normalized by the number of times that instruction was executed. We know from the static structure of the program that call, jl, conditional_call's ret, and jne are each executed once in every iteration. The innermost ret is only executed when the jl branch is taken. Using the performance events, we can count the total number of executed return instructions and subtract from it the total number of iterations to get the number of times the innermost ret is executed. Because the input is randomized according to the uniform distribution, it shouldn't be surprising that the innermost ret is executed half of the time.

The call instruction is never mispredicted. The jne instruction is also never mispredicted, except for its last execution (where it exits the loop). Therefore, we can attribute the total number of conditional mispredictions to the jl instruction. That can be subtracted from the total number of mispredictions to get the number of return mispredictions, which can be attributed to either or both of the return instructions. The second ret may mispredict when the misprediction of the first ret clobbers or misaligns the RAS. One way to determine whether the second ret is ever mispredicted is by using precise sampling of BR_MISP_RETIRED.ALL_BRANCHES. Another way is by using the method described in the blog post you cited. Indeed, only the innermost ret is mispredicted. The fact that jl is mispredicted half of the time suggests that the instruction is being predicted either always taken or always not taken.

The complete control-flow graph of branch_over_call looks like this:

           total   misp
call         1      0
    jg       1     0.5
    call    0.5     0
        ret 0.5     0
    ret      1      0
jne          1      0

The only instruction that is mispredicted is jg, which is mispredicted half of the time.

To measure the average cost of a single ret misprediction in the conditional_call approach, the ret instruction can be replaced with a lea/jmp sequence so that the BTB (branch target buffer) rather than the RAS is used for making predictions. With this change, the only instruction that is mispredicted is jl. The difference in execution time can be considered an estimate of the total cost of ret mispredictions. On my CFL processor, this is about 11.3 cycles per ret misprediction. In addition, conditional_call has become about 3% faster than branch_over_call. Your numbers on KBL indicate that the average cost of a ret misprediction is about 13 cycles. I'm not sure what the reason for this difference is; it may not be microarchitectural. I used gcc 7.3 but you used gcc 8, so perhaps there are differences in the code or in the alignment of different pieces of code that are causing the discrepancy between our results.
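
For concreteness, one plausible way to write such a lea/jmp replacement of negate's ret is sketched below; this is an illustration, not necessarily the exact sequence used for the measurements above:

_negate:
    mov     eax, edi
    neg     eax
    lea     rsp, [rsp+8]        # discard the return address without executing ret
    jmp     qword ptr [rsp-8]   # jump through it instead; predicted by the BTB, not the RAS

(Reading [rsp-8] after the lea is fine in user code because that slot lies within the System V red zone.) Since the jump target is always the same address, the BTB can predict it reliably, which is consistent with only jl remaining mispredicted.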

Longoria answered 6/3, 2019 at 21:18 Comment(3)
This is a fantastic analysis! Thanks so much! I'm giving it another read-through to research things I'm unfamiliar with (like the LSD). Just to be clear, though, the observation of conditional_call being 3% faster on CFL was with the ret in negate replaced by lea+jmp, correct? I guess that's probably not significant enough, but definitely interesting. Your point about gcc versions and the framework is duly noted. I was lazy and copied and pasted some higher-level benchmarking code from a previous project of mine. I should've just written it all out in assembly.Magisterial
@BaileyParker Yep, that's correct. Note that the speedup can be higher or lower depending on the surrounding code.Longoria
@BaileyParker You can learn more about the LSD here and here and here. Note that the LSD is also called the loopback buffer.Longoria

As @fuz pointed out in the comments, the performance issue is almost certainly due to the Return Address Stack (RAS), which is a specialized branch predictor for function returns.

Because call and ret are separate instructions from jmp and manual stack modification, CPUs are clued in to the intent of the running code. In particular, when we call a function it is probably going to ret, and when it does we jump back to the rip pushed before the call. In other words, calls are usually paired with a ret. The CPU leverages this by keeping a fixed-length stack of just return addresses, called the return address stack (RAS). A call instruction, in addition to pushing the return address onto the actual in-memory stack, also pushes it onto the RAS. This way, when a ret is encountered, the CPU can pop the RAS (which is much faster than accessing the in-memory stack) and speculatively execute the return. If the address popped from the RAS matches the one popped from the stack, the CPU continues with no penalty. However, if the RAS predicted the wrong return address, a pipeline flush occurs, which is costly.
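
As a schematic illustration (annotations on the code already shown above, not meant to assemble on its own), here is the difference the RAS sees:

# Paired call/ret (what the hardware expects):
    call    _negate             # pushes the return address onto the stack AND the RAS
    ...
    ret                         # in _negate: pops the RAS; the predicted target is correct

# The "conditional call" from my question:
    cmovl   rsp, rax            # the return address reaches only the in-memory stack
    jl      _negate             # plain jump: no RAS entry is made
    ...
    ret                         # in _negate: the RAS top is stale, so the return is mispredicted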

My original intuition was that the conditional instructions would be better because they would give the result of the comparison time to arrive before the jump. However, whatever benefit that may have provided, the unbalanced jmp/ret (my conditional call replaced call with jmp, but the called function still used a ret) meant the RAS would likely always predict the wrong return address (and thus my approach, despite originally trying to avoid this, causes more pipeline stalls). The speedup from the RAS is more significant than my "optimization", so the branching approach outperformed the conditional call approach.

According to some empirical results, mismatching call and ret (in particular, using jmp + ret) takes 5-6 times more cycles than properly pairing call and ret. Some napkin math suggests that a penalty of +21 cycles at 3.1 GHz for 1,048,576 calls adds about 7.1 ms to the total runtime. The observed slowdown was less than that. This is likely a combination of the conditional instructions delaying the jump until the condition was ready and the fact that the jumps oscillated between fixed locations in memory (which the other branch predictors likely became good at predicting).

Magisterial answered 1/3, 2019 at 12:8 Comment(3)
Fun fact: you can safely do a conditional tailcall, like jg _negate, in cases where that's viable. (The rel32 jump range is identical to a direct near call rel32 on x86-64, and can cover the full address space in 32-bit mode). Current C compilers fail to do that optimization (e.g. gcc.gnu.org/bugzilla/show_bug.cgi?id=69576), but you can/should do it yourself when appropriate (instead of conditionally jumping over a jmp tailcall).Superstitious
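(For illustration, a hypothetical sketch of such a conditional tail call, using a variant of f with no work after the call, say int32_t g(int32_t num) { return num < 0 ? negate(num) : num; }; the name _g is made up:)

.global _g
_g:
    mov     eax, edi        # result for the num >= 0 path
    test    edi, edi
    js      _negate         # conditional tail call: _negate's ret returns directly to g's caller
    ret                     # either way, call/ret (and thus the RAS) stays balanced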
Interesting! That was going to be one of my follow-up questions when I noticed that both gcc and clang would optimize an unconditional tail call to a jmp, but refused to do so for the conditional one. I introduced the lea rax, [rax*2+1] to make that not possible for the benchmark.Magisterial
@BaileyParker - you could try to fix up your method by also replacing the ret instruction at the end of your function with a jmp to a register that you populate at the call side. I.e., dispense with the stack entirely and pass the return address in a register. Of course, this prevents you from calling any normal functions since this is effectively a custom calling convention, but it's worth seeing if this version can beat call/ret (I suspect it can, on some benchmarks).Sourpuss
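(A hypothetical sketch of that suggestion, with r11 as an arbitrary scratch register and made-up label names; it is a custom calling convention, so it only works for functions you control:)

_f:
    mov     eax, edi                # result if the call is skipped
    cmp     edi, 0
    lea     r11, [rip+after_ccall]  # pass the return address in a register (lea leaves flags alone)
    jl      _negate_r11             # conditional "call" with no stack write at all
after_ccall:
    lea     rax, [rax*2+1]
    ret

_negate_r11:
    mov     eax, edi
    neg     eax
    jmp     r11                     # "return" via an indirect jump; no ret, so the RAS is never consulted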
