Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

This is related to, but not the same as, this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction. It is also slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++

The following is not a real-world test case: this primality testing algorithm is not sensible. I suspect any real-world algorithm would never execute such a small inner loop quite so many times (num is a prime of size about 2**50). In C++11:

#include <cmath>   // for sqrt

using nt = unsigned long long;
bool is_prime_float(nt num)
{
   for (nt n=2; n<=sqrt(num); ++n) {
      if ( (num%n)==0 ) { return false; }
   }
   return true;
}

Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).

jmp .L20
.p2align 4,,10
.L37:
pxor    %xmm0, %xmm0
cvtsi2sdq   %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb  .L36   // Exit the loop
.L20:
xorl    %edx, %edx
movq    %rbx, %rax
divq    %rcx
testq   %rdx, %rdx
je  .L30   // Failed divisibility test
addq    $1, %rcx
jns .L37
// Further code to deal with case when ucomisd can't be used

I time this using std::chrono::steady_clock. I kept getting weird performance changes just from adding or deleting other code. I eventually tracked this down to an alignment issue. The directive .p2align 4,,10 tries to align to a 2**4 = 16 byte boundary, but only uses at most 10 bytes of padding to do so, I guess to balance alignment against code size.
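
(For concreteness, a minimal timing harness of this kind might look like the following; the actual driver code isn't shown above, and the particular prime is just the illustrative value used elsewhere in this thread:)

#include <chrono>
#include <cstdio>

// is_prime_float() and the nt typedef as defined above.

int main()
{
   const nt num = 1000000000000037ULL;   // a prime near 2**50 (value assumed for illustration)
   const auto start = std::chrono::steady_clock::now();
   const bool p = is_prime_float(num);
   const auto stop  = std::chrono::steady_clock::now();
   const std::chrono::duration<double> secs = stop - start;
   std::printf("prime=%d  time=%.3f s\n", int(p), secs.count());
}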

I wrote a Python script to replace .p2align 4,,10 with a manually controlled number of nop instructions. The following scatter plot shows the fastest 15 of 20 runs, with time in seconds on the y-axis and the number of padding bytes on the x-axis:

[Scatter plot: run time (seconds) vs. padding bytes]

From objdump with no padding, the pxor instruction occurs at offset 0x402f5f. Running on a laptop (Sandybridge i5-3210m, turbo boost disabled), I found that:

  • For 0 bytes of padding: slow performance (0.42 s)
  • For 1-4 bytes of padding (offsets 0x402f60 to 0x402f63): slightly better (0.41 s, visible on the plot)
  • For 5-20 bytes of padding (offsets 0x402f64 to 0x402f73): fast performance (0.37 s)
  • For 21-32 bytes of padding (offsets 0x402f74 to 0x402f7f): slow performance (0.42 s)
  • The pattern then repeats with a 32-byte period

So 16-byte alignment doesn't give the best performance; it puts us in the slightly-better (or just lower-variation, judging from the scatter plot) region. Alignment of 32 plus 4 to 19 bytes gives the best performance.

Why am I seeing this performance difference? Why does this seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)?

I don't see any branch-prediction problems. Could this be a uop cache quirk?

By changing the C++ algorithm to cache sqrt(num) in a 64-bit integer and then making the loop purely integer-based, I remove the problem: alignment now makes no difference at all.
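
A sketch of that integer-only variant (my guess at its shape, since the exact code isn't shown; the off-by-one risk from rounding sqrt of a ~2**50 value through double is ignored here):

#include <cmath>

using nt = unsigned long long;

bool is_prime_int(nt num)
{
   const nt limit = static_cast<nt>(std::sqrt(static_cast<double>(num)));  // hoisted out of the loop
   for (nt n = 2; n <= limit; ++n) {      // pure integer compare, no cvtsi2sd/ucomisd per iteration
      if ( (num%n)==0 ) { return false; }
   }
   return true;
}

With the comparison done on integers, the loop body compiles to a handful of integer instructions around the div, and (as noted above) its alignment no longer matters.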

Wold answered 13/11, 2014 at 11:9 Comment(30)
A modern Intel processor using your basic algorithm can test for primes up to 2^64 in on the order of 10 seconds, so that's not actually that bad.Checkered
stackoverflow.com/questions/25958649/…Checkered
Let me demonstrate my ignorance on the subject of primes. Assuming you want to know with 100% certainty (not just with high probability) whether a number is prime, how bad is this trial-division algorithm?Checkered
I've never really done tests; but there are deterministic algorithms which asymptotically are much better than $O(\sqrt{n})$, see en.wikipedia.org/wiki/Primality_test#Fast_deterministic_tests However, trial division does have the benefit of giving a factor, see en.wikipedia.org/wiki/Integer_factorizationWold
Thanks, I guess I'm more interested in getting the factor than just knowing if it's prime. Anyway, this digresses from your real question. Let's hope you get a good answer.Checkered
See also: stackoverflow.com/questions/17896714/…Wold
Have you tried running it with vtune?Collings
i5-3210m should be IvyBridge, the die-shrink of SandyBridge. It does have some minor changes from SnB, but I think the front-end loop buffer should behave the same here. Also, that loop is a great example of why you should hoist the sqrt manually, with int upper_bound=sqrt(num), so that n<=upper_bound is an int comparison, not double. There's no need to cvtsi2sd the loop counter every iteration! Regardless, it's surprising that alignment makes any difference at all for a loop this short, which should fit in the loop buffer (26 uops).Nellienellir
Oh nvm, this loop doesn't fit in the uop cache, because 64-bit DIV is 35-57 uops. It's micro-coded with a variable number of uops, so IDK how it's stored in the frontend. I'll see if I can write this up as an answer.Nellienellir
@PeterCordes I'd draw your attention to the fact that on HSW the loop buffer is 56 uops and dynamically shared between the two threads 28/28, so div at 37 would unconditionally blow up its half of the loop buffer if OP was hyperthreading and force the use of the uop cache.Anna
@IwillnotexistIdonotexist: This is an IvyBridge, not HSW, so 64-bit DIV is not fixed at 37 uops (like HSW). It can still exceed the uop queue (loop buffer), even when HT is disabled. (IvB does dynamically combine into a 56 entry buffer, unlike SnB.) It's not clear to me how micro-coded instructions interact with the loop buffer at all. Presumably a loop with a data-dependent number of uops can't use it.Nellienellir
Micro-coded instructions take an entire uop cache line to themselves. The main difficulty in figuring out exactly what's happening here is that the question doesn't specify how far away those other branch targets are. rel8 vs. rel32 jumps change the alignment of the rest of the code inside the loop, for a given starting offset.Nellienellir
I'm not totally clear on what Intel means by "All uops in a Way (uop cache line) represent insns which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region." I think they mean a boundary-crossing goes in the earlier 32B block, containing the first byte of the instruction. But it's not totally clear because while an insn is executing, EIP=the start of the next insn (for relative jumps or RIP-relative addressing). I'm also not sure about macro-fused compare-and-branch. Only a 64B boundary stops macro-fusion, not just 32B.Nellienellir
@PeterCordes For what it's worth, my attempt, together with my prime (1000000000000037) and machine code. I can't reproduce the issue on any alignment 0-16. Every run takes within 3% of the same number of unhalted core cycles. The one useful thing you learn is that the Loop Stream Detector contributes nothing (probably due to microcoded division). I did these timings with a small C library I've been concocting for some time for my Haswell performance-monitoring counter needs; You might be interested. I'm feeling like it's ready for some more public exposure.Anna
@PeterCordes Confirmed: div does not permit use of the Loop Stream Detector / loop buffer / whatever they'll call it next.Anna
@IwillnotexistIdonotexist: thanks. Did you manage to figure out whether the uops are coming from the uop cache, or if it's switching to the legacy decoder? software.intel.com/en-us/node/589932 mentions a IDQ.MS_SWITCHES event: Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer (for SnB EP). I assume you tested on Haswell?Nellienellir
@PeterCordes In my logs above, basically 100% of all ops come from MITE, with no switching. Also, I can strengthen my statement: All microcoded ops (Those taking >4 uops) prevent use of the LSD. I just compared 3-instruction tight loops involving dpps v, m, i (6 uops => microcoded) versus dppd v, m, i (4 uops => not microcoded) and in the former, 0 uops come from LSD and in the latter, nearly 100% do. And yes, I tested on HSW (since that's what I got), but I don't think IVB differs that much from HSW in this respect (?).Anna
@IwillnotexistIdonotexist: AFAIK they're the same. They might have different perf counter events available, which is why I mentioned it. Thanks for testing DPPS vs DPPD, BTW. Good confirmation that it applies even with a small fixed number of non-data-dependent uops.Nellienellir
@IwillnotexistIdonotexist: It seems that MITE includes the uop cache, right? That Intel perf counter doc calls it the "legacy decode pipeline", but the optimization manual (and Agner Fog's stuff) do talk about micro-coded uops going in the uop cache. (Taking a whole line each). Agner has done pretty extensive testing of the uop cache, I think.Nellienellir
@PeterCordes It does not: DSB = uop cache per this document.Anna
@IwillnotexistIdonotexist: Ah right, I was thinking DSB was yet another name for the loop buffer, not the uop cache, since the name fits much better there. Who names this crap? Probably someone on LSD (the chemical kind).Nellienellir
@IwillnotexistIdonotexist: ok, so when you tested DPPS, did it come from the uop cache?Nellienellir
@PeterCordes I did dpps for 100K iterations and my counters give 700K uops, of which: idq.dsb_uops 499966284 and idq.ms_dsb_uops 200000595.Anna
@PeterCordes Oh wait I was mistaken. I just coded up a loop: div rcx; dec rcx; jne loop and iterated 100M times dividing zero by a counter. The damage is 3.7B uops, of which 3.2B were fed into DSB by the microcode sequencer and 0.5B came direct from DSB. 0 came from LSD.Anna
@PeterCordes Which quite frankly sounds like dec+jne fused for 100M uops in DSB, the first 4 uops of division also exist in DSB, but the 32 remaining are bottlenecked on the MS. This, combined with the fact that Haswell's division is 36 uops and evenly spread p0 p1 p5 p6 (All of which have integer ALUs and of which p6 is a port for predicted-taken branches), makes me think that internally, division executes a high-radix, 4-uop/iteration loop producing ~8 bits at a time of the quotient.Anna
@Iwill Interesting. That makes sense, and is consistent with what I vaguely recall reading about Intel's divide hardware. You're probably right about the p6 uops being predicted-taken dec&branch on a loop counter.Nellienellir
Fun fact: microcode branches (like rep movs startup) aren't subject to dynamic branch prediction by the usual branch-prediction hardware (and this is why it has such high startup overhead even when used repeatedly, as Andy Glew (designer of the original P6 rep-string implementation) explained). They don't mispredict AFAIK, so maybe microcode branches are special and aren't speculatively executed? Obviously they can loop efficiently, though.Nellienellir
@PeterCordes I made public the tool I used to make those benchmarks, libpfc. I strongly suspect you'll enjoy it.Anna
@IwillnotexistIdonotexist: I finally posted the answer I had mostly typed up before getting bogged down in different branch lengths.Nellienellir
@MatthewDaws - how did you re-assemble the modified assembly to generate the binaries you tested? In particular, what is the section alignment of the .text section in your binary? You can use objdump -h to see it. If the alignment is less than 32, it's hard to interpret your results because we don't know if the code is aligned to a 32B boundary, or to a 16B but not 32B boundary. I was able to reproduce similar results on Skylake but I can't tell if they line up well with yours without that info.Gantz

Here's what I found on Skylake for the same loop. All the code to reproduce my tests on your hardware is on github.

I observe three different performance levels based on alignment, whereas the OP only really saw 2 primary ones. The levels are very distinct and repeatable2:

[Plot: performance vs. offset from 32B alignment, showing three distinct levels]

We see three distinct performance levels here (the pattern repeats starting from offset 32), which we'll call regions 1, 2 and 3, from left to right (region 2 is split into two parts straddling region 3). The fastest region (1) is from offset 0 to 8, the middle (2) region is from 9-18 and 28-31, and the slowest (3) is from 19-27. The difference between each region is close to or exactly 1 cycle/iteration.

Based on the performance counters, the fastest region is very different from the other two:

  • All the instructions are delivered from the legacy decoder, not from the DSB1.
  • There are exactly 2 decoder <-> microcode switches (idq_ms_switches) for every iteration of the loop.

On the other hand, the two slower regions are fairly similar:

  • All the instructions are delivered from the DSB (uop cache), and not from the legacy decoder.
  • There are exactly 3 decoder <-> microcode switches per iteration of the loop.

The transition from the fastest to the middle region, as the offset changes from 8 to 9, corresponds exactly to when the loop starts fitting in the uop cache, because of alignment issues. You can count this out in exactly the same way as Peter did in his answer:

Offset 8:

  LSD? <_start.L37>:
  ab 1 4000a8:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1 4000ac:  f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1 4000b1:  66 0f 2e f0             ucomisd xmm6,xmm0
  ab 1 4000b5:  72 21                   jb     4000d8 <_start.L36>
  ab 2 4000b7:  31 d2                   xor    edx,edx
  ab 2 4000b9:  48 89 d8                mov    rax,rbx
  ab 3 4000bc:  48 f7 f1                div    rcx
  !!!! 4000bf:  48 85 d2                test   rdx,rdx
       4000c2:  74 0d                   je     4000d1 <_start.L30>
       4000c4:  48 83 c1 01             add    rcx,0x1
       4000c8:  79 de                   jns    4000a8 <_start.L37>

In the first column I've annotated how the uops for each instruction end up in the uop cache. "ab 1" means they go in the set associated with addresses like ...???a? or ...???b? (each set covers 32 bytes, i.e. 0x20), while the 1 means way 1 (out of a maximum of 3).

At the point marked !!!!, the loop busts out of the uop cache, because the test instruction has nowhere to go: all 3 ways are already used up.

Let's look at offset 9 on the other hand:

00000000004000a9 <_start.L37>:
  ab 1 4000a9:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1 4000ad:  f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1 4000b2:  66 0f 2e f0             ucomisd xmm6,xmm0
  ab 1 4000b6:  72 21                   jb     4000d9 <_start.L36>
  ab 2 4000b8:  31 d2                   xor    edx,edx
  ab 2 4000ba:  48 89 d8                mov    rax,rbx
  ab 3 4000bd:  48 f7 f1                div    rcx
  cd 1 4000c0:  48 85 d2                test   rdx,rdx
  cd 1 4000c3:  74 0d                   je     4000d2 <_start.L30>
  cd 1 4000c5:  48 83 c1 01             add    rcx,0x1
  cd 1 4000c9:  79 de                   jns    4000a9 <_start.L37>

Now there is no problem! The test instruction has slipped into the next 32B line (the cd line), so everything fits in the uop cache.
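
As a quick sanity check on that mapping (assuming the set is chosen purely by which aligned 32-byte region contains the first byte of an instruction, which is my reading of the Intel wording quoted in Peter's answer), you can just compare the addresses shifted right by 5:

#include <cstdint>
#include <cstdio>

// Which 32-byte region (and, under the assumption above, which uop-cache set)
// an instruction's starting address falls in.
static std::uint64_t region32(std::uint64_t insn_start) { return insn_start >> 5; }

int main()
{
    // Offset-8 build: test starts at 0x4000bf, the same 32B region as the div at 0x4000bc,
    // whose set has already used all 3 ways.
    std::printf("offset 8: div region 0x%llx, test region 0x%llx\n",
                (unsigned long long)region32(0x4000bc), (unsigned long long)region32(0x4000bf));

    // Offset-9 build: test starts at 0x4000c0, the first byte of the next 32B region,
    // so it gets a fresh set ("cd") and everything fits.
    std::printf("offset 9: div region 0x%llx, test region 0x%llx\n",
                (unsigned long long)region32(0x4000bd), (unsigned long long)region32(0x4000c0));
}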

So that explains why stuff changes between the MITE and DSB at that point. It doesn't, however, explain why the MITE path is faster. I tried some simpler tests with div in a loop, and you can reproduce this with simpler loops without any of the floating point stuff. It's weird and sensitive to random other stuff you put in the loop.

For example this loop also executes faster out of the legacy decoder than the DSB:

ALIGN 32
    <add some nops here to switch between DSB and MITE>
.top:
    add r8, r9
    xor eax, eax
    div rbx
    xor edx, edx
    times 5 add eax, eax
    dec rcx
    jnz .top

In that loop, adding the pointless add r8, r9 instruction, which doesn't really interact with the rest of the loop, sped things up for the MITE version (but not the DSB version).

So I think the difference between region 1 and regions 2 and 3 is due to the former executing out of the legacy decoder (which, oddly, makes it faster).


Let's also take a look at the offset 18 to offset 19 transition (where region 2 ends and region 3 starts):

Offset 18:

00000000004000b2 <_start.L37>:
  ab 1 4000b2:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1  4000b6: f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1  4000bb: 66 0f 2e f0             ucomisd xmm6,xmm0
  ab 1  4000bf: 72 21                   jb     4000e2 <_start.L36>
  cd 1  4000c1: 31 d2                   xor    edx,edx
  cd 1  4000c3: 48 89 d8                mov    rax,rbx
  cd 2  4000c6: 48 f7 f1                div    rcx
  cd 3  4000c9: 48 85 d2                test   rdx,rdx
  cd 3  4000cc: 74 0d                   je     4000db <_start.L30>
  cd 3  4000ce: 48 83 c1 01             add    rcx,0x1
  cd 3  4000d2: 79 de                   jns    4000b2 <_start.L37>

Offset 19:

00000000004000b3 <_start.L37>:
  ab 1 4000b3:  66 0f ef c0             pxor   xmm0,xmm0
  ab 1 4000b7:  f2 48 0f 2a c1          cvtsi2sd xmm0,rcx
  ab 1 4000bc:  66 0f 2e f0             ucomisd xmm6,xmm0
  cd 1 4000c0:  72 21                   jb     4000e3 <_start.L36>
  cd 1 4000c2:  31 d2                   xor    edx,edx
  cd 1 4000c4:  48 89 d8                mov    rax,rbx
  cd 2 4000c7:  48 f7 f1                div    rcx
  cd 3 4000ca:  48 85 d2                test   rdx,rdx
  cd 3 4000cd:  74 0d                   je     4000dc <_start.L30>
  cd 3 4000cf:  48 83 c1 01             add    rcx,0x1
  cd 3 4000d3:  79 de                   jns    4000b3 <_start.L37>

The only difference I see here is that the first 4 instructions in the offset 18 case fit into the ab cache line, but only 3 in the offset 19 case. If we hypothesize that the DSB can only deliver uops to the IDQ from one cache set per cycle, this means that at some point one uop may be delivered and executed a cycle earlier in the offset 18 scenario than in the 19 scenario (imagine, for example, that the IDQ is empty). Depending on exactly what port that uop goes to in the context of the surrounding uop flow, that may delay the loop by one cycle. Indeed, the difference between region 2 and 3 is ~1 cycle (within the margin of error).

So I think we can say that the difference between 2 and 3 is likely due to uop cache alignment - region 2 has a slightly better alignment than 3, in terms of issuing one additional uop one cycle earlier.


Some additional notes on things I checked that didn't pan out as a possible cause of the slowdowns:

  • Despite the DSB modes (regions 2 and 3) having 3 microcode switches versus the 2 of the MITE path (region 1), that doesn't seem to directly cause the slowdown. In particular, simpler loops with div execute in identical cycle counts, but still show 3 and 2 switches for DSB and MITE paths respectively. So that's normal and doesn't directly imply the slowdown.

  • Both paths execute an essentially identical number of uops and, in particular, an identical number of uops generated by the microcode sequencer. So it's not like there is more overall work being done in the different regions.

  • There wasn't really any difference in cache misses (very low, as expected) at the various levels, in branch mispredictions (essentially zero3), or in any other type of penalty or unusual condition I checked.

What did bear fruit is looking at the pattern of execution unit usage across the various regions. Here's a look at the distribution of uops executed per cycle and some stall metrics:

+----------------------------+----------+----------+----------+
|                            | Region 1 | Region 2 | Region 3 |
+----------------------------+----------+----------+----------+
| cycles:                    | 7.7e8    | 8.0e8    | 8.3e8    |
| uops_executed_stall_cycles | 18%      | 24%      | 23%      |
| exe_activity_1_ports_util  | 31%      | 22%      | 27%      |
| exe_activity_2_ports_util  | 29%      | 31%      | 28%      |
| exe_activity_3_ports_util  | 12%      | 19%      | 19%      |
| exe_activity_4_ports_util  | 10%      | 4%       | 3%       |
+----------------------------+----------+----------+----------+

I sampled a few different offset values and the results were consistent within each region, yet between the regions you have quite different results. In particular, in region 1 you have fewer stall cycles (cycles where no uop is executed). You also have significant variation in the non-stall cycles, although no clear "better" or "worse" trend is evident. For example, region 1 has many more cycles (10% vs 3% or 4%) with 4 uops executed, but the other regions largely make up for it with more cycles with 3 uops executed, and fewer cycles with 1 uop executed.

The difference in UPC4 that the execution distribution above implies fully explains the difference in performance (this is probably a tautology since we already confirmed the uop count is the same between them).
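
As a back-of-the-envelope check (treating the exe_activity_k_ports_util rows as "k uops executed that cycle", which is only an approximation of what those events count), the implied UPC tracks the measured cycle counts reasonably well:

#include <cstdio>

int main()
{
    // Fractions of cycles with 0,1,2,3,4 uops executed, read off the table above.
    const double dist[3][5] = {
        {0.18, 0.31, 0.29, 0.12, 0.10},   // region 1
        {0.24, 0.22, 0.31, 0.19, 0.04},   // region 2
        {0.23, 0.27, 0.28, 0.19, 0.03},   // region 3
    };
    const double cycles[3] = {7.7e8, 8.0e8, 8.3e8};   // measured cycle counts

    double upc[3];
    for (int r = 0; r < 3; ++r) {
        upc[r] = 0;
        for (int k = 0; k < 5; ++k) upc[r] += k * dist[r][k];
    }
    for (int r = 0; r < 3; ++r) {
        // Same total uop count in every region, so cycles should scale as 1/UPC.
        std::printf("region %d: UPC ~ %.2f, predicted cycles vs region 1: %.3f, measured: %.3f\n",
                    r + 1, upc[r], upc[0] / upc[r], cycles[r] / cycles[0]);
    }
}

That works out to roughly 1.65, 1.57 and 1.52 uops/cycle, predicting regions 2 and 3 to be about 5% and 9% slower than region 1, versus about 4% and 8% measured.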

Let's see what toplev.py has to say about it ... (results omitted).

Well, toplev suggests that the primary bottleneck is the front-end (50+%). I don't think you can trust this because the way it calculates FE-bound seems broken in the case of long strings of micro-coded instructions. FE-bound is based on frontend_retired.latency_ge_8, which is defined as:

Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 8 cycles which was not interrupted by a back-end stall. (Supports PEBS)

Normally that makes sense. You are counting instructions which were delayed because the frontend wasn't delivering uops. The "not interrupted by a back-end stall" condition ensures that this doesn't trigger when the front-end isn't delivering uops simply because the backend is not able to accept them (e.g., when the RS is full because the backend is performing some low-throughput instructions).

It kind of seems broken for div instructions, though: even a simple loop with pretty much just one div shows:

FE      Frontend_Bound:                57.59 %           [100.00%]
BAD     Bad_Speculation:                0.01 %below      [100.00%]
BE      Backend_Bound:                  0.11 %below      [100.00%]
RET     Retiring:                      42.28 %below      [100.00%]

That is, the only bottleneck is the front-end ("retiring" is not a bottleneck, it represents the useful work). Clearly, such a loop is trivially handled by the front-end and is instead limited by the backend's ability to chew through all the uops generated by the div operation. Toplev might get this really wrong because (1) it may be that the uops delivered by the microcode sequencer aren't counted in the frontend_retired.latency... counters, so that every div operation causes that event to count all the subsequent instructions (even though the CPU was busy during that period - there was no real stall), or (2) the microcode sequencer might deliver all its uops essentially "up front", slamming ~36 uops into the IDQ, at which point it doesn't deliver any more until the div is finished, or something like that.

Still, we can look at the lower levels of toplev for hints:

The main difference toplev calls out between region 1 and regions 2 and 3 is the increased penalty of ms_switches for the latter two regions (since they incur 3 per iteration vs 2 for the legacy path). Internally, toplev estimates a 2-cycle penalty in the frontend for such switches. Of course, whether these penalties actually slow anything down depends in a complex way on the instruction queue and other factors. As mentioned above, a simple loop with div doesn't show any difference between the DSB and MITE paths, but a loop with additional instructions does. So it could be that the extra switch bubble is absorbed in simpler loops (where the backend processing of all the uops generated by the div is the main factor), but once you add some other work to the loop, the switches become a factor, at least during the transition period between the div and non-div work.

So I guess my conclusion is that the way the div instruction interacts with the rest of the frontend uop flow, and with backend execution, isn't completely well understood. We know it involves a flood of uops, delivered both from the MITE/DSB (it seems like 4 uops per div) and from the microcode sequencer (it seems like ~32 uops per div, although that changes with different input values to the div op) - but we don't know what those uops are (we can see their port distribution though). All of that makes the behavior fairly opaque, but I think it is probably down either to the MS switches bottlenecking the front-end, or to slight differences in the uop delivery flow that result in different scheduling decisions which end up favoring the MITE ordering.


1 Of course, most of the uops are not delivered from the legacy decoder or DSB at all, but by the microcode sequencer (ms). So we loosely talk about instructions delivered, not uops.

2 Note that the x axis here is "offset bytes from 32B alignment". That is, 0 means the top of the loop (label .L37) is aligned to a 32B boundary, and 5 means the loop starts five bytes past a 32B boundary (using nop for padding), and so on. So my padding bytes and offset are the same. The OP used a different meaning for offset, if I understand it correctly: his 1 byte of padding resulted in a 0 offset. So you would subtract 1 from the OP's padding values to get my offset values.

3 In fact, the branch prediction rate for a typical test with prime=1000000000000037 was ~99.999997%, reflecting only 3 mispredicted branches in the entire run (likely on the first pass through the loop, and the last iteration).

4 UPC, i.e., uops per cycle - a measure closely related to IPC for similar programs, and one that is a bit more precise when we are looking in detail at uop flows. In this case, we already know the uop counts are the same for all variations of alignment, so UPC and IPC will be directly proportional.

Gantz answered 10/10, 2016 at 8:37 Comment(16)
Glorious, definitive answer.Anna
@IwillnotexistIdonotexist - heh, have a re-read if you have a moment because I just added a lot more details (I got tired writing the original post and posted it unfinished). In particular, there is strong evidence for the difference between region 1 and 2 being because 1 goes to the legacy decoder, and (newly added) the difference between 2 and 3 being due to the breakdown of uops into the DSB. All in all, we're only talking about a ~1 cycle difference between each region, so it only takes a small change to explain it.Gantz
It's also nice you have some SKL hardware to reproduce OP's issue. My only modern machine is my HSW, and I wasn't able to reproduce OP with my libpfc; Everything took about the same time. My lack of hardware also explains why I coded up pfc.ko to only support PME architecture version 3 - because that's what Haswell supports, and I could, in theory, royally KP a machine if I misprogram the MSRs using code for a different PME arch version.Anna
@IwillnotexistIdonotexist - btw you mentioned that you didn't find anything on your HSW, or more precisely that all runs were within 3% of each other. Note though that in my tests each level was right around 3-4% different from the others (about 3.3% from 1->2 and 4.0% from 2->3) - that's approximately 1 cycle/iteration. So perhaps you could see the difference but it hides in the noise (graphically though it's easy to see differences of only 3%). I had to de-noise my system to make this work: (1) turn off hyperthreading (2) disable frequency scaling and the intel_pstate driver (3) disable C-statesGantz
@IwillnotexistIdonotexist - does that mean that none of the non-architectural PMU events for SKL are supported in libpfc? That would prevent me from making good use of it, but I can certainly look at contributing a patch.Gantz
I should also add that it is not clear that I've reproduced the OP's issue. What I can say is that I've reproduced a scenario where a loop with div and some other instructions performs differently depending on alignment, but given the different hardware the reasons may be different than the OPs. I wouldn't be surprised, however, if one of the two reasons I kind of covered above was the cause (and possibly with opposite polarity - i.e., "busting the uop cache" may prove faster for this loop on SKL, but slower on IVB).Gantz
Hmm, I hadn't realized there was only a cycle difference to look at. I also hadn't disabled HT or TurboBoost, my thinking being that comparisons using the unhalted_core_cycle counter are agnostic to those issues. But disabling HT might have an effect indeed. AFAIK SKL uses PME architecture version 4. Out of fear that I might blow things up and not know it (can't test if I did), I have a check failing the module load. The list of PMU events in libpfc.c is also Haswell-centric, as I painstakingly constructed it by hand, referencing the SDM. Possibly SKL uses the same codes as HSW?Anna
I tested a lot of this stuff recently and disabling HT had a large and reproducible effect on the stability of my measurements using perf. This makes sense since the OS may occasionally schedule other threads on the other logical core, which can reduce resources available to your thread. It was the biggest help.Gantz
Disabling turbo (I used this script) and various power management features also seemed to help. It made a big difference to wall-clock and CPU times (which makes sense), but also some difference (I think) to unhalted cycle count. As you point out, that seems weird, since cycles should be more or less invariant to these things. Still, transitions may cause changed counts (e.g., if a pipeline is flushed), and certainly anything that accesses memory (or, in some cases, L3-L4) is affected since the clock speed ratio changes.Gantz
I added source so that anyone who wants to give it a spin, can do so.Gantz
In the 18->19 transition, did you really mean issue, rather than uop cache read? I made an edit to fix an edit error, and where you said "line" where you meant "way". But I got lost when you started talking about an "extra uop"; where does it come from? The paragraph after sounds more plausible; that delayed read of a uop from the DSB could be a factor. Hmm, when the MSROM activates, does it stream into the IDQ, or does the IDQ have to flush first so issue can bypass it? That might explain microcode being able to branch (without branch prediction) for stuff like rep movs startup.Nellienellir
Other than that, very nice answer. I was starting to wonder about crazy stuff, like maybe the REX prefix doesn't really count as the start of an instruction.Nellienellir
@PeterCordes I edited that part to clarify it, and changed it to "deliver" instead of issue. I was talking about the same thing in both paragraphs, although it was unclear. What I mean is that in the offset 18 case, the uop on the edge (the jb instruction) gets delivered from the DSB one cycle later, and that potentially changes scheduling decisions that can easily result in a 1 cycle increase in the critical path. For example, in the "later" case the jb issues to port 0, so another uop that would otherwise go to port 0 cannot (most plausibly one associated with the div).Gantz
Yeah, I just looked at the diff. Makes much more sense now, and sounds like a plausible theory. The issue stage might wait for 4 uops instead of just issuing what it has. That might still cause a resource conflict. But IDK if it would wait, since it's definitely capable of issuing less than 4 uops in a cycle. It does wait if there isn't room in the ROB (so you see a 4-0-4-0 pattern instead of 2-2-2-2 on code that bottlenecks in the OOO core at 2 uops per clock), but this is the other end of the spectrum, where the IDQ might be empty.Nellienellir
I'm pretty sure the IDQ would not wait for 4 uops to issue them in the less-than-full ROB case. Unlike the full-ROB case this would impact performance of some types of mixed front-end and back-end bound code. In the full-ROB case it doesn't seem to matter which approach is chosen, so you probably wouldn't be able to tell w/o the performance counters. By the way, what do you mean by the issue stage? The RAT? The RS?Gantz
Could what Intel said about the legacy decoder not being able to switch back to the DSB until the next branch explain why the ms switch events have a variable cost?Bistoury

I don't have a specific answer, just a few different hypotheses that I'm unable to test (lack of hardware). I thought I'd found something conclusive, but I had the alignment off by one (because the question counts padding from 0x5F, not from an aligned boundary). Anyway, hopefully it's still useful to post this to describe the factors that are probably at play here.

The question also doesn't specify the encoding of the branches (short (2B) or near (6B)). This leaves too many possibilities to look at and theorize about exactly which instruction crossing a 32B boundary or not is causing the issue.


I think it's either a matter of the loop fitting in the uop cache or not, or else it's a matter of alignment mattering for whether it decodes fast with the legacy decoders.


Obviously that asm loop could be improved a lot (e.g. by hoisting the floating-point out of it, not to mention using a different algorithm entirely), but that's not the question. We just want to know why alignment matters for this exact loop.

You might expect that a loop that bottlenecks on division wouldn't bottleneck on the front-end or be affected by alignment, because division is slow and the loop runs very few instructions per clock. That's true, but 64-bit DIV is micro-coded as 35-57 micro-ops (uops) on IvyBridge, so it turns out there can be front-end issues.
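
A rough way to see why (simple arithmetic with the uop counts above, plus the standard SnB-family 4-uops-per-clock issue width):

#include <cstdio>

int main()
{
    // The microcoded 64-bit DIV alone is 35-57 uops; at 4 uops per clock through the
    // issue stage, just delivering those uops costs on the order of 9-15 cycles per iteration.
    const int uop_counts[2] = {35, 57};
    for (int uops : uop_counts)
        std::printf("%2d uops / 4 per clock -> at least %2d front-end cycles per iteration\n",
                    uops, (uops + 3) / 4);
}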

The two main ways alignment could matter here are front-end effects (the uop cache and the legacy decoders) and branch prediction. I suspect this is a purely front-end issue, not branch prediction, since the code spends all its time in this loop, and isn't running other branches that might alias with the ones here.

Your Intel IvyBridge CPU is a die-shrink of SandyBridge. It has a few changes (like mov-elimination, and ERMSB), but the front-end is similar between SnB/IvB/Haswell. Agner Fog's microarch pdf has enough details to analyze what should happen when the CPU runs this code. See also David Kanter's SandyBridge writeup for a block diagram of the fetch/decode stages, but he splits the fetch/decode from the uop cache, microcode, and decoded-uop queue. At the end, there's a full block diagram of a whole core. His Haswell article has a block diagram including the whole front-end, up to the decoded-uop queue that feeds the issue stage. (IvyBridge, like Haswell, has a 56 uop queue / loopback buffer when not using Hyperthreading. Sandybridge statically partitions them into 2x28 uop queues even when HT is disabled.)

[Image: pipeline block diagram from David Kanter's SnB writeup]

Image copied from David Kanter's also-excellent Haswell write-up, where he includes the decoders and uop-cache in one diagram.

Let's look at how the uop cache will probably cache this loop, once things settle down. (i.e. assuming that the loop entry with a jmp to the middle of the loop doesn't have any serious long-term effect on how the loop sits in the uop cache).

According to Intel's optimization manual (2.3.2.2 Decoded ICache):

  • All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
  • A multi micro-op instruction cannot be split across Ways.
  • An instruction which turns on the MSROM consumes an entire Way. (i.e. any instruction that takes more than 4 uops (for the reg,reg form) is microcoded. For example, DPPD is not micro-coded (4 uops), but DPPS is (6 uops). DPPD with a memory operand that can't micro-fuse would be 5 total uops, but still wouldn't need to turn on the microcode sequencer (not tested).)
  • Up to two branches are allowed per Way.
  • A pair of macro-fused instructions is kept as one micro-op.

David Kanter's SnB writeup has some more great details about the uop cache.


Let's see how the actual code will go into the uop cache

# let's consider the case where this is 32B-aligned, so it runs in 0.41s
# i.e. this is at 0x402f60, instead of 0 like this objdump -Mintel -d output on a  .o
# branch displacements are all 00, and I forgot to put in dummy labels, so they're using the rel32 encoding not rel8.

0000000000000000 <.text>:
   0:   66 0f ef c0             pxor   xmm0,xmm0    # 1 uop
   4:   f2 48 0f 2a c1          cvtsi2sd xmm0,rcx   # 2 uops
   9:   66 0f 2e f0             ucomisd xmm6,xmm0   # 2 uops
   d:   0f 82 00 00 00 00       jb     0x13         # 1 uop  (end of one uop cache line of 6 uops)

  13:   31 d2                   xor    edx,edx      # 1 uop
  15:   48 89 d8                mov    rax,rbx      # 1 uop  (end of a uop cache line: next insn doesn't fit)

  18:   48 f7 f1                div    rcx          # microcoded: fills a whole uop cache line.  (And generates 35-57 uops)

  1b:   48 85 d2                test   rdx,rdx      ### PROBLEM!!  only 3 uop cache lines can map to the same 32-byte block of x86 instructions.
  # So the whole block has to be re-decoded by the legacy decoders every time, because it doesn't fit in the uop-cache
  1e:   0f 84 00 00 00 00       je     0x24         ## spans a 32B boundary, so I think it goes with TEST in the line that includes the first byte.  Should actually macro-fuse.
  24:   48 83 c1 01             add    rcx,0x1      # 1 uop 
  28:   79 d6                   jns    0x0          # 1 uop

So with 32B alignment for the start of the loop, it has to run from the legacy decoders, which is potentially slower than running from the uop cache. There could even be some overhead in switching from uop cache to legacy decoders.

@Iwill's testing (see comments on the question) reveals that any microcoded instruction prevents a loop from running from the loopback buffer. (LSD = Loop Stream Detector = loop buffer; physically the same structure as the IDQ (instruction decode queue). DSB = Decode Stream Buffer = the uop cache. MITE = legacy decoders.)

Busting the uop cache will hurt performance even if the loop is small enough to run from the LSD (up to 28 uops with hyperthreading enabled, or 56 without, on IvB and Haswell).

Intel's optimization manual (section 2.3.2.4) says the LSD requirements include

  • All micro-ops are also resident in the Decoded ICache.

So this explains why microcode doesn't qualify: in that case the uop cache only holds a pointer into the microcode, not the uops themselves. Also note that this means that busting the uop cache for any other reason (e.g. lots of single-byte NOP instructions) means a loop can't run from the LSD.


With the minimum padding to go fast, according to the OP's testing:

# branch displacements are still 32-bit, except the loop branch.
# This may not be accurate, since the question didn't give raw instruction dumps.
# the version with short jumps looks even more unlikely

0000000000000000 <loop_start-0x64>:
    ...
  5c:   00 00                   add    BYTE PTR [rax],al
  5e:   90                      nop
  5f:   90                      nop

  60:   90                      nop         # 4NOPs of padding is just enough to bust the uop cache before (instead of after) div, if they have to go in the uop cache.
          # But that makes little sense, because looking backward should be impossible (insn start ambiguity), and we jump into the loop so the NOPs don't even run once.
  61:   90                      nop
  62:   90                      nop
  63:   90                      nop

0000000000000064 <loop_start>:                   #uops #decode in cycle A..E
  64:   66 0f ef c0             pxor   xmm0,xmm0   #1   A
  68:   f2 48 0f 2a c1          cvtsi2sd xmm0,rcx  #2   B
  6d:   66 0f 2e f0             ucomisd xmm6,xmm0  #2   C (crosses 16B boundary)
  71:   0f 82 db 00 00 00       jb     152         #1   C

  77:   31 d2                   xor    edx,edx     #1   C
  79:   48 89 d8                mov    rax,rbx     #1   C

  7c:   48 f7 f1                div    rcx       #line  D

  # 64B boundary after the REX in next insn    
  7f:   48 85 d2                test   rdx,rdx     #1   E
  82:   74 06                   je     8a <loop_start+0x26>#1 E
  84:   48 83 c1 01             add    rcx,0x1     #1   E
  88:   79 da                   jns    64 <loop_start>#1 E

The REX prefix of test rdx,rdx is in the same block as the DIV, so this should bust the uop cache. One more byte of padding would put it into the next 32B block, which would make perfect sense. Perhaps the OP's results are wrong, or perhaps prefixes don't count, and it's the position of the opcode byte that matters. Or perhaps a macro-fused test+branch is pulled into the next block?

Macro-fusion does happen across the 64B L1I-cache line boundary, since it doesn't fall on the boundary between instructions.

Macro fusion does not happen if the first instruction ends on byte 63 of a cache line, and the second instruction is a conditional branch that starts at byte 0 of the next cache line. -- Intel's optimization manual, 2.3.2.1

Or maybe with a short encoding for one jump or the other, things are different?

Or maybe busting the uop cache has nothing to do with it, and that's fine as long as it decodes fast, which this alignment makes happen. This amount of padding just barely puts the end of UCOMISD into a new 16B block, so maybe that actually improves efficiency by letting it decode with the other instructions in the next aligned 16B block. However, I'm not sure that the 16B pre-decode (instruction-length finding) or 32B decode blocks have to be aligned.


I also wondered if the CPU ends up switching from uop cache to legacy decode frequently. That can be worse than running from legacy decode all the time.

Switching from the decoders to the uop cache or vice versa takes a cycle, according to Agner Fog's microarch guide. Intel says:

When micro-ops cannot be stored in the Decoded ICache due to these restrictions, they are delivered from the legacy decode pipeline. Once micro-ops are delivered from the legacy pipeline, fetching micro-ops from the Decoded ICache can resume only after the next branch micro-op. Frequent switches can incur a penalty.


The source that I assembled + disassembled:

.skip 0x5e
nop
# this is 0x5F
#nop  # OP needed 1B of padding to reach a 32B boundary

.skip 5, 0x90

.globl loop_start
loop_start:
.L37:
  pxor    %xmm0, %xmm0
  cvtsi2sdq   %rcx, %xmm0
  ucomisd %xmm0, %xmm6
  jb  .Loop_exit   // Exit the loop
.L20:
  xorl    %edx, %edx
  movq    %rbx, %rax
  divq    %rcx
  testq   %rdx, %rdx
  je  .Lnot_prime   // Failed divisibility test
  addq    $1, %rcx
  jns .L37

.skip 200  # comment this to make the jumps rel8 instead of rel32
.Lnot_prime:
.Loop_exit:
Nellienellir answered 8/10, 2016 at 4:35 Comment(16)
+1. I appreciate your determination to cite me. As for testing dppd with memory operands, you should be able to do this relatively easily now, even if it's not on IVB? My pfcdemo code in the repo has a good place for that, quickly modified. Meanwhile I'll read your reference material about the uop cache and its properties, since I know basically nothing about it.Anna
@IwillnotexistIdonotexist: My SnB system is bricked, I'm using a Core2Duo at the moment. (Surprising how non-terrible it is for running a web browser + emacs, although compiling is kinda slow).Nellienellir
FWIW, I don't think recent processors use a power-of-two function for mapping branch history. Most are using an unspecified hash of the IP, so collisions are not pathologically bad when code happens to have a specific alignment, but will still randomly occur.Gantz
There is a performance counter that tracks the legacy <-> DSB switching penalty. I think it's a 1 cycle penalty, but it applies to the front-end only, so it may not affect performance if the code isn't front end bound enough for it to matter.Gantz
@BeeOnRope: The whole point of this answer is that integer DIV is microcoded to so many uops that it is a frontend issue. Branch prediction seems extremely unlikely for a compact loop with two not-taken branches.Nellienellir
I agree branch prediction doesn't affect it. I mentioned it because the power-of-two behavior was mentioned above, and it will be relevant in some context, some day.Gantz
@PeterCordes - I added some details of what I found on Skylake below. In particular, the uop cache definitely seems to affect it: certain alignments push 1 uop into the next cache line (note, different than the next "way"), which presumably results in that uop showing up later in the IDQ and possibly ultimately slowing down the loop by one cycle. I also find a "busting the uop cache" effect as you discussed above, but its effect is the opposite of what you might expect: when the uop cache is "busted" and the code issues from MITE, we get the best performance!Gantz
@BeeOnRope: you mean the next 32B block? I follow Agner Fog's practice of using "uop cache line" to refer to a Way, by analogy with normal set-associative caches, where one way of a set can cache an entire block. However, you make a good point that "line" is potentially ambiguous, and doesn't fit perfectly.Nellienellir
Yes, this cache doesn't work like a normal cache, where a line lives in exactly one way, so some things that are synonyms in typical caches may not be here. I am using line to refer to the aligned 32B block of code in memory - this has the advantage of giving a distinct name to everything. So to clarify, I should probably say: each 32B block (which I call a line) in the machine code maps to exactly one set in the uop cache, and may occupy up to 3 ways within that set. I found a performance decrease when an instruction moved from one 32B block to the next, but not when it moved ways.Gantz
So my comment above was referring to the fact that it isn't clear if the DSB can issue instructions from multiple ways within the same set (all corresponding to the same 32B block of code) in one cycle. If it can, it may not be able to issue instructions from multiple different sets in the same cycle. I consider it unlikely it can do the latter, but I'm not sure about the former. Fairly easy to test actually...Gantz
@BeeOnRope: terminology nit-pick: "issue" has a specific technical meaning, and uops don't "issue" directly from the DSB. They're read from the DSB into the IDQ. Agner Fog says "It cannot load more than one μop cache line per clock cycle. This may be a bottleneck if many instructions use two entries each.". Also for SnB: "It can deliver a throughput of 4 (possibly fused) μops or the equivalent of 32 bytes of code per clock cycle." As I've said, I think reading the full 6 uops per clock is a new feature in Skylake, but IDK how reading only 4 of 6 wasn't a huge bottleneck in HSW.Nellienellir
@BeeOnRope: Hmm, he doesn't mention that increase for SKL. Maybe he just meant that the uop cache sustained throughput could keep up with the 4-wide issue width. I thought I'd read something about that from another source, but maybe I just read something specific that contradicted my wrong idea from Agner's guide.Nellienellir
Right, I should have said "deliver" or something. I think the meaning was clear regardless. The intel docs are explicit about the 6 uops bandwidth from DSB -> IDQ being a Skylake change, in section 2.1.1: "The DSB delivers 6 uops per cycle to the IDQ compared to 4 uops in previous generations."Gantz
@BeeOnRope: Yeah, I just found Tacit Murky's post on Agner's blog pointing out this and that the legacy decode pipeline can deliver 5 µops/cl to IDQ, 1 more than before. I don't know if there's any kind of buffering to avoid bottlenecks from a 4-2-4-2 pattern on pre-SKL if all Ways have 6 uops. (or 4-1-4-1 worst-case).Nellienellir
FWIW, Agner's guide is great, but I think the section for Skylake was pretty much added mechanically, running the per-instruction latency tests again and then copying most sections from the broadwell/haswell section. There have been a few changes noted elsewhere (including various SO posts, and Intel's docs) which are not reflected there.Gantz
@BeeOnRope: yes, I think Agner has mentioned that he was too busy with other work to do much. He still hasn't updated the SnB section about micro-fusion and un-lamination, even after I provided references to Intel's documentation of that behaviour.Nellienellir

From what I can see in your algorithm, there is certainly not much you can do to improve it.

The problem you are hitting is probably not so much the branch to an aligned position, although that can still help; your current problem is much more likely the pipeline mechanism.

When you write two instructions one after another such as:

  mov %eax, %ebx
  add $1, %ebx

In order to execute the second instruction, the first one has to be complete. For that reason compilers tend to mix instructions. Say you need to set %ecx to zero; you could do this:

  mov %eax, %ebx
  xor %ecx, %ecx
  add $1, %ebx

In this case, the mov and the xor can both be executed in parallel. This makes things go faster... The number of instructions that can be handled in parallel varies a great deal between processors (Xeons are generally better at that).

The branch adds another parameter where the best processors may start executing both sides of the branch (the true and the false...) simultaneously. But really most processors will make a guess and hope they are right.

Finally, it is obvious that converting the sqrt() result to an integer will make things a lot faster, since you will avoid all that nonsense with SSE2 code, which is definitely slower if used only for a conversion + compare when those two operations could be done with integers.

Now... you are probably still wondering why the alignment does not matter with the integers. The fact is that if your code fits in the L1 instruction cache, then the alignment is not important. If you lose the L1 cache, then it has to reload the code and that's where the alignment becomes quite important since on each loop it could otherwise be loading useless code (possibly 15 bytes of useless code...) and memory access is still dead slow.

Sambo answered 29/12, 2014 at 11:9 Comment(5)
if your code fits in the L1 instruction cache, then the alignment is not important. Sometimes true, but not here. A branch target in the last couple bytes of an aligned 16B block is slightly worse than one early in a 16B block, even when it's hot in L1 cache. Close to the end of a 32B boundary is bad even if it's hot in L0 uop cache (unless you're in a loop that fits in the loop buffer).Nellienellir
Also: the best processors may start executing both sides of the branch (the true and the false...) simultaneously. No microarchitectures that I'm aware of speculate down both sides of a branch. Yes it's a theoretically possible design, but nobody does that. I'm also not sure how the first half of the answer (about instruction-level parallelism) helps at all. (And no, Xeons don't have wider out-of-order cores, or more ILP in a single thread that isn't limited by cache misses. Xeons have more of the same cores as i7, but that's thread-level parallelism, not instruction-level.)Nellienellir
Reordering instructions as shown in this answer has next to no effect on an out-of-order processor if decoding is not a bottleneck. It can have a negative effect because, when reading a register that was updated too many instructions ago, the value has to be obtained from the register file, which was a bottleneck for many generations of Intel cores starting with the Pentium M. For details, search for “register file” in agner.org/optimize/microarchitecture.pdf . The rest of the answer is vague or plain wrong, as already pointed out.Bantam
@PascalCuoq, let me try to get that straight... "out-of-order is not a bottleneck" and "it can have a negative effect"... and so you are saying that the order of instructions is (1) not important and (2) important. Maybe you should make up your mind?Sambo
@PascalCuoq: Intel SnB-family doesn't have register-read stalls. SnB switched to a physical register file instead of storing operand values right in the ROB. P6-family CPUs (PPro / PII to Nehalem) do have register-read stalls when an issue group needs to read too many not-recently-written registers. Pentium M is when Intel went back to P6 after the Netburst/P4 misadventure (which also used a physical register file and didn't have ROB-read stalls), but the limitation dates back to PPro. TL:DR: Alexis: out-of-order execution can find whatever parallelism there is, regardless of order.Nellienellir

The performance difference can be explained by the different ways the instruction encoding mechanism "sees" the instructions. A CPU reads the instructions in chunks (16 bytes on Core 2, I believe) and tries to feed micro-ops to the different superscalar units. If the instructions sit on boundaries or are ordered unfavorably, the units in one core can starve quite easily.

Marion answered 22/9, 2016 at 23:0 Comment(7)
SnB-family CPUs (like the OP's IvyBridge CPU) have a loop buffer to recycle already-decoded uops in really short loops. See Agner Fog's microarch PDF. This answer is totally insufficient to explain anything. Just saying that "alignment can matter" doesn't add anything.Nellienellir
Yes I know that the LSD exists in Intel CPUs. On top of that the uop-cache is back from the Pentium 4 times... How to explain it if this isn't the cause and if icache misses aren't the cause too? If you know everything better then you can use VTune yourself. I probably can't reproduce the exact code because the compiler is an old version (which one :D ?) and the assembly dump is not complete (not my fault)... and you commented yourself that it doesn't fit into the LSD... I don't know what's going on with youMarion
I commented on your answer before noticing that the code probably doesn't fit in the LSD. I still think your answer is either too over-simplified or just plain wrong, and not useful. Instructions don't have to be ordered in any kind of pattern that matches the execution units.Nellienellir
I think it might be switching between the decoders and the uop cache here, if the uops for IDIV won't fit in the cache lines for the loop. The OP's asm loop is complete enough to microbenchmark in a stand alone .S file if you have similar hardware (but I don't, unfortunately). I hadn't realized that integer division could bottleneck on the frontend instead of the divide unit, but a sufficient answer to this is going to have to mention the uop cache, I think. The OP already knows that alignment matters.Nellienellir
Hm then I haven't explained this well enough... intel.com/content/dam/www/public/us/en/documents/manuals/… page 45... the out-of-order (OOO) engine just has 5 ports, and page 46 ... "An instruction fetch is a 16-byte aligned lookup through the ITLB and into the instruction cache"... further see "Instruction Decode" page 47 ... so if the instructions are on the next 16-byte "line" then it has to wait at least one cycle more... It's hard to prove this, but I am really eager to hear which other reason can have such an effectMarion
I've been working on an answer, but I realized that the OP did leave out too much. We can't tell which of those branches are rel8 and which are rel32, so there are 4 possibilities to look at for the alignment-point cutoffs. i.e. assemble it with a .skip 0x5e; nop to reach ...5F, then .skip 5, 0x90 to see the smallest padding that makes it fast. That's one short of making the instruction after DIV start in the next 32B block, which I think should allow the block containing the DIV to fit in 3 uop cache lines (if the JB is a 6-byte rel32). Anyway, 4x the search space sucks :(Nellienellir
I finally posted the answer I'd been procrastinating about. It's a bit more specific than this, even though it's still a guess...Nellienellir
