How are x86 uops scheduled, exactly?

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops1) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions.

For example, take the following toy loop2:

top:
lea eax, [ecx + 5]
popcnt eax, eax
add edi, eax
dec ecx
jnz top

This basically implements the loop (with the following correspondence: edi -> total, c -> ecx):

do {
  total += popcnt(c + 5);
} while (--c > 0);

I'm familiar with the process of optimizing any small loop by looking at the uop breakdown, dependency chain latencies, and so on. In the loop above the only loop-carried dependency chains are trivial 1-cycle ones: dec ecx and the add accumulating into edi. The first three instructions of the loop (lea, popcnt, add) otherwise form a dependency chain that starts fresh each iteration.

The final dec and jnz are macro-fused. So we have a total of 4 fused-domain uops, and no loop-carried dependency chain with a latency of more than 1 cycle. So based on those criteria, it seems that the loop can execute at 1 cycle/iteration.
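
Put as a quick back-of-the-envelope bound (assuming the usual 4-wide issue/rename width on Haswell), that reasoning is:

cycles/iteration >= max( fused-domain uops / issue width,
                         loop-carried dependency latency )
                  = max( 4 / 4, 1 )
                  = 1

Any per-port bottleneck can only raise this bound.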

However, we should look at the port pressure too:

  • The lea can execute on ports 1 and 5
  • The popcnt can execute on port 1
  • The add can execute on port 0, 1, 5 and 6
  • The predicted-taken jnz executes on port 6

So to get to 1 cycle/iteration, you pretty much need the following to happen:

  • The popcnt must execute on port 1 (the only port it can execute on)
  • The lea must execute on port 5 (and never on port 1)
  • The add must execute on port 0, and never on any of the other three ports it can execute on
  • The jnz can only execute on port 6 anyway

That's a lot of conditions! If instructions just got scheduled randomly, you could get a much worse throughput. For example, 75% of the add would go to port 1, 5 or 6, which would delay the popcnt, lea or jnz by one cycle. Similarly for the lea which can go to 2 ports, one shared with popcnt.
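
To get a feel for how much that matters, here is a deliberately crude toy model in C of "random" port assignment for this loop. It is not how the hardware works (it charges a full extra cycle per port collision and ignores how delayed uops overlap with later iterations), but it illustrates why the conditions above can't be left to chance:

#include <stdio.h>
#include <stdlib.h>

/* Toy model only - NOT how the hardware works. Each "iteration" the
 * four uops independently pick a random allowed port, and we charge
 * one extra cycle per port collision. */
int main(void) {
    /* Allowed ports (Haswell, from the list above):
       lea: p1/p5, popcnt: p1, add: p0/p1/p5/p6, taken jnz: p6. */
    const int lea_ports[] = {1, 5};
    const int add_ports[] = {0, 1, 5, 6};
    const long iters = 10000000;
    long cycles = 0;
    srand(1);

    for (long i = 0; i < iters; i++) {
        int used[8] = {0};
        int picks[4];
        int conflicts = 0;
        picks[0] = lea_ports[rand() % 2];  /* lea    */
        picks[1] = 1;                      /* popcnt */
        picks[2] = add_ports[rand() % 4];  /* add    */
        picks[3] = 6;                      /* jnz    */
        for (int u = 0; u < 4; u++) {
            if (used[picks[u]])
                conflicts++;               /* port already taken this cycle */
            used[picks[u]] = 1;
        }
        cycles += 1 + conflicts;
    }
    printf("toy model: %.2f cycles/iteration\n", (double)cycles / iters);
    return 0;
}

In this crude model the result comes out at roughly twice the ideal 1 cycle/iteration; the real penalty would be smaller, since delayed uops overlap with later iterations, but the point stands.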

IACA on the other hand reports a result very close to optimal, 1.05 cycles per iteration:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - l.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.05 Cycles       Throughput Bottleneck: FrontEnd, Port0, Port1, Port5

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  | 0.9  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea eax, ptr [ecx+0x5]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | popcnt eax, eax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add edi, eax
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | dec ecx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff4

It pretty much reflects the necessary "ideal" scheduling I mentioned above, with a small deviation: it shows the add stealing port 5 from the lea on 1 out of 10 cycles. It also doesn't know that the fused branch is going to go to port 6 since it is predicted taken, so it puts most of the uops for the branch on port 0, and most of the uops for the add on port 6, rather than the other way around.

It's not clear if the extra 0.05 cycles that IACA reports over the optimal is the result of some deep, accurate analysis, or a less insightful consequence of the algorithm it uses, e.g., analyzing the loop over a fixed number of cycles, or just a bug or whatever. The same goes for the 0.1 fraction of a uop that it thinks will go to the non-ideal port. It is also not clear if one explains the other - I would think that mis-assigning a port 1 out of 10 times would cause a cycle count of 11/10 = 1.1 cycles per iteration, but I haven't worked out the actual downstream results - maybe the impact is less on average. Or it could just be rounding (0.05 == 0.1 to 1 decimal place).

So how do modern x86 CPUs actually schedule? In particular:

  1. When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
  2. When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
  3. If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Results on Skylake

Let's measure some actual results on Skylake to check which answers explain the experimental evidence, so here are some real-world measured results (from perf) on my Skylake box. Confusingly, I'm going to switch to using imul for my "only executes on one port" instruction, since it has many variants, including 3-argument versions that allow you to use different registers for the source(s) and destination. This is very handy when trying to construct dependency chains. It also avoids the whole "incorrect dependency on destination" problem that popcnt has.

Independent Instructions

Let's start by looking at the simple (?) case where the instructions are relatively independent - without any dependency chains other than trivial ones like the loop counter.

Here's a 4 uop loop (only 3 executed uops) with mild pressure. All instructions are independent (don't share any sources or destinations). The add could in principle steal the p1 needed by the imul or p6 needed by the dec:

Example 1

instr   p0 p1 p5 p6 
xor       (elim)
imul        X
add      X  X  X  X
dec               X

top:
    xor  r9, r9
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

The result is that this executes with perfect scheduling at 1.00 cycles/iteration:

   560,709,974      uops_dispatched_port_port_0                                     ( +-  0.38% )
 1,000,026,608      uops_dispatched_port_port_1                                     ( +-  0.00% )
   439,324,609      uops_dispatched_port_port_5                                     ( +-  0.49% )
 1,000,041,224      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,110      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,281,902      cycles:u                                                        ( +-  0.00% )

As expected, p1 and p6 are fully utilized by the imul and dec/jnz respectively, and then the add issues roughly half and half between the remaining available ports. Note roughly - the actual ratio is 56% and 44%, and this ratio is pretty stable across runs (note the +- 0.49% variation). If I adjust the loop alignment, the split changes (53/46 for 32B alignment, more like 57/42 for 32B+4 alignment). Now, if we change nothing except the position of imul in the loop:

Example 2

top:
    imul rax, rbx, 5
    xor  r9, r9
    add  r8, rdx
    dec esi
    jnz top

Then suddenly the p0/p5 split is exactly 50%/50%, with 0.00% variation:

   500,025,758      uops_dispatched_port_port_0                                     ( +-  0.00% )
 1,000,044,901      uops_dispatched_port_port_1                                     ( +-  0.00% )
   500,038,070      uops_dispatched_port_port_5                                     ( +-  0.00% )
 1,000,066,733      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,439      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,439,396      cycles:u                                                        ( +-  0.01% )

So that's already interesting, but it's hard to tell what's going on. Perhaps the exact behavior depends on the initial conditions at loop entry and is sensitive to ordering within the loop (e.g. because counters are used). This example shows that something more than "random" or "stupid" scheduling is going on. In particular, if you just eliminate the imul instruction from the loop, you get the following:

Example 3

   330,214,329      uops_dispatched_port_port_0                                     ( +-  0.40% )
   314,012,342      uops_dispatched_port_port_1                                     ( +-  1.77% )
   355,817,739      uops_dispatched_port_port_5                                     ( +-  1.21% )
 1,000,034,653      uops_dispatched_port_port_6                                     ( +-  0.00% )
 4,000,000,160      instructions:u            #    4.00  insns per cycle          ( +-  0.00% )
 1,000,235,522      cycles:u                                                      ( +-  0.00% )

Here, the add is now roughly evenly distributed among p0, p1 and p5 - so the presence of the imul did affect the add scheduling: it wasn't just a consequence of some "avoid port 1" rule.

Note here that total port pressure is only 3 uops/cycle since the xor is a zeroing idiom and is eliminated in the renamer. Let's try with the max pressure of 4 uops. I expect whatever mechanism kicked in above to be able to perfectly schedule this also. We only change xor r9, r9 to xor r9, r10, so it is no longer a zeroing idiom. We get the following results:

Example 4

top:
    xor  r9, r10
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

       488,245,238      uops_dispatched_port_port_0                                     ( +-  0.50% )
     1,241,118,197      uops_dispatched_port_port_1                                     ( +-  0.03% )
     1,027,345,180      uops_dispatched_port_port_5                                     ( +-  0.28% )
     1,243,743,312      uops_dispatched_port_port_6                                     ( +-  0.04% )
     5,000,000,711      instructions:u            #    2.66  insns per cycle            ( +-  0.00% )
     1,880,606,080      cycles:u                                                        ( +-  0.08% )

Oops! Rather than evenly scheduling everything across p0156, the scheduler has underused p0 (it only executes about 0.49 uops per iteration), and hence p1 and p6 are oversubscribed: on top of their required imul and dec/jnz uops, they are also picking up some of the add and xor uops. This behavior, I think, is consistent with a counter-based pressure indicator as hayesti indicated in their answer, and with uops being assigned to a port at issue time, not at execution time as both hayesti and Peter Cordes mentioned. That behavior3 makes the "execute the oldest ready uops" rule much less effective. If uops weren't bound to execution ports at issue, but rather at execution, then this "oldest" rule would fix the problem above after one iteration - once one imul and one dec/jnz got held back for a single iteration, they would always be older than the competing xor and add instructions, so they would always get scheduled first. One thing I am learning, though, is that if ports are assigned at issue time, this rule doesn't help, because the ports are pre-determined at issue time. I guess it still helps a bit in favoring instructions that are part of long dependency chains (since these will tend to fall behind), but it's not the cure-all I thought it was.

That also seems to explain the results above: p0 gets assigned more pressure than it really has because the dec/jnz combo can in theory execute on p06. In fact, because the branch is predicted taken, it only ever goes to p6, but perhaps that information can't feed into the pressure-balancing algorithm, so the counters tend to see equal pressure on p016, meaning that the add and the xor get spread around differently than is optimal.

We could probably test this by unrolling the loop a bit so the jnz is less of a factor...


1 OK, it is properly written μops, but that kills search-ability, and to actually type the "μ" character I usually end up copy-pasting it from a webpage.

2 I had originally used imul instead of popcnt in the loop, but, unbelievably, IACA doesn't support it!

3 Please note that I'm not suggesting this is a poor design or anything - there are probably very good hardware reasons why the scheduler cannot easily make all its decisions at execution time.

Tonguetied answered 18/11, 2016 at 15:58 Comment(15)
This is quite a broad topic, and likely varies, perhaps even significantly, between processor families and maybe even different steppings in the same family; might even be dependent on the level of microcode loaded into the CPU...Liking
What IPC do you get when you run this code? That should help you determine if the IACA report is accurate.Olen
Ok ok, I'll admit it. I f***ing love your x86 questions and upvote most of them, because it's exactly the sort of stuff I don't dare to ask.Bradley
@IwillnotexistIdonotexist - you should dare to ask! What's the worst that can happen - no one willing or able to answer the question, I guess?Tonguetied
@Liking - it's possible that it could vary a ton across every stepping or even across microcode versions as you suggest, but if that were true it would be significantly different from most other architectural features across a wide variety of CPUs. In general, there are only a few relatively well-understood ways to implement each major component, and although these change from time to time, it's not unusual to see consistent behavior for 10 years or more. That said, if you have evidence of rapidly changing scheduler behavior it would also be very interesting!Tonguetied
Have you looked at stackoverflow.com/questions/34309707/…? Nobody could figure out why Broadwell doesn't achieve the theoretical max uop throughput, but HSW and SKL do.Translator
@PeterCordes - yes, I have, but not recently. Do you think the good-bad-then-good-again behavior is related to scheduler changes?Tonguetied
@BeeOnRope: yes, and/or latency changes that give the scheduler different inputs to work with. (Except that IIRC the perf dip on BDW is with FMA which didn't change latency, and MULPS (which did drop to 3c) still ran fast even though the results have to hang around longer). I haven't tried to prove it, or find other cases with different instructions for a similar mix of ports that fail to run at max throughput on BDW, since I don't have a BDW CPU myself, but some kind of badly-handled corner case for the scheduler seems the most plausible explanation.Translator
@GabrielSouthern - I finally got around to adding some results using perf. They definitely show that at least in some cases IACA is way optimistic. Even in fairly simple-to-schedule cases (no dep chains) there is significant mis-scheduling, which nearly doubles the runtime.Tonguetied
I think it's pretty obvious that optimal scheduling is far from being practical on a HW (especially when you get last-minute bypasses and wakeups that you want to include in your scheduling arbitration). Best you can hope for are some static heuristics that hopefully work most of the time. Awesome question, by the way.Luannaluanne
Great question. "In fact because the branch is predicted taken it only ever goes to p6" So predicted-taken branches can only go to p6, not p0? What about predicted not-taken branches? I thought p0 and p6 both can handle any branch instruction. IACA seems to indicate that. Typo in i -> ecx, should be c -> ecx. I was doing a similar analysis to a slightly different piece of code and I'm not able to figure out how or why it's being scheduled in a suboptimal way, and then I stumbled upon this question.Platinic
@HadiBrais Typo fixed, thanks. Yeah, according to Agner's table, predicted taken branches (and presumably things like unconditional jumps) only go to p6, not p0. Same for call. p0 is only able to handle conditional jumps that are (predicted) not taken. I added a test to uarch-bench just now to illustrate this. Run with --timer=libpfc --test-name=misc/*tight* --extra-events=UOPS_DISPATCHED.PORT_0,UOPS_DISPATCHED.PORT_1,UOPS_DISPATCHED.PORT_5,UOPS_DISPATCHED.PORT_6 ...Tonguetied
... it shows that only in the not taken case (tight_loop3) are any uops handled by p0. Perhaps p6 is the only one capable of maintaining a renamed version of rip or otherwise updating whatever structure handles changes to rip other than the normal flow of instructions.Tonguetied
re:"The first three instructions of the loop (lea, imul, add)". imul -> popcnt.Sidky
@Sidky - thanks, fixed!Tonguetied

Your questions are tough for a couple of reasons:

  1. The answer depends a lot on the microarchitecture of the processor which can vary significantly from generation to generation.
  2. These are fine-grained details which Intel doesn't generally release to the public.

Nevertheless, I'll try to answer...

When multiple uops are ready in the reservation station, in what order are they scheduled to ports?

It should be the oldest [see below], but your mileage may vary. The P6 microarchitecture (used in the Pentium Pro, 2 & 3) used a reservation station with five schedulers (one per execution port); the schedulers used a priority pointer as a place to start scanning for ready uops to dispatch. It was only pseudo FIFO so it's entirely possible that the oldest ready instruction was not always scheduled. In the NetBurst microarchitecture (used in Pentium 4), they ditched the unified reservation station and used two uop queues instead. These were proper collapsing priority queues so the schedulers were guaranteed to get the oldest ready instruction. The Core architecture returned to a reservation station and I would hazard an educated guess that they used the collapsing priority queue, but I can't find a source to confirm this. If somebody has a definitive answer, I'm all ears.

When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?

That's tricky to know. The best I could find is a patent from Intel describing such a mechanism. Essentially, they keep a counter for each port that has redundant functional units. When the uops leave the front end to the reservation station, they are assigned a dispatch port. If it has to decide between multiple redundant execution units, the counters are used to distribute the work evenly. Counters are incremented and decremented as uops enter and leave the reservation station respectively.

Naturally this is just a heuristic and does not guarantee a perfect conflict-free schedule, however, I could still see it working with your toy example. The instructions which can only go to one port would ultimately influence the scheduler to dispatch the "less restricted" uops to other ports.
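
To make the idea concrete, here is a minimal sketch in C of such a counter heuristic applied to the uops of the loop in the question. The increment/decrement points, the tie-breaking and the one-dispatch-per-port-per-cycle model are my guesses, not details from the patent:

#include <stdio.h>

#define NPORTS 8

static int  pending[NPORTS];  /* bound to the port but not yet executed */
static long bound[NPORTS];    /* total uops ever bound to each port */

/* Bind one uop to the eligible port with the fewest pending uops,
 * incrementing that port's counter (the decrement happens when the
 * uop leaves the reservation station). */
static void bind_uop(const int *eligible, int n) {
    int best = eligible[0];
    for (int i = 1; i < n; i++)
        if (pending[eligible[i]] < pending[best])
            best = eligible[i];
    pending[best]++;
    bound[best]++;
}

int main(void) {
    /* Port maps for the loop in the question (Haswell):
       lea p1/p5, popcnt p1, add p0/p1/p5/p6, taken jnz p6. */
    const int lea_p[] = {1, 5};
    const int pop_p[] = {1};
    const int add_p[] = {0, 1, 5, 6};
    const int jnz_p[] = {6};

    for (int cycle = 0; cycle < 1000; cycle++) {
        /* Issue one iteration's worth of uops, in program order. */
        bind_uop(lea_p, 2);
        bind_uop(pop_p, 1);
        bind_uop(add_p, 4);
        bind_uop(jnz_p, 1);
        /* Each port executes at most one pending uop per cycle. */
        for (int p = 0; p < NPORTS; p++)
            if (pending[p] > 0)
                pending[p]--;
    }
    for (int p = 0; p < NPORTS; p++)
        if (bound[p])
            printf("port %d: %ld uops bound\n", p, bound[p]);
    return 0;
}

In this sketch, the backlog the single-port popcnt leaves on p1 steers the lea to p5 and the add to p0 after the first cycle, so the per-port totals come out nearly equal - the "restricted uops influence the less restricted ones" effect described above.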

In any case, the presence of a patent doesn't necessarily imply that the idea was adopted (although that said, one of the authors was also a tech lead of the Pentium 4, so who knows?)

If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Since uops are inserted into the reservation station in order, oldest here does indeed refer to time it entered the reservation station, i.e. oldest in program order.

By the way, I would take those IACA results with a grain of salt as they may not reflect the nuances of the real hardware. On Haswell, there is a hardware counter called uops_executed_port that can tell you how many uops in your thread were issued to ports 0-7. Maybe you could leverage these to get a better understanding of your program?

Discussant answered 19/11, 2016 at 0:56 Comment(18)
Awesome. Since multiple uops enter the RS in one cycle (which I think is 4 fused domain uops today) it isn't clear to me if all uops in one batch have the same age, or could ties be broken by program order in that case? That's what I was getting at with program order above: whether swapping two instructions always swaps their apparent age in the RS, even if delivered to the RS in the same cycle.Tonguetied
I used my own software to examine the counter values. For 1000000000 iterations, I got: p0156 = 496468755, 1180628985, 1111275830, 1211773267 uops executed.Bradley
That particular issue is important for optimization because it differentiates between the case where you just care about relative order to affect scheduling, or whether you need to follow rules like "a uop X needs to appear at least 4 uops before another uop Y to guarantee that it is older".Tonguetied
@Tonguetied From my Haswell counters using libpfc, I get that uops are issued in a pattern 4-4-4-0-4-4-4-0... Cycle count is 33% higher than the minimum possible. Coupled with the fact that the only instruction with a latency >1 in the loop is popcnt (lat 3), I'm inclined to believe that popcnt is always stalled for 1 cycle by an add or lea mis-issued to p1 in the same cycle, but is always given priority access to p1 on the next cycle, since that's the only port that can take popcnt.Bradley
@Tonguetied All uops are generated in a deterministic order from x86 instructions and all x86 instructions are ordered with respect to each other. No matter how wide your pipeline is, there is always an "oldest" uop among a batch entering the reservation stations and this is determined by program order.Discussant
Thanks - that's what I was getting at: it wasn't clear if oldest was just measured by the cycle entering the RS, which seems easiest, or if there was additional tie-breaking even among those instructions based on program order.Tonguetied
@IwillnotexistIdonotexist - what counters are you using to determine the issue pattern?Tonguetied
@Tonguetied uops_issued.any<1, >=1, >=2, >=3, >=4 gives you a cumulative histogram that shows the uops being issued 4-at-a-time on almost all cycles that any uops are issued.Bradley
@BeeOnRope: other facts I've read from at least somewhat trustworthy sources (Agner Fog, or Intel's manual, I forget which comes from where): 1) uops are allocated to ports at issue time. 2) scheduling attempts to avoid write-back conflicts when uops with different latencies are involved. (This is why SnB-family standardizes uop latencies to 1, 3 and 5 cycles, and groups them by port that way. Except that SKL has some 4c uops, but still no 2c uops).Translator
@PeterCordes - your comment that ports are assigned at issue time turned on a light bulb for me. In fact, hayesti mentions it above too, but I kind of glossed over it. That makes the execution-time scheduling a lot less flexible and explains why you'd need some type of counter thing to assign the ports reasonably at issue. I added a bunch of results from Skylake which are interesting, and perhaps point towards a counter-like implementation (but I haven't run the exact numbers yet). In particular, it definitely mis-schedules even pretty simple (?) cases.Tonguetied
@IwillnotexistIdonotexist - what code were you measuring? What does libpfc say about the port usage? I got some numbers using perf, but I want to try with libpfc too. I added my results above.Tonguetied
@Tonguetied Exactly your question's snippet, except headed by mov edi, 0; mov rcx, 1000000000;. The port readings for p0156 are in the first comment I made below this post; p2347 gave negligible readings, as one would expect.Bradley
Accepting this one, since it points out that patent which helped a lot, and the port-assignment-at-issue fact, which changed my understanding. I also added some of my findings in another answer, many of which align with the implementation suggested by hayesti - but still don't explain the poor behavior of Example 4.Tonguetied
@Tonguetied Please correct me if I misunderstood something, but the following is not quite clear. The renamer does the binding of uops to dispatch ports, but the actual dispatching is done by the RS. Consider load uops, which are typically dispatched either to p2 or p3. Since the RS waits until all of a uop's operands are ready, I think it might happen that the port assignment done by the renamer does not distribute uops uniformly across the available ports. I ran some benchmarks of code that does lots of loading and found that p2 and p3 accepted almost equal numbers of uops.Lifeless
13 493 383 038 uops_dispatched_port.port_2 and 13 494 860 751 uops_dispatched_port.port_3. So choosing the port number is done by the RS anyway. IOM/2.5.3.1 on the Renamer says that "In this process, the out-of-order core carries out the following steps: ... Binds the micro-op to an appropriate dispatch port."Lifeless
@Lifeless - I do not really understand your question/concern. Are you saying the renamer does or does not bind to the port? Your numbers seem to support either approach.Tonguetied
@Tonguetied Yes, I initially thought that Scheduler (aka RS) chooses ports to dispatch uops to. But actually it is done by the Renamer. Scheduler queues uops and waits till their operands are ready.Lifeless
@St correct, I had the same misunderstanding at first.Tonguetied

Here's what I found on Skylake, coming at it from the angle that uops are assigned to ports at issue time (i.e., when they are issued to the RS), not at dispatch time (i.e., at the moment they are sent to execute). Before this, I had understood that the port decision was made at dispatch time.

I did a variety of tests which tried to isolate sequences of add operations that can go to p0156 and imul operations which go only to port 1. A typical test goes something like this:

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... many more mov instructions

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... many more mov instructions

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

Basically there is a long lead-in of mov eax, [edi] instructions, which only issue on p23 and hence don't clog up the ports used by the payload instructions (I could have also used nop instructions, but the test would be a bit different since nops don't issue to the RS). This is followed by the "payload" section, here composed of 4 imul and 12 add, and then a lead-out section of more dummy mov instructions.

First, let's take a look at the patent that hayesti linked above, which describes the basic idea: counters for each port track the total number of uops assigned to the port, and these are used to load-balance the port assignments. Take a look at this table included in the patent description:

[Table from the patent: which of p0 or p1 each uop in a 3-wide issue group is assigned, as a function of the difference between the port counters]

This table is used to pick between p0 and p1 for the 3 uops in an issue group for the 3-wide architecture discussed in the patent. Note that the behavior depends on the position of the uop in the group, and that there are 4 rules1 based on the count, which spread the uops around in a logical way. In particular, the count needs to be at +/- 2 or greater before the whole group gets assigned the under-used port.

Let's see if we can observe this "position in the issue group matters" behavior on Skylake. We use a payload of a single add like:

add edx, 1     ; position 0
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... and we slide it around inside the 4-instruction chunk like:

mov eax, [edi]
add edx, 1      ; position 1
mov eax, [edi]
mov eax, [edi]

... and so on, testing all four positions within the issue group2. This shows the following, when the RS is full (of mov instructions) but with no port pressure on any of the relevant ports:

  • The first add instructions go to p5 or p6, with the port selected usually alternating as the instruction is slid down (i.e., add instructions in even positions go to p5 and in odd positions go to p6).
  • The second add instruction also goes to p56 - whichever of the two the first one didn't go to.
  • After that further add instructions start to be balanced around p0156, with p5 and p6 usually ahead but with things fairly even overall (i.e., the gap between p56 and the other two ports doesn't grow).

Next, I took a look at what happens if we load up p1 with imul operations and then throw in a bunch of add operations:

imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

The results show that the scheduler handles this well - all of the imul uops got scheduled to p1 (as expected), and then none of the subsequent add instructions went to p1, being spread around p056 instead. So here the scheduling is working well.

Of course, when the situation is reversed, and the series of imul comes after the adds, p1 is loaded up with its share of adds before the imuls hit. That's a result of port assignment happening in order at issue time, since there is no mechanism to "look ahead" and see the imul when scheduling the adds.

Overall the scheduler looks to do a good job in these test cases.

It doesn't explain what happens in smaller, tighter loops like the following:

top:
    sub r9, 1
    sub r10, 1
    imul ebx, edx, 1
    dec ecx
    jnz top

Just like Example 4 in my question, this loop only fills p0 on ~30% of cycles, despite there being two sub instructions that should be able to go to p0 on every cycle. p1 and p6 are oversubscribed, each executing 1.24 uops per iteration (1 is ideal). I wasn't able to pin down what separates the examples that work well at the top of this answer from the bad loops - but there are still many ideas to try.

I did note that examples without instruction latency differences don't seem to suffer from this issue. For example, here's another 4-uop loop with "complex" port pressure:

top:
    sub r8, 1
    ror r11, 2
    bswap eax
    dec ecx
    jnz top

The uop map is as follows:

instr   p0 p1 p5 p6 
sub      X  X  X  X
ror      X        X
bswap       X  X   
dec/jnz           X

So for things to work out, the dec/jnz must take p6, the ror must take p0, and the sub must always go to p1 or p5, sharing those with the bswap. They do:

Performance counter stats for './sched-test2' (2 runs):

   999,709,142      uops_dispatched_port_port_0                                     ( +-  0.00% )
   999,675,324      uops_dispatched_port_port_1                                     ( +-  0.00% )
   999,772,564      uops_dispatched_port_port_5                                     ( +-  0.00% )
 1,000,991,020      uops_dispatched_port_port_6                                     ( +-  0.00% )
 4,000,238,468      uops_issued_any                                               ( +-  0.00% )
 5,000,000,117      instructions:u            #    4.99  insns per cycle          ( +-  0.00% )
 1,001,268,722      cycles:u                                                      ( +-  0.00% )

So it seems that the issue may be related to instruction latencies (certainly, there are other differences between the examples). That's something that came up in this similar question.


1The table has 5 rules, but the rules for counts of 0 and -1 are identical.

2Of course, I can't be sure where the issue groups start and end, but regardless we test four different positions as we slide down four instructions (but the labels could be wrong). I'm also not sure the issue group max size is 4 - earlier parts of the pipeline are wider - but I believe it is and some testing seemed to show it was (loops with a multiple of 4 uops showed consistent scheduling behavior). In any case, the conclusions hold with different scheduling group sizes.

Tonguetied answered 23/11, 2016 at 1:9 Comment(12)
Why [edi]? Did you want to pad the code with address-size prefixes for some reason?Translator
@PeterCordes - no specific reason. They are dummy movs. It could just as well be [rdi]. The location pointed to is in .rodata and is loaded in the low 32-bits.Tonguetied
I added another little test at the end that seems to show this issue usually does not occur if all instructions have the same latency (1).Tonguetied
Regarding issue time scheduling. Does a microfused load + alu (that stays laminated) schedule the load and alu port at the same time or is the alu not scheduled until later? If later is there any idea about when?Sidky
@Sidky - they act as independent uops in the out of order part of the core, as if you had separate load and ALU instructions. Only up to rename (and at retire) do they act as one.Tonguetied
@Tonguetied so that means that for an L1 cache hit load + alu the alu will be scheduled on whatever port it chose for at least 5 cycles?Sidky
@Sidky - yes, at least for a specific definition of "scheduled on whatever port it chose". I wouldn't really put it that way since it makes it sound like it is using the port or something. It's not wrong though: any op that is scheduled (enters the scheduler) and which depends on a load that is scheduled at the same time will necessarily wait in the scheduler for at least 4 or 5 cycles, since there is no way all its operands will be ready before then. This doesn't interfere with other ops that want to use that same port in the meantime, of course, except if you reach the scheduler capacity, in which case you can argue that the op (in coordination with all the other waiting ops) is possibly blocking some other ops that might otherwise be ready to run.Tonguetied
To be clear though, this behavior is still exactly like two separate instructions that load and use the loaded value: there is no disadvantage with respect to scheduling for laminated micro-fused ops. There is a potential disadvantage for unlaminated micro-fused ops vs 2 instructions, however: they must be part of the same allocation/rename group, which can cause bubbles.Tonguetied
@Tonguetied a meaningful difference I just came across: microfused uops only count as 1 in the renamer on SKL (and newer I'd assume), so if code is bottlenecked by renaming it can be a true advantage.Sidky
@Noah, yes there's absolutely an advantage to micro-fused ops: they count as only one against the 4 (5 on ICL and later) limit on all modern Intel (and AMD). Above I was saying there is no counterbalancing disadvantage in most cases, when it comes to scheduling. Definitely use fused uops if you can.Tonguetied
@Tonguetied I am wondering how instruction latency affects the way that x86 uops are scheduled. In section 2.12 of one recent paper, it says that a uop will go to the port with the fewest non-executed uops. Is the term 'non-executed' related to latency? But this theory can't explain example 4 in the question description. According to the theory, imul's latency is higher -> p1 gets more non-executed uops -> p1's active cycles should be lower, which contradicts the experimental data.Mediate

Section 2.12 of Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures[^1] explains how ports are assigned, though it fails to explain example 4 in the question description. I also failed to figure out what role latency plays in the port assignment.

Previous work [19, 25, 26] has identified the ports that the µops of individual instructions can use. For µops that can use more than one port, it was, however, previously unknown how the actual port is chosen by the processor. We reverse-engineered the port assignment algorithm using microbenchmarks. In the following, we describe our findings for CPUs with eight ports; such CPUs are currently most widely used.

The ports are assigned when the µops are issued by the renamer to the scheduler. In a single cycle, up to four µops can be issued. In the following, we will call the position of a µop within a cycle an issue slot; e.g., the oldest instruction issued in a cycle would occupy issue slot 0.

The port that a µop is assigned depends on its issue slot and on the ports assigned to µops that have not been executed and were issued in a previous cycle.

In the following, we will only consider µops that can use more than one port. For a given µop m, let $P_{min}$ be the port to which the fewest non-executed µops have been assigned to from among the ports that m can use. Let $P_{min'}$ be the port with the second smallest usage so far. If there is a tie among the ports with the smallest (or second smallest, respectively) usage, let $P_{min}$ (or $P_{min'}$) be the port with the highest port number from among these ports (the reason for this choice is probably that ports with higher numbers are connected to fewer functional units). If the difference between $P_{min}$ and $P_{min'}$ is greater or equal to 3, we set $P_{min'}$ to $P_{min}$.

The µops in issue slots 0 and 2 are assigned to port $P_{min}$. The µops in issue slots 1 and 3 are assigned to port $P_{min'}$.

A special case is µops that can use port 2 and port 3. These ports are used by µops that handle memory accesses, and both ports are connected to the same types of functional units. For such µops, the port assignment algorithm alternates between port 2 and port 3.
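
As a concrete illustration, here is a small C sketch of that rule as I read it; the tie-breaking details and the scenario in main() are my own assumptions, not taken from the paper:

#include <stdio.h>

#define NPORTS 8

/* counts[p] = non-executed uops assigned to port p in *previous*
 * cycles; uops issued in the same cycle don't see each other, which
 * is why the issue slot matters. */
static int counts[NPORTS];
static int mem_toggle;  /* alternation state for the p2/p3 special case */

static int assign_port(const int *eligible, int n, int slot) {
    /* Special case: load uops that can use only p2 and p3 alternate. */
    if (n == 2 && ((eligible[0] == 2 && eligible[1] == 3) ||
                   (eligible[0] == 3 && eligible[1] == 2))) {
        mem_toggle ^= 1;
        return mem_toggle ? 2 : 3;
    }

    /* P_min: eligible port with the fewest non-executed uops;
       on a tie, the highest-numbered port wins. */
    int pmin = eligible[0];
    for (int i = 1; i < n; i++) {
        int p = eligible[i];
        if (counts[p] < counts[pmin] ||
            (counts[p] == counts[pmin] && p > pmin))
            pmin = p;
    }

    /* P_min': the second-smallest usage, same tie-break. */
    int pmin2 = -1;
    for (int i = 0; i < n; i++) {
        int p = eligible[i];
        if (p == pmin)
            continue;
        if (pmin2 < 0 || counts[p] < counts[pmin2] ||
            (counts[p] == counts[pmin2] && p > pmin2))
            pmin2 = p;
    }
    if (pmin2 < 0 || counts[pmin2] - counts[pmin] >= 3)
        pmin2 = pmin;  /* large imbalance: send the whole group to P_min */

    /* Slots 0 and 2 get P_min; slots 1 and 3 get P_min'. */
    return (slot == 0 || slot == 2) ? pmin : pmin2;
}

int main(void) {
    /* Hypothetical scenario: port 1 already has 3 unexecuted uops
       (say, a backlog of imuls), and a group of four p0156-capable
       adds is issued in one cycle. */
    const int alu[] = {0, 1, 5, 6};
    int chosen[4];
    counts[1] = 3;
    for (int slot = 0; slot < 4; slot++)
        chosen[slot] = assign_port(alu, 4, slot);
    for (int slot = 0; slot < 4; slot++) {
        counts[chosen[slot]]++;  /* counters update only after the group */
        printf("slot %d -> port %d\n", slot, chosen[slot]);
    }
    return 0;
}

With port 1 pre-loaded like that, the four adds come out as p6, p5, p6, p5 in this sketch: p1 is avoided and the group is split across the two least-loaded ports according to issue slot, matching the slot-dependent behavior described in the quote.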

I tried to find out whether $P_{min}$ and $P_{min'}$ are shared between threads (Hyper-Threading), namely whether one thread can affect the port assignment of another one in the same core.

I simply split the code used in BeeOnRope's answer into two threads:

thread1:
.loop:
    imul rax, rbx, 5
    jmp .loop

thread2:
    mov esi,1000000000
    .top:
    bswap eax
    dec  esi
    jnz  .top
    jmp thread2

Here the bswap instruction can execute on ports 1 and 5, and imul r64, r64, i only on port 1. If the counters were shared between threads, you would expect to see bswap executed on port 5 and imul executed on port 1.

The results of the experiment are recorded below. (Ports P0 and P5 on thread 1 and P0 on thread 2 picked up a small amount of non-user activity, but this doesn't hinder the conclusion.) The data shows that the bswap instruction of thread 2 alternates between ports P1 and P5 rather than giving up P1.

port    thread 1 active cycles    thread 2 active cycles
P0      63,088,967                68,022,708
P1      180,219,013,832           95,742,764,738
P5      63,994,200                96,291,124,547
P6      180,330,835,515           192,048,880,421
total   180,998,504,099           192,774,759,297

Therefore, the counters are not shared between threads.

This conclusion does not conflict with SMotherSpectre[^2], which uses time as the side channel. (For example, thread 2's uops may wait longer to use port 1 when thread 1 is also using port 1.)

Executing instructions that occupy a specific port and measuring their timing enables inference about other instructions executing on the same port. We first choose two instructions, each scheduled on a single, distinct, execution port. One thread runs and times a long sequence of single µop instructions scheduled on port a, while simultaneously the other thread runs a long sequence of instructions scheduled on port b. We expect that, if a = b, contention occurs and the measured execution time is longer compared to the a ≠ b case.


[^1]: Abel, Andreas, and Jan Reineke. "Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures." arXiv preprint arXiv:2107.14210 (2021).

[^2]: Bhattacharyya, Atri, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. “SMoTherSpectre: Exploiting Speculative Execution through Port Contention.” Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, November 6, 2019, 785–800. https://doi.org/10.1145/3319535.3363194.

Mediate answered 18/12, 2021 at 2:52 Comment(4)
What microarch did you test on? Intel SnB-family competitively shares the RS, I think, so I'm a bit surprised that scheduling doesn't account for the other logical core's uops. But I guess from a fairness / starvation PoV, you don't want one thread to hog one port? But no, if a uop needs a port, it will get scheduled there regardless. Maybe there are per-thread counters for other reasons, like as part of the machinery to let lfence drain only this thread's uops from the RS/ROB?Translator
Anyway, interesting result, good experiment. But I wonder if unrolling some would matter: half your uops are taken branches which can only exec on port 6, so that's the bottleneck, not port 1.Translator
@PeterCordes I am using an i7-10700 and google says it is Comet Lake. Every time I look up Agner's instruction tables, I wonder which part I should use - Skylake, Coffee Lake, or Cannon Lake? I think both options (shared or not) have their advantages and disadvantages, especially from a security perspective. Besides, I re-ran the test with imul and bswap repeating 39 times in a single loop. The conclusion hasn't changed.Mediate
The actual cores in Comet Lake and Coffee Lake are all the same Skylake microarchitecture, just on different manufacturing processes. (en.wikichip.org/wiki/intel/microarchitectures/comet_lake). (With maybe some changes to (partially?) fix or mitigate things like Meltdown or Spectre). The memory controllers and iGPU are better, and there are more cores in the top-end models, but inside the core AFAIK there's no meaningful change. IDK why Agner did another test for Coffee Lake separate from Skylake.Translator
