In general, see How are x86 uops scheduled, exactly? - the hardware can't see (or isn't looking at) future issue groups when allocating uops from this group to the ports with the shortest queues. And maybe it's "greedy" even within the current issue group? But none of that explains why none of the four uops get allocated to port 0.
In practice, "cold" startup would only happen after an I-cache miss, interrupt return, or similar thing that drained the RS (scheduler). So it's hard to measure whether real HW behaves the same way uiCA simulates, although perhaps we could get somewhere with `lfence` to drain the ROB+RS and look at performance counters for uops that come right after it.
Maybe worth checking the uiCA source code to see if there are any comments describing the logic for that case: https://github.com/andreas-abel/uiCA/blob/9cbbe931247f45f756738cf35800b5e8dff7bbb0/uiCA.py#L1127
`addNewUops` looks like the function that adds an issue-group of new uops to the back end. It iterates the uops of an issue-group in sequence:
```python
def addNewUops(self, clock, newUops):
    self.portUsageAtStartOfCycle[clock] = dict(self.portUsage)
    portCombinationsInCurCycle = {}
    for issueSlot, fusedUop in enumerate(newUops):
        for uop in fusedUop.getUnfusedUops():
            ...
```
I think Haswell would be using the `elif len(allPorts[self.uArchConfig.name]) == 8:` block, since it has ports 0..7, unlike Ice Lake which has 10 execution ports, or Sandy Bridge which only has p0..p5.
```python
elif len(allPorts[self.uArchConfig.name]) == 8:
    applicablePortUsages = [(p, u) for p, u in self.portUsageAtStartOfCycle[clock].items()
                            if p in uop.prop.possiblePorts]
    minPort, minPortUsage = min(applicablePortUsages,
                                key=lambda x: (x[1], -int(x[0], 16)))  # port with minimum usage so far
    if uop.prop.possiblePorts == ['2', '3']:
        port = self.nextP23Port
        self.nextP23Port = '3' if (self.nextP23Port == '2') else '2'
    elif issueSlot % 2 == 0:
        port = minPort
    else:
        remApplicablePortUsages = [(p, u) for p, u in applicablePortUsages if p != minPort]
        min2Port, min2PortUsage = min(remApplicablePortUsages,
                                      key=lambda x: (x[1], -int(x[0], 16)))  # port with second smallest usage so far
        if min2PortUsage >= minPortUsage + 3:
            port = minPort
        else:
            port = min2Port
```
Note the `self.portUsageAtStartOfCycle` name; it implies that uops don't interact "horizontally" within an issue group for scheduling decisions, at least in their model of it.
There is some logic keyed on odd vs. even issue-slot numbers within the group of 4 uops that can be issued every clock cycle, the `elif issueSlot % 2 == 0:` branch: uops in slots 0 and 2 just use the least-occupied port, but uops in slots 1 and 3 use the second-least-occupied port, unless its queue is at least 3 uops longer than the shortest, in which case they fall back to the least-occupied port. (I don't know Python well, and I only skimmed the code, but the naming and comments are helpful, and this seems like a plausible algorithm for real CPUs to use.)
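As a sanity check on my reading of that branch, here's a standalone re-implementation of the selection logic (my own sketch, not uiCA code; I've left out the p2/p3 alternation special case):

```python
def pick_port(issue_slot, usage, possible_ports):
    """Pick a port the way the uiCA snippet above appears to.

    usage maps port name ('0'..'7') to its pending-uop count, snapshotted
    at the start of the cycle. Ties break toward the highest-numbered
    port; that's what the -int(p, 16) sort key does.
    """
    applicable = [(p, u) for p, u in usage.items() if p in possible_ports]
    min_port, min_usage = min(applicable, key=lambda x: (x[1], -int(x[0], 16)))
    if issue_slot % 2 == 0:          # slots 0 and 2: least-occupied port
        return min_port
    rest = [(p, u) for p, u in applicable if p != min_port]
    if not rest:                     # only one eligible port anyway
        return min_port
    min2_port, min2_usage = min(rest, key=lambda x: (x[1], -int(x[0], 16)))
    # Slots 1 and 3: second-least-occupied port, unless its queue is
    # already at least 3 uops deeper than the shortest one.
    return min_port if min2_usage >= min_usage + 3 else min2_port
```

For example, with queue depths `{'0': 5, '1': 2, '5': 2}` and all three ports possible, slot 0 picks p5 (the p1/p5 tie breaks toward the higher port number) and slot 1 picks p1.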
I assume they (Andreas Abel and Jan Reineke) tried to confirm this model with performance experiments as much as possible. https://uops.info/uiCA.html links to their paper, *uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures*, where they write:
> While it is known from prior work [8] which ports a µop may be assigned to, it has been unknown how the renamer chooses the actual ports at runtime.
>
> We have reverse engineered the port assignment algorithm. In the following, we describe our findings for CPUs with eight ports; such CPUs are currently most widely used. These CPUs can issue up to four µops per cycle.
>
> In the following, we will call the position of a µop within a cycle an issue slot; e.g., the oldest instruction issued in a cycle would occupy issue slot 0.
>
> The port that a µop is assigned depends on its issue slot and on the ports assigned to µops that have not been executed and were issued in a previous cycle. In the following, we will only consider µops that can use more than one port.
>
> For a given µop m, let P_min be the port to which the fewest non-executed µops have been assigned from among the ports that m can use. Let P_min' be the port with the second smallest usage so far. If there is a tie among the ports with the smallest (or second smallest, respectively) usage, let P_min (or P_min') be the port with the highest port number from among these ports (the reason for this choice is probably that ports with higher numbers are connected to fewer functional units). If the difference in the usage between P_min and P_min' is greater or equal to 3, we set P_min' to P_min. The µops in issue slots 0 and 2 are assigned to port P_min. The µops in issue slots 1 and 3 are assigned to port P_min'.
>
> A special case are µops that can use port 2 and port 3. These ports are used by µops that handle memory accesses, and both ports are connected to the same types of functional units. For such µops, the port assignment algorithm alternates between port 2 and port 3.
So I should have just read their paper; that's exactly what I figured out from looking at their code. :P Unfortunately, their paper doesn't seem to describe the microbenchmark experiments they used to validate this model against real hardware behaviour.
As for why the HW would do that, consider a case where four `vpaddq` uops would be issued in the same cycle. Without looking across uops in the same group (which would be combinatorially expensive in logic gates and power to do in parallel), having half the uops pick the second-emptiest port will tend to distribute them evenly between p1 and p5 even when both queues are equal.
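A toy simulation of that scenario (my own, not from uiCA) shows the effect: with both queues equal at the start of the cycle, and the usage snapshot frozen for the whole group the way `portUsageAtStartOfCycle` implies, slot parity alone produces an even split:

```python
# Four 2-port uops (p1 or p5) issued in one cycle, equal queue depths.
snapshot = {'1': 0, '5': 0}   # pending uops per port at start of cycle
assigned = []
for slot in range(4):
    # Rank by (usage, highest port number wins ties), as in the uiCA code.
    ranked = sorted(snapshot.items(), key=lambda x: (x[1], -int(x[0], 16)))
    (min_p, min_u), (min2_p, min2_u) = ranked[0], ranked[1]
    port = min_p if (slot % 2 == 0 or min2_u >= min_u + 3) else min2_p
    assigned.append(port)
    # Deliberately NOT updating snapshot within the cycle: uops in the
    # same issue group don't see each other's assignments.

print(assigned)   # prints ['5', '1', '5', '1'], an even 2/2 split
```

If every slot had greedily taken the emptiest port from the same frozen snapshot, all four uops would have piled onto p5; the parity rule gets balance without any cross-uop comparison hardware.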
Their paper does report good agreement between real hardware and uiCA, much better than other analyzers like IACA or LLVM-MCA: typically within about 1% mean absolute percentage error, vs. others often in the 10 to 20% range.