x86_64 haswell instruction scheduled on already used port instead of unused one
Given this code:

lp:
    vpaddq     ymm0, ymm1
    vpaddq     ymm3, ymm4
    add        rbx, rax
    add        rcx, rax

    vpaddq     ymm1, ymm2
    vpaddq     ymm4, ymm5

    sub     rax, 0xA
    jge     lp

according to https://uica.uops.info, this is the scheduling (Haswell microarchitecture):

https://uica.uops.info/tmp/24c42127930f4694af851391006c40f2_trace.html

The theoretical maximum for this loop is 2 cycles/iteration, and it does happen eventually. But at the beginning, the trace shows that add rcx, rax is scheduled on port 5, a port that is still being used by vpaddq ymm0, ymm1, instead of the unused port 0. Why is that?

Friesland answered 17/6, 2023 at 9:57

In general, see How are x86 uops scheduled, exactly? - the hardware can't see (or isn't looking at) future issue groups when allocating this group's uops to the ports with the shortest queues. And maybe it's "greedy" even within the current issue group? But none of that explains why none of the four uops gets allocated to port 0.

In practice, "cold" startup would only happen after an I-cache miss, interrupt return, or similar thing that drained the RS (scheduler). So it's hard to measure whether real HW behaves the same way uiCA simulates, although perhaps we could get somewhere with lfence to drain the ROB+RS and look at performance counters for uops that come right after it.

Maybe worth checking the uiCA source code to see if there are any comments describing the logic for that case: https://github.com/andreas-abel/uiCA/blob/9cbbe931247f45f756738cf35800b5e8dff7bbb0/uiCA.py#L1127

addNewUops looks like the function that adds an issue-group of new uops to the back end. It iterates the uops of an issue-group in sequence:

   def addNewUops(self, clock, newUops):
      self.portUsageAtStartOfCycle[clock] = dict(self.portUsage)
      portCombinationsInCurCycle = {}
      for issueSlot, fusedUop in enumerate(newUops):
         for uop in fusedUop.getUnfusedUops():
            ...

I think Haswell would be using the elif len(allPorts[self.uArchConfig.name]) == 8: block, since it has ports 0..7, unlike Ice Lake which has 10 execution ports or Sandybridge which only has p0..p5.

            elif len(allPorts[self.uArchConfig.name]) == 8:
               applicablePortUsages = [(p,u) for p, u in self.portUsageAtStartOfCycle[clock].items() if p in uop.prop.possiblePorts]
               minPort, minPortUsage = min(applicablePortUsages, key=lambda x: (x[1], -int(x[0], 16))) # port with minimum usage so far

               if uop.prop.possiblePorts == ['2', '3']:
                  port = self.nextP23Port
                  self.nextP23Port = '3' if (self.nextP23Port == '2') else '2'
               elif issueSlot % 2 == 0:
                  port = minPort
               else:
                  remApplicablePortUsages = [(p, u) for p, u in applicablePortUsages if p != minPort]
                  min2Port, min2PortUsage = min(remApplicablePortUsages, key=lambda x: (x[1], -int(x[0], 16))) # port with second smallest usage so far
                  if min2PortUsage >= minPortUsage + 3:
                     port = minPort
                  else:
                     port = min2Port

Note the self.portUsageAtStartOfCycle; the naming implies that uops don't interact "horizontally" within an issue group for scheduling decisions, at least in their model of it.

There seems to be some logic about odd vs. even issue-slot numbers within the group of 4 uops that can be issued every clock cycle: the elif issueSlot % 2 == 0: branch uses the least-occupied port for uops 0 and 2, but the second-least-occupied port for uops 1 and 3, unless the queue lengths differ by 3 or more uops, in which case the odd slots fall back to the emptiest port too. (I don't know Python well, and I only skimmed the code, but the naming and comments are helpful, and this seems like a plausible algorithm for real CPUs to use.)
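To see how that heuristic produces the schedule in the question's trace, here is a condensed stand-alone sketch of the even/odd-slot logic, applied to the first issue group of the loop with all queues empty. (The function name assign_ports is mine, and I've dropped uiCA's port-2/3 alternation special case, which doesn't apply here.)

```python
# Sketch of uiCA's port-assignment heuristic for 8-port CPUs,
# simplified from addNewUops; not uiCA's actual API.
def assign_ports(uop_ports, usage):
    """uop_ports: list of possible-port lists, one per issue slot (oldest first).
    usage: dict port -> not-yet-executed uop count, frozen at cycle start."""
    assigned = []
    for slot, ports in enumerate(uop_ports):
        # Rank by usage; ties break toward the highest-numbered port.
        ranked = sorted(ports, key=lambda p: (usage[p], -int(p, 16)))
        if slot % 2 == 0:
            assigned.append(ranked[0])           # even slot: emptiest port
        else:
            p_min, p_min2 = ranked[0], ranked[1]
            # Odd slot: runner-up port, unless it's >= 3 uops busier.
            assigned.append(p_min if usage[p_min2] >= usage[p_min] + 3 else p_min2)
    return assigned

# First issue group of the loop on a cold scheduler (all queues empty):
usage = {p: 0 for p in '01234567'}
group = [['1', '5'],            # vpaddq ymm0, ymm1
         ['1', '5'],            # vpaddq ymm3, ymm4
         ['0', '1', '5', '6'],  # add rbx, rax
         ['0', '1', '5', '6']]  # add rcx, rax
print(assign_ports(group, usage))  # ['5', '1', '6', '5']
```

With everything tied at zero, slot 0 breaks the tie to the highest-numbered port (p5), slot 1 takes the runner-up (p1), slot 2 takes p6, and slot 3's runner-up is p5. So add rcx, rax lands on the already-used port 5 while port 0 stays empty, which is exactly what the trace shows.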


I assume they (Andreas Abel and Jan Reineke) tried to confirm this model with performance experiments as much as possible. https://uops.info/uiCA.html links to their paper uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures where they write:

While it is known from prior work [8] which ports a µop may be assigned to, it has been unknown how the renamer chooses the actual ports at runtime.

We have reverse engineered the port assignment algorithm. In the following, we describe our findings for CPUs with eight ports; such CPUs are currently most widely used. These CPUs can issue up to four µops per cycle.

In the following, we will call the position of a µop within a cycle an issue slot; e.g., the oldest instruction issued in a cycle would occupy issue slot 0.

The port that a µop is assigned depends on its issue slot and on the ports assigned to µops that have not been executed and were issued in a previous cycle. In the following, we will only consider µops that can use more than one port.

For a given µop 𝑚, let 𝑃𝑚𝑖𝑛 be the port to which the fewest nonexecuted µops have been assigned to from among the ports that 𝑚 can use. Let 𝑃𝑚𝑖𝑛′ be the port with the second smallest usage so far. If there is a tie among the ports with the smallest (or second smallest, respectively) usage, let 𝑃𝑚𝑖𝑛 (or 𝑃𝑚𝑖𝑛′) be the port with the highest port number from among these ports (the reason for this choice is probably that ports with higher numbers are connected to fewer functional units). If the difference in the usage between 𝑃𝑚𝑖𝑛 and 𝑃𝑚𝑖𝑛′ is greater or equal to 3, we set 𝑃𝑚𝑖𝑛′ to 𝑃𝑚𝑖𝑛. The µops in issue slots 0 and 2 are assigned to port 𝑃𝑚𝑖𝑛. The µops in issue slots 1 and 3 are assigned to port 𝑃𝑚𝑖𝑛′.

A special case are µops that can use port 2 and port 3. These ports are used by µops that handle memory accesses, and both ports are connected to the same types of functional units. For such µops, the port assignment algorithm alternates between port 2 and port 3.

So I should have just read their paper; that's exactly what I figured out from looking at their code. :P Unfortunately their paper doesn't seem to describe what microbenchmark experiments they used to validate this model against real hardware behaviour.

As for why the HW would do that, consider a case where four vpaddq uops are issued in the same cycle. Without looking across uops within the same group (which would be combinatorially expensive in logic gates and power to do in parallel), having half the uops pick the 2nd-emptiest port will tend to distribute them evenly between p1 and p5 even when both queues are equally full.
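For instance, four hypothetical p1/p5-only uops in one issue group, with both queues equal and usage frozen at the start of the cycle (ignoring the >= 3 fallback, which can't trigger here), split evenly with no cross-uop comparison at all:

```python
# Hypothetical issue group: four uops that can each go to p1 or p5,
# both queues equally empty; usage is frozen at the start of the cycle.
usage = {'1': 0, '5': 0}
ports = []
for slot in range(4):
    # Tie on usage -> the highest-numbered port ranks as "emptiest".
    ranked = sorted(['1', '5'], key=lambda p: (usage[p], -int(p, 16)))
    ports.append(ranked[0] if slot % 2 == 0 else ranked[1])
print(ports)  # ['5', '1', '5', '1'] -- an even split, per-uop decisions only
```

Each uop's decision depends only on its slot parity and the cycle-start queue depths, yet the group still ends up balanced 2/2 across the two ports.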

Their paper does report good agreement between real hardware and uiCA, much better than other analyzers like IACA or LLVM-MCA: typically around 1% mean error, vs. the others often in the 10 to 20% range.

Miserable answered 17/6, 2023 at 23:11

© 2022 - 2024 — McMap. All rights reserved.