Latency bounds and throughput bounds for processors for operations that must occur in sequence

Asked 26/7, 2020 at 2:1 Answered 22/7, 2021 at 12:31

Solved performance cpu-architecture micro-optimization

My textbook (Computer Systems: A programmer's perspective) states that a latency bound is encountered when a series of operations must be performed in strict sequence, while a throughput bound characterizes the raw computing capacity of the processor's functional units.

Questions 5.5 and 5.6 of the textbook introduce these two possible loop structures for polynomial computation

double result = a[0];
double xpwr = x;
for (int i = 1; i <= degree; i++) {
    result += a[i] * xpwr;
    xpwr = x * xpwr;
}

and

double result = a[degree];
double xpwr = x;
for (int i = degree - 1; i >= 0; i--) {
    result = a[i] + x * result;
}

The loops are assumed to be executed on a microarchitecture with the following execution units:

One floating-point adder. It's latency of 3 cycles and is fully pipelined.
Two floating-pointer multipliers. The latency of each is 5 cycles and both are fully pipelined.
Four integer ALUs, each has a latency of one cycle.

The latency bounds for floating point multiplication and addition given for this problem are 5.0 and 3.0 respectively. According to the answer key, the overall loop latency for the first loop is 5.0 cycles per element and the second is 8.0 cycles per element. I don't understand why the first loop isn't also 8.0.

It seems as though a[i] must be multiplied by xpwr before adding a[i] to this product to produce the next value of result. Could somebody please explain this to me?

Kleptomania answered 26/7, 2020 at 2:1 Comment(4)

@Hadi: ok, I figured the = instead of += was probably a transcription error, thanks for cleaning up this question. Also, this is apparently Intel Haswell specifically, or a simplified model of it. Everything matches that. (Although the answer is still the same for Sandybridge; we don't have any need to start more than 1 mulsd per clock, or more than 3 ALU ops per cycle.) – Peruzzi 28/7, 2020 at 2:12

@PeterCordes In the first loop, two mulsd can be dispatched in the same cycle. I mentioned the ALUs to clearly show that the loop trip count additions (which form their own dep chain) are not on the critical path. BTW, refer to Exercise 10 of mathe.tu-freiberg.de/~ernst/Lehre/HPC/tutorials/… if you want to see the full question. It's almost an identical copy from the book. – Funda 28/7, 2020 at 2:16

An earlier duplicate: Determine the critical path in the data flow same book problem, same question of whether it's actually 8 cycles per iteration. IMO, my answer here is clearer, with a correct diagram. – Peruzzi 13/5, 2022 at 12:5

Clock cycles required for multiplication and addition operations has images of the full text of the book's practice problem – Peruzzi 8/10, 2022 at 19:17

Terminology: you can say a loop is "bound on latency", but when analyzing that bottleneck I wouldn't say "the latency bound" or "bounds". That sounds wrong to me. The thing you're measuring (or calculating via static performance analysis) is the latency or length of the critical path, or the length of the loop-carried dependency chain. (The critical path is the latency chain that's longest, and is the one responsible for the CPU stalling if it's longer than out-of-order exec can hide.)

The key point is that out-of-order execution only cares about true dependencies, and allows operations to execute in parallel otherwise. The CPU can start a new multiply and a new add every cycle. (Assuming from the latency numbers that it's Intel Sandybridge or Haswell, or similar¹. i.e. assume the FPU is fully pipelined.)

The two loop-carried dependency chains in the first loop are xpwr *= x and result += stuff, each reading from their own previous iteration, but not coupled to each other in a way that would add their latencies. They can overlap.

result += a[i] * xpwr has 3 inputs:

result from the previous iteration.
a[i] is assumed to be ready as early as you want it.
xpwr is from the previous iteration. And more importantly, that previous iteration could start computing xpwr right away, not waiting for the previous result.

So you have 2 dependency chains, one reading from the other. The addition dep chain has lower latency per step so it just ends up waiting for the multiplication dep chain.

The a[i] * xpwr "forks off" from the xpwr dep chain, independent of it after reading its input. Each computation of that expression is independent of the previous computation of it. It depends on a later xpwr, so the start of each a[i] * xpwr is limited only by the xpwr dependency chain having that value ready.

The loads and integer loop overhead (getting load addresses ready) can be executed far ahead by out-of-order execution.

Graph of the dependency pattern across iterations

mulsd (x86-64 multiply Scalar Double) is for the xpwr updates, addsd for the result updates. The a[i] * xpwr; multiplication is not shown because it's independent work each iteration. It skews the adds later by a fixed amount, but we're assuming there's enough FP throughput to get that done without resource conflicts for the critical path.

mulsd   addsd         # first iteration result += stuff
 |   \    |           # first iteration xpwr   *= x can start at the same time
 v    \   v
mulsd   addsd
 |   \    |
 v    \   v
mulsd   addsd
 |   \    |
 v    \   v
mulsd   addsd

(Last xpwr mulsd result is unused, compiler could peel the final iteration and optimize it away.)

Second loop, Horner's method - fast with FMA, else 8 cycles

result = a[i] + x * result; is fewer math operations, but there we do have a loop-carried critical path of 8 cycles. The next mulsd can't start until the addsd is also done. This is bad for long chains (high-degree polynomials), although out-of-order exec can often hide the latency for small degrees, like 5 or 6 coefficients.

This really shines when you have FMA available: each iteration becomes one Fused Multiply-Add. On real Haswell CPUs, FMA has the same 5-cycle latency as an FP multiply, so the loop runs at one iteration per 5 cycles, with less extra latency at the tail to get the final result.

Real world high-performance code often uses this strategy for short polynomials when optimizing for machines with FMA, for high throughput evaluating the same polynomial for many different inputs. (So the instruction-level parallelism is across separate evaluations of the polynomial, rather than trying to create any within one by spending more operations.)

Footnote 1: similarity to real hardware

Having two FP multipliers with those latencies matches Haswell with SSE2/AVX math, although in actual Haswell the FP adder is on the same port as a multiplier, so you can't start all 3 operations in one cycle. FP execution units share ports with the 4 integer ALUs, too, but Sandybridge/Haswell's front-end is only 4 uops wide so it's typically not much of a bottleneck.

See David Kanter's deep dive on Haswell with nice diagrams, and https://agner.org/optimize/, and other performance resources in the x86 tag wiki)

On Broadwell, the next generation after Haswell, FP mul latency improved to 3 cycles. Still 2/clock throughput, with FP add/sub still being 3c and FMA 5c. So the loop with more operations would be faster there, even compared to an FMA implementation of Horner's method.

On Skylake, all FP operations are the same 4-cycle latency, all running on the two FMA units with 2/clock throughput for FP add/sub as well. More recently, Alder Lake re-introduced lower latency FP addition (3 cycles vs. 4 for mul/fma but still keeping the 2/clock throughput), since real-world code often does stuff like naively summing an array, and strict FP semantics don't let compilers optimize it to multiple accumulators. So on recent x86, there's nothing to gain by avoiding FMA if you would still have a multiply dep chain, not just add.

Also related:

What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? more general analysis needs to consider other possible bottlenecks: front-end uop throughput, and back-end contention for execution units. Dep chains, especially loop-carried ones, are the 3rd major possible bottleneck (other than stalls like cache misses and branch misses.)
How many CPU cycles are needed for each assembly instruction? - another basic intro to these concepts
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths - ability of out-of-order exec to overlap dep chains is limited when they're too long.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) - FP dep chains in parallel with unrolling when possible, e.g. a dot product.

In this case, for large-degree polynomials you could be doing stuff like x2 = x*x / x4 = x2 * x2, and maybe generate x^(2n) and x^(2n+1) in parallel. As in Estrin's scheme used in Agner Fog's vectorclass library for short polynomials. I found that when short polynomials are part of independent work across loop iterations (e.g. as part of log(arr[i])), straight Horner's Rule was better throughput, as out-of-order exec was able to hide the latency of a chain of 5 or 6 FMAs interleaved with other work.

Peruzzi answered 26/7, 2020 at 2:49 Comment(0)

For 5.5 , there are 3 parallel lines:

xpwr = x * xpwr; which has 5 cycle latency. Occurs in iteration #i
a[i] * xpwr; which has 5 cycle latency, but isn't on the critical path of a loop-carried dependency. Occurs in iteration #i.
result + (2); which has 3 cycle latency. Occurs in iteration #i+1 but for the iter #i result

Update

Based on clarifications by @peter

To understand 'loop-carried' dep: means current loop(i) depends on other loops(say , i-1): so we can see xpwr = x * xpwr; as xpwr(i) = x * xpwr(i-1); . consequently form a path( but not known yet if it's critical path)
a[i] * xpwr , could be seen as a byproduct of step 1. So called "forked off from step 1". which also takes 5 cycles.
Upon step 2 finished, result += ... starts for loop i . which takes 3 cycles. it dependent on step 1 , consequently, step 3 is also a 'loop carried' dep, so could be candidates of "critical path".
Since step 3 is 3-cycle < 5 cycle, so step 1 becomes critical path.
What if step 3 ( assuming ) takes 10 cycles . Then to my understanding step 3 becomes critical path.

Attached the diagram as below:

Danieldaniela answered 22/7, 2021 at 12:31 Comment(3)

a[i] * xpwr isn't loop-carried (so it doesn't form a "line"), it forks off from the xpwr *= x dependency chain. Its 5-cycle latency is how far behind the result += ... dep chain is. – Peruzzi 22/7, 2021 at 19:6

Thanks @PeterCordes , I make an update based on your clarification. – Danieldaniela 23/7, 2021 at 3:31

thanks again @PeterCordes :) for all the details you help demonstrate for me. – Danieldaniela 23/7, 2021 at 5:19

Graph of the dependency pattern across iterations

Second loop, Horner's method - fast with FMA, else 8 cycles

Footnote 1: similarity to real hardware

Update

Recommended topics

Hot tags