How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

I am on the hook to analyze some "timing channels" in some x86 binary code. I am posting this question to better understand the bsf/bsr opcodes.

At a high level, these two opcodes can be modeled as a "loop" that counts the leading or trailing zeros of a given operand. The x86 manual has a good formalization of these opcodes, something like the following:

IF SRC = 0
  THEN
    ZF ← 1;
    DEST is undefined;
  ELSE
    ZF ← 0;
    temp ← OperandSize – 1;
    WHILE Bit(SRC, temp) = 0
    DO
      temp ← temp - 1;
    OD;
    DEST ← temp;
FI;

But to my surprise, the bsf/bsr instructions seem to have fixed CPU cycle counts. According to some documents I found here: https://gmplib.org/~tege/x86-timing.pdf, it seems that they always take 8 CPU cycles to finish.

So here are my questions:

  1. I would like to confirm that these instructions have fixed CPU cycle counts. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.

  2. Then how is that possible? Apparently this is a "loop" of some sort, at least at a high level. What is the design decision behind it? Is it easier for CPU pipelines?

Hemeralopia answered 4/2, 2019 at 2:46 Comment(1)
The Intel manual pseudocode is not a strict implementation; the CPU is free to implement it however it wants, as long as the end result is the same. – Vernacularize

BSF/BSR performance is not data dependent on any modern CPUs. See https://agner.org/optimize/, https://uops.info/, or http://instlatx64.atw.hu/ for experimental timing results, as well as the https://gmplib.org/~tege/x86-timing.pdf you found.

On modern Intel, they decode to 1 uop with 3 cycle latency and 1/clock throughput, running only on port 1. Ryzen also runs them with 3c latency for BSF, 4c latency for BSR, but multiple uops. Earlier AMD is sometimes even slower.

(Prefer rep bsf aka tzcnt in code that might run on AMD CPUs, if you don't need the FLAGS difference between bsf and tzcnt for zero inputs. lzcnt and tzcnt are fast on AMD as well, like 1 cycle latency with 3/clock throughput for lzcnt on Zen 2 (https://uops.info/). Unfortunately lzcnt and bsr aren't compatible that way, so you can't use it in an "optimistic" forward-compatible way, you have to know which you're getting.)
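
As a concrete example, here is a minimal sketch (GNU C inline asm; the function name is mine) of the rep bsf trick: it assembles to the same bytes as tzcnt (F3 0F BC), so CPUs with BMI1 execute it as tzcnt while older CPUs ignore the rep prefix and run plain bsf. The two only agree for non-zero inputs, so the caller has to guarantee x != 0:

#include <stdint.h>

/* "rep bsf" = the tzcnt encoding; runs as tzcnt on new CPUs, bsf on old ones.
   Safe only for non-zero inputs, where the two give identical results. */
static inline uint64_t bit_scan_forward(uint64_t x)
{
    uint64_t r;
    __asm__ ("rep bsf %1, %0" : "=r"(r) : "r"(x) : "cc");
    return r;   /* result and FLAGS differ between bsf and tzcnt when x == 0 */
}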

Your "8 cycle" (latency and throughput) cost appears to be for 32-bit BSF on AMD K8, from Granlund's table that you linked. Agner Fog's table agrees, (and shows it decodes to 21 uops instead of having a dedicated bit-scan execution unit. But the microcoded implementation is presumably still branchless and not data-dependent). No clue why you picked that number; K8 doesn't have SMT / Hyperthreading so the opportunity for an ALU-timing side channel is much reduced.


Do note that they have an output dependency on the destination register, which they leave unmodified if the input was zero. AMD documents this behaviour, Intel implements it in hardware but documents it as an "undefined" result, so unfortunately compilers won't take advantage of it and human programmers maybe should be cautious. IDK if some ancient 32-bit only CPU had different behaviour, or if Intel is planning to ever change (doubtful!), but I wish Intel would document the behaviour at least for 64-bit mode (which excludes any older CPUs).

lzcnt/tzcnt and popcnt on Intel CPUs (but not AMD) have the same output dependency before Skylake and before Cannon Lake (respectively), even though architecturally the result is well-defined for all inputs. They all use the same execution unit. (How is POPCNT implemented in hardware?). AMD Bulldozer/Ryzen builds their bit-scan execution unit without the output dependency baked in, so BSF/BSR are slower than LZCNT/TZCNT (multiple uops to handle the input=0 case, and probably also setting ZF according to the input, not the result).

(Taking advantage of that with intrinsics isn't possible; not even with MSVC's _BitScanReverse64 which uses a by-reference output arg that you could set first. MSVC doesn't respect the previous value and assumes it's output-only. VS: unexpected optimization behavior with _BitScanReverse64 intrinsic)
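
As an aside, on the Intel CPUs that do have the false output dependency, the standard mitigation is a dependency-breaking xor-zeroing of the destination first; compilers insert the same idiom themselves. A minimal sketch in GNU C inline asm (my own function name):

#include <stdint.h>

/* xor-zeroing the destination breaks the dependency chain on its old value
   before popcnt writes it, avoiding the false-dependency stall. */
static inline uint64_t popcnt64(uint64_t x)
{
    uint64_t r;
    __asm__ ("xor %0, %0\n\t"
             "popcnt %1, %0"
             : "=&r"(r)          /* early-clobber: written before x is read */
             : "r"(x) : "cc");
    return r;
}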


The pseudocode in the manual is not the implementation

(i.e. it's not necessarily how hardware or microcode works).

It gives precisely the same result in all cases, so you can use it to understand exactly what will happen for any corner cases the text leaves you wondering about. That is all.

The point is to be simple and easy to understand, and that means modeling things in terms of simple 2-input operations which happen serially. C / Fortran / typical pseudocode doesn't have operators for many-input AND, OR, or XOR, but you can build that in hardware up to a point (limited by fan-in, the opposite of fan-out).

Integer addition can be modelled as bit-serial ripple carry, but that's not how it's implemented! Instead, we get single-cycle latency for 64-bit addition with far fewer than 64 gate delays using tricks like carry lookahead adders.
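
To make the analogy concrete, here is a bit-serial ripple-carry model of 64-bit addition in C: a specification-style model like the manual's bsf/bsr pseudocode, correct in all cases but nothing like how an ALU is actually built:

#include <stdint.h>

/* Bit-serial ripple-carry addition: a correct specification of 64-bit add,
   but real adders use carry-lookahead so the carry doesn't have to ripple
   through 64 serial stages. */
uint64_t add_ripple(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < 64; i++) {
        unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (uint64_t)(ai ^ bi ^ carry) << i;  /* full-adder sum bit */
        carry = (ai & bi) | (carry & (ai ^ bi));  /* full-adder carry-out */
    }
    return sum;
}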


The actual implementation techniques used in Intel's bit-scan / popcnt execution unit are described in US Patent US8214414 B2.

Abstract

A merged datapath for PopCount and BitScan is described. A hardware circuit includes a compressor tree utilized for a PopCount function, which is reused by a BitScan function (e.g., bit scan forward (BSF) or bit scan reverse (BSR)).

Selector logic enables the compressor tree to operate on an input word for the PopCount or BitScan operation, based on a microprocessor instruction. The input word is encoded if a BitScan operation is selected.

The compressor tree receives the input word, operates on the bits as though all bits have same level of significance (e.g., for an N-bit input word, the input word is treated as N one-bit inputs). The result of the compressor tree circuit is a binary value representing a number related to the operation performed (the number of set bits for PopCount, or the bit position of the first set bit encountered by scanning the input word).

It's fairly safe to assume that Intel's actual silicon works similarly to this. Other Intel patents for things like out-of-order machinery (ROB, RS) do tend to match up with performance experiments we can perform.
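
As a rough illustration of the patent's idea in C (not the actual circuit: the branchless popcount below stands in for the parallel compressor tree, and the pre-processing steps correspond to the "input word is encoded" stage):

#include <stdint.h>

/* Branchless popcount, standing in for the hardware compressor tree that
   sums all input bits in parallel. */
static unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0f0f0f0f;
    return (x * 0x01010101) >> 24;
}

/* BSF: encode so the bits below the lowest set bit become ones, then count. */
unsigned bsf_via_popcount(uint32_t x)      /* x != 0 */
{
    return popcount32((x - 1) & ~x);
}

/* BSR: smear the highest set bit down through all lower positions, then
   count the ones and subtract 1. */
unsigned bsr_via_popcount(uint32_t x)      /* x != 0 */
{
    x |= x >> 1;  x |= x >> 2;  x |= x >> 4;
    x |= x >> 8;  x |= x >> 16;
    return popcount32(x) - 1;
}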

AMD may do something different, but regardless we know from performance experiments that it's not data-dependent.


It's well known that fixed latency is a hugely beneficial thing for out-of-order scheduling, so it's very surprising when instructions don't have fixed latency. Sandybridge even went so far as to standardize uop latencies to simplify the scheduler and reduce the opportunities for write-back conflicts (e.g. a 3-cycle latency uop followed by a 2-cycle latency uop to the same port would produce 2 results in the same cycle). This meant making complex-LEA (with all 3 components: [disp + base + idx*scale]) take 3 cycles instead of just 2 for the 2 additions like on previous CPUs. There are no 2-cycle latency uops on Sandybridge-family. (There are some 2-cycle latency instructions, because they decode to 2 uops with 1c latency each. The scheduler schedules uops, not instructions.)

One of the few exceptions to the rule of fixed latency for ALU uops is division / sqrt, which uses a not-fully-pipelined execution unit. Division is inherently iterative, unlike multiplication where you can make wide hardware that does the partial products and partial additions in parallel.

On Intel CPUs, variable-latency for L1d cache access can produce replays of dependent uops if the data wasn't ready when the scheduler optimistically hoped it would be.

Gassaway answered 4/2, 2019 at 3:20 Comment(1)
According to this, both bsf/bsr had variable latencies in the IA-32 Execution Layer, which is basically a software emulator for x86 on Itanium, using an inefficient sequence of Itanium instructions. The patent itself proposes a number of fast, fixed-latency algorithms (some of which make use of the compute zero index (czx) instruction on Itanium). – Suppurate

The 80x86 manual has a good description of the expected behavior, but that has nothing to do with how it's actually implemented in silicon in any model from any manufacturer.

Let's say that there have been 50 different CPU designs from Intel, 25 CPU designs from AMD, and 25 more from other manufacturers (VIA, Cyrix, SiS/Vortex, NSC, ...). Out of those 100 different CPU designs, maybe there are 20 completely different ways that BSF has been implemented; maybe 10 of them have fixed timing, 5 have timing that depends on every bit of the source operand, and 5 depend on groups of bits of the source operand (e.g. something like "if the highest 32 bits of a 64-bit operand are zero { switch to 32-bit logic that's 2 cycles faster }").
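
For illustration, that last hypothetical kind might look like this in C (my own sketch, not any real CPU's design); the iteration count, and hence the time, depends on the operand value, which is exactly what would create a timing channel:

#include <stdint.h>

/* Hypothetical data-dependent implementation: the early-out for a zero high
   half and the bit-by-bit scan both make the running time depend on the data. */
unsigned bsr64_variable(uint64_t v)        /* v != 0 */
{
    unsigned i = (v >> 32) ? 63 : 31;      /* "switch to 32-bit logic" shortcut */
    while (((v >> i) & 1) == 0)            /* loop length depends on the data */
        i--;
    return i;
}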

I would like to confirm that these instructions have fixed CPU cycle counts. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.

You can't. More specifically, you can test or research existing CPUs, but that's a waste of time because next week Intel (or AMD or VIA or someone else) can release a new CPU that has completely different timing.

As soon as you rely on "measured from existing CPUs", you're doing it wrong. You would have to rely on "architectural guarantees" that apply to all future CPUs, and there is no such architectural guarantee. You have to assume that there may be a timing side-channel (even if there isn't one for current CPUs).

Then how is that possible? Apparently this is a "loop" of some sort, at least at a high level. What is the design decision behind it? Is it easier for CPU pipelines?

Instead of doing a 64-bit BSF, why not split it into a pair of 32-bit pieces and do them in parallel, then merge the results? Why not split it into eight 8-bit pieces? Why not use a table lookup for each 8-bit piece?
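
A sketch of that last idea in C (the msb8 table and the branchless merge are my own illustration, not a description of any real hardware): the four byte lookups are independent and could run in parallel, with a final select picking the highest non-zero byte:

#include <stdint.h>

static uint8_t msb8[256];  /* msb8[b] = index of the highest set bit in byte b */

static void msb8_init(void)
{
    for (int b = 1; b < 256; b++) {
        int i = 7;
        while (!(b & (1 << i)))
            i--;
        msb8[b] = (uint8_t)i;  /* msb8[0] stays 0; a zero input is undefined anyway */
    }
}

/* 32-bit BSR from four independent byte lookups plus a branchless merge. */
unsigned bsr32_bytes(uint32_t v)           /* v != 0 */
{
    unsigned r3 = 24 + msb8[(v >> 24) & 0xff];
    unsigned r2 = 16 + msb8[(v >> 16) & 0xff];
    unsigned r1 =  8 + msb8[(v >>  8) & 0xff];
    unsigned r0 =       msb8[ v        & 0xff];

    /* all-ones mask for the highest non-zero byte, zero for the rest */
    uint32_t m3 = -(uint32_t)((v >> 24) != 0);
    uint32_t m2 = -(uint32_t)((v >> 16) != 0) & ~m3;
    uint32_t m1 = -(uint32_t)((v >>  8) != 0) & ~m3 & ~m2;
    uint32_t m0 = ~(m3 | m2 | m1);

    return (m3 & r3) | (m2 & r2) | (m1 & r1) | (m0 & r0);
}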

Vaporing answered 4/2, 2019 at 4:4 Comment(2)
It's theoretically possible for instructions to become data-dependent on future CPUs, but it's exceedingly unlikely for real hardware unless something completely fundamental changes about how CPUs are designed. Hadi commented that Itanium's emulation layer had variable-latency bsf/bsr, but I think emulation is the only plausible case. (That might include Transmeta's Crusoe, where longer-range internal JIT optimization could optimize for a case with a known input range, or a constant value.) – Gassaway
But yes, +1 for looking for architectural guarantees, like I think AES-NI provides for AESENC / AESDEC. Of course you won't find any for "normal" instructions, so all we can really say is that you can't guarantee this in a fully future-proof way, mostly because of emulators. – Gassaway

The answers posted have explained well that the implementation is different from the pseudocode. But if you are still curious why the latency is fixed and not data-dependent, and doesn't use any loops for that matter, you need to look at the electronics side of things. One way you could implement this feature in hardware is by using a priority encoder.

A priority encoder will accept n input lines that can be on or off (0 or 1) and give out the index of the highest-priority line that is on. Below is a table from the linked Wikipedia article, modified for a most-significant-set-bit function.

input | output | index of highest set bit
0000  |   xx   | undefined
0001  |   00   | 0
001x  |   01   | 1
01xx  |   10   | 2
1xxx  |   11   | 3

x denotes that the bit value does not matter and can be anything.

If you look at the circuit diagram in the article, there are no loops of any kind; it is all parallel.
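
The same 4-input priority encoder can be written as pure combinational logic in C (a model of the circuit, with my own function and type names): every output bit is a fixed boolean function of the inputs, with no loops or branches:

#include <stdbool.h>

/* 4-to-2 priority encoder: b3 is the most significant input line.
   valid is 0 when no line is set (the "undefined" / ZF case). */
typedef struct { bool valid; unsigned idx; } enc4;

enc4 priority_encode4(bool b3, bool b2, bool b1, bool b0)
{
    enc4 out;
    out.valid = b3 | b2 | b1 | b0;        /* any input line set at all? */
    bool o1 = b3 | b2;                    /* high bit of the index */
    bool o0 = b3 | (!b2 & b1);            /* low bit of the index */
    out.idx = ((unsigned)o1 << 1) | (unsigned)o0;
    return out;
}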

Hermelindahermeneutic answered 23/3, 2019 at 20:10 Comment(0)

Obviously it is faster with custom hardware, yet it's possible even in software. The implementation below has a data-independent latency of 16 cycles, assuming that the operations used each have a latency of 1 cycle.

size_t bsr(uint32_t v) {
    size_t r = 0, t;
    /* Binary search for the highest set bit: test whether the upper half
       is non-zero; if so, shift it down and record the step in r. */
    t = (0 != (v >> 16)) << 4; v >>= t; r |= t;
    t = (0 != (v >>  8)) << 3; v >>= t; r |= t;
    t = (0 != (v >>  4)) << 2; v >>= t; r |= t;
    /* v is now in [0, 15]; its 2-bit log2 is packed into the constant. */
    return r + ((0xffffaa50 >> (v << 1)) & 3);
}

The hardware likely implements an asynchronous logic circuit that takes under 3 cycles to settle in the worst case, which also explains why it is a single micro-op that executes only on the port that has that hardware.

Carburetor answered 6/5 at 20:15 Comment(0)
