What is a Partial Flag Stall?

I was just going over this answer by Peter Cordes, and he says:

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead. Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

I don't feel like I understand yet what a "partial flag stall" is. How do I know one has occurred? What triggers the event other than sometimes when flags are read? What does it mean to merge flags? In what condition are "some of the flags written" but a partial-flag merge doesn't happen? What do I need to know about flag stalls to understand them?

Leung asked 16/4, 2018 at 23:21
Peter Cordes and others probably have a more comprehensive explanation but, the way I understand it, flag bits are renamed separately in register renaming. For the instructions that set all flag bits, which is the majority, the state of all those "registers" can be reset all at once, but for instructions that only affect a subset of the flag bits, the actual flag values need to be merged from the current instruction as well as the last one that set the remaining flag bits, if that makes sense. This merging (sometimes) takes extra time.
My mental model was just that instructions operated on a global flags register serially? Is that not true? Looking forward to Peter's answer if he buzzes in.Leung
@EvanCarroll: EFLAGS is renamed of course. How could add have 4 per clock throughput if you didn't break the WAW hazard? (And yes, different groups of flags are renamed separately, so inc can also have 4 per clock throughput and no input dependency on FLAGS, like how some Intel CPUs can rename ah separately from al when they're written separately.) Working on an answer, but see Agner Fog's microarch guide: agner.org/optimize. He explains partial-flag stalls and merges.Scorpaenid
I'm going to shut up and await the answer. I won't lie about having Amazon-d your name a few times. Just take my money in the event you ever put out a book on x86, Linux, or Radare.Leung

Generally speaking, a partial flag stall occurs when a flag-consuming instruction reads one or more flags that were not written by the most recent flag-setting instruction.

So an instruction like inc that sets only some flags (it doesn't set CF) doesn't inherently cause a partial stall, but it will cause one if a subsequent instruction reads a flag (here CF) that inc did not write, without any intervening instruction that sets that flag. This also implies that instructions that write all of the interesting flags are never involved in partial stalls: when such an instruction is the most recent flag-setting instruction at the point a flag-reading instruction executes, it must have written the consumed flags.

So, in general, an algorithm for statically determining whether a partial-flags stall will occur is to look at each instruction that uses the flags (generally the jcc family, cmovcc, and a few specialized instructions like adc), then walk backwards to find the first instruction that sets any flag, and check whether it sets all of the flags read by the consuming instruction. If not, a partial-flags stall will occur.

Later architectures, starting with Sandy Bridge, don't suffer a partial-flags stall per se, but still suffer a penalty in the form of an additional uop inserted by the front-end for the instruction in some cases. The rules are slightly different and apply to a narrower set of cases than the stall discussed above. In particular, the so-called flag-merging uop is added only when a flag-consuming instruction reads multiple flags and those flags were last set by different instructions. This means, for example, that instructions that examine a single flag never cause a merging uop to be emitted.

Starting from Skylake (and probably starting from Broadwell), I find no evidence of any merging uops. Instead, the uop format has been extended to take up to 3 inputs, meaning that the separately renamed carry flag and the renamed-together SPAZO flag group can both be used as inputs to most instructions. Exceptions include instructions like cmovbe, which has two register inputs and whose condition (be) requires both the C flag and one or more of the SPAZO flags. Most conditional moves use only one or the other of the C and SPAZO flag groups, however, and take one uop.
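
For illustration, the uop counts here are the Skylake numbers discussed above and in the comments below, so treat them as observations for that uarch rather than universal rules:

cmp    rcx, rdx   ; sets CF and the SPAZO flag group
cmovb  rax, rbx   ; condition reads only CF: two register inputs plus one flag group, 1 uop
cmovbe rax, rbx   ; condition reads CF and ZF: two register inputs plus two flag groups, 2 uops on Skylake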

Examples

Here are some examples. We discuss both "[partial flag] stalls" and "merge uops", but, as above, at most one of the two applies to any given architecture, so something like "the following causes a stall and a merge uop to be emitted" should be read as "the following causes a stall [on those older architectures which have partial-flag stalls] or a merge uop [on those newer architectures which use merge uops instead]".

Stall and merging uop

The following example will cause a stall and merging uop to be emitted on Sandy Bridge and Ivy Bridge, but not on Skylake:

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
ja  label    ; reads CF and ZF

The ja instruction reads CF and ZF which were last set by the add and inc instructions, respectively, so a merge uop is inserted to unify the separately set flags for consumption by ja. On architectures that stall, a stall occurs because ja reads from CF which was not set by the most recent flag setting instruction.

Stall only

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jc  label    ; reads CF

This causes a stall because, as in the prior example, CF is read but was not set by the last flag-setting instruction (here inc). In this case, the stall could be avoided by simply swapping the order of the inc and add, since they are independent; the jc would then read only from the most recent flag-setting operation. There is no merge uop needed because the flags read (only CF) all come from the same add instruction.
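
For reference, the reordered version of the sequence would be:

inc rax      ; sets ZF, but not CF
add rbx, 5   ; sets CF, ZF, others
jc  label    ; reads CF, which was written by the most recent flag-setting instruction (add)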

Note: this case is under debate (see the comments), but I cannot test it because I find no evidence of any merging uops at all on my Skylake.

No stall or merging uop

add rbx, 5   ; sets CF, ZF, others
inc rax      ; sets ZF, but not CF
jnz  label   ; reads ZF

Here there is no stall or merging uop needed, even though the last instruction (inc) only sets some flags, because the consuming jnz only reads (a subset of) flags set by the inc and no others. So this common looping idiom (usually with dec instead of inc) doesn't inherently cause a problem.
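
For concreteness, a sketch of that idiom (the label and counter register are just illustrative):

loop_top:
    ; ... loop body ...
    dec rcx        ; sets ZF and the other SPAZO flags, but not CF
    jnz loop_top   ; reads only ZF, which dec just wrote: no stall, no merge uop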

Here's another example that doesn't cause any stall or merge uop:

inc rax      ; sets ZF, but not CF
add rbx, 5   ; sets CF, ZF, others
ja  label    ; reads CF and ZF

Here the ja does read both CF and ZF, and an inc is present which doesn't set CF (i.e., a partial-flag-writing instruction), but there is no problem because the add comes after the inc and writes all the relevant flags.

Shifts

The shift instructions sar, shr and shl, in both their variable and fixed count forms, behave differently (generally worse) than described above, and this varies a fair amount across architectures. This is probably due to their weird and inconsistent flag handling1. For example, on many architectures there is something like a partial-flags stall when reading any flag after a shift instruction with a count other than 1. Even on the most recent architectures, variable shifts have a significant cost of 3 uops due to flag handling (but there is no longer a "stall").
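
As a sketch of the pattern described above (the exact penalty varies by architecture, so check Agner Fog's tables for your CPU):

shr rax, 2     ; shift with a count other than 1: flag results are the awkward case
jnz label      ; reading any flag here can incur the stall-like penalty on many older uarches
               ; (a count of 0 would leave the flags untouched entirely)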

I'm not going to include all the gory details here, but I'd recommend looking for the word shift in Agner's microarch doc if you want all the details.

Some rotate instructions also have interesting flag related behavior in some cases similar to shifts.


1 For example, setting different subsets of flags depending on whether the shift count is 0, 1 or some other value.

Avernus answered 17/4, 2018 at 0:41
I think your "stall only" example still produces a merging uop on Intel CPUs. I think if you were designing a CPU which could tell the difference between reading only flags from one older insn vs. a mix of writers, it would be able to read ZF from the separately-renamed group of flags that includes ZF without stalling or merging. Like how Intel CPUs can run inc al and inc ah in parallel without triggering a merge of EAX or a stall. But for flags, Intel just punts to the merge case for anything that doesn't take the fast path.Scorpaenid
@PeterCordes - weird, I wrote some tests but I can't see evidence of the extra merging op in any of those cases. I would expect 1 to have a merging uop, 2 is the case under discussion, and 3 I would expect never to have a merging uop, but I always see 3 uops total for each triplet of inc,add,jcc for all the performance counters I checked, and the performance is the same for all variants. I assumed these uops would show up in the perf counters? Skylake.Avernus
@PeterCordes - see this thread: it seems that the actual occurrence of merging uops is perhaps much less than previously believed, at least on Skylake but perhaps on earlier architectures too (I just don't have them to test on). See this thread - what seems to have happened is that the extra uop is actually due to the lack of macro-fusion, and so in many cases there is no extra merging uop (but there still is an extra uop). I haven't investigated much beyond this, but it's entirely possible that inc never results in a merging uop.Avernus
I think that one thing that's missing in your answer is why we got here in the first place. The fact is that in the old days instructions were executed one at a time and we did not have such problems. The FLAGS register would always be set as expected. Now that we have processors with superscalar capabilities, we got that problem of which instruction is changing the FLAGS register. (And in comparison, RISC processors do not have any FLAGS register in part to avoid such problems.)Cercus
@AlexisWilke - I don't think I agree here. Superscalar execution has all sorts of complications, but flags can be handled through renaming just like regular registers. In fact, in some sense the problem is easier because there is only 1 flag of each type, rather than 16 or 32 architecturally visible registers. So I believe the root of the problem is not writing all the flags in a single instruction: this combined with the need to rename them efficiently means that you face some tough tradeoffs: e.g., do you rename the flags separately (expensive since you need to ...Avernus
... now possibly pull in 2 or 3 registers for conditions that check 2 or 3 flags), or do you rename them all together, and then take a hit to "merge" them in the case the needed flags weren't written by the same instruction. All of these and variants have apparently been tried: if you search the patent literature for "SPAZO, flags" you find a lot of good stuff. I don't really agree about RISC - I don't think avoidance of flags is a core tenet. The two most successful RISC uarches, ARM and POWER, both seem to have flags.Avernus
Holy crap, I'd never noticed that cmovbe and cmova were 2-uop instructions on SKL. It seems the latency from first-operand to destination is still 1 cycle, though. A cmp ebx, 123 / times 6 cmovbe ecx, ebx loop body (loop carried dependency through ECX only) runs at about 1 iter per 6.5 cycles, vs. 6.00 for cmovb or cmovz. I think your conclusion about instructions having separate inputs for the 2 flags sounds likely, e.g. for jbe.Scorpaenid
@PeterCordes - oddly, setbe and friends are 2 uops too. Oddly in the sense they only have one input, so if GP regs and flag regs were fungible, it seems like this could be 1 uop.Avernus
@PeterCordes - makes sense about latency, I expect the first uop to be dealing with flags only, so it wouldn't be part of the critical path in the type of test you did (but it is not like a flag merging uop in that it always happens, and it can only go to p06). uops.info confirms the latency is only 1 through the second operand, but 2 through the flags. Not sure why they don't have 1 -> 1 there.Avernus
I guess you can say uops can take up to two GP inputs and 1 flag (group), or 2 flag groups (like jbe and the hidden uop before cmovbe and setbe). Or something like that.Avernus
Oh, the 6.5c/iter I saw in my cmovbe test wasn't latency, it was port pressure from cmovbe plus the loop branch (dec/jnz). Yeah, setbe is weird, and surprisingly uops.info reports that it takes 2 uops even on SnB/IvB (uops.info/html-instr/SETBE_NOREX_R8.html), up from 1 in P6-family. In my testing for Haswell/Skylake partial regs, I found setcc ah depends on the old AH, so it seems HSW/SKL define setcc as an RMW instruction, perhaps for better interaction with writing a low8 in the normal use-case? I must have had a single-uop setcc.Scorpaenid
@PeterCordes - I mean your test shows the latency is 1, because the ops are dependent through ecx, so the latency must be 1 for a 6.5 result. That's what I thought you were getting at. I agree the extra 0.5 comes from p06 pressure.Avernus
It would have been so much better if AMD64 redefined those opcodes to be setcc r32 / m8 (or m32); no false dep or xor-zeroing to get a zero-extended result.Scorpaenid
Yeah, I understood that cmovbe 1->1 latency = 1. And yeah, IDK why uops.info didn't test that. I was just saying that the 6.5 isn't due to the latency dep chain at all.Scorpaenid
@PeterCordes - yeah I can believe they are RMW, but that's still only 1 input reg compared to 2 for cmov, so it's too bad they don't become 1 uop (as they would if you just naively counted "inputs"). I don't find any example of a non-jcc instruction that can take two flags in 1 uop.Avernus
Yeah, agreed, you'd expect 1 uop with 2 flag + 1 integer input to be possible. But perhaps that would be another special case that would require more hardware logic somewhere important (e.g. in uop dependency tracking / the RS). rcl r32,1 is 3 uops, and that "only" needs both halves of flags + 1 reg as input. (And also writes both halves of flags as output.) Of course it's obscure enough that Intel might simply have not optimized it fully. This might be different from the adc al,0 short-form encoding where the HSW decoders fail to use the 1-uop special case for imm=0.Scorpaenid
@Avernus when you say the merging uop is in the 'front-end', where in the pipeline is it inserted? Any guesses why it's different than the partial-register merging uop which AFAIK is inserted in the ROB?Saline
Well I think I just mean it's somewhere in the front-end because that's where the uop stream is generated? That is, I would expect any uop inserted dynamically into the ordered uop stream to occur in the front end, suitably defined, since that's where the stream is generated. It could be at the ROB sure, or maybe more exactly "at rename", which is sort of the dividing line between front-end and back-end I think. I don't actually know where it is inserted exactly, sorry.Avernus

A flag-modifying uop may update only part of the flags register. The RAT has one entry for the flags/eflags/rflags register, plus a mask showing which flags were changed by the uop that caused the physical register the entry points to to be assigned. If a series of instructions occur that read and write the same flag, then a separate physical register gets assigned for each write and each read uses the previous physical register. Only that flag is written into those registers; all the other flag bits are left clear. That's why the current physical register cannot be used when a read occurs from a different flag that is not in the mask in the flags RAT entry: it would read a clear bit and not the real state of the flag that has been left behind. On old microarchitectures, a stall occurs until the state of the flags register is valid in the RRF (by waiting for the retirement of each earlier flag-setting uop to insert the bits it sets into the RRF flags register; each uop is examined to know the architectural registers it uses / flags it changes, which is an easier format to interpret than x86 macro-ops).

On microarchitectures that use the PRF scheme (SnB onwards), a merging uop is required to keep a unified flags register because there is no dedicated RRF register; otherwise the retirement RAT would be pointing to a meaningless physical register with only one of the flags in it. The merging uop occurs after every partial-flags-modifying instruction like inc or dec. add modifies all 6 status flags and therefore does not require a merge uop. I think this probably implies that the status, control and system flags are renamed separately on the PRF scheme, given that add does not require a merging uop. Apparently the CF flag is renamed separately from the SPAZO cluster.

Partial register stalls are similar. The RAT has 2 entries to represent rax: one for al/ax/eax/rax (distinguished by a size indicator in the entry) and one for ah (both are updated on a write to ax, eax or rax to point to the same register). It needs only 2 entries because there are only 2 mutually exclusive registers. If a read from eax occurs before a previous write to one of the smaller registers retires, then the allocator stalls (because the ROB entry cannot have 2 dependencies for the same operand) until the full register is present in the RRF, and then it renames both entries to the RRF register for rax.

In later microarchitectures that use the PRF scheme, this is difficult because a single RRF entry for rax is no longer kept. Therefore, a merging uop needs to be used, which also happens to be faster than the stall method of the previous microarchitectures.
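
A minimal sketch of the register case being described (mirroring the mov al,1 / add eax,1 example from the comments below):

mov al, 1    ; partial write: leaves the upper bits of eax/rax untouched
add eax, 2   ; reads the full eax, so the old upper bits must be combined with the new al:
             ; per the description above, a stall on the older RRF designs, a merging uop on the PRF designs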

Merging uop implementations

  1. One implementation of the merging uop could be that it is inserted before every write to a partial flag / register, and the merging uop reads from the full register / flags register before writing it all to a new physical register. The write is then allocated the same register, which results in the write naturally merging itself in. The following read can then read any part of the register / any flag. This basically sets up a dependency chain between every partial-flag writing instruction and a previous flag writing instruction (partial or full) and between every partial register write and a previous (full / partial) write to the register. In this instance, the RAT never has partial renames.

  2. It could be allocated immediately after the write to a partial register. The merge uop takes the previous physical register (which will always be a full rax/eax write, or in the case of flags, a full status flag update, like that which is done by add or the merge uop) and the new physical register and combines them into the new physical register. This would suggest that the allocator inserts it. If it were inserted by the decoder, the allocator could allocate that uop in a different cycle, when the previous RAT pointer is unknown.

  3. It could be allocated immediately before a read from a register whose state in the RAT is not unified. This would imply that the RAT tracks rax/eax separately from ax, al and ah. In this case, the 2 physical registers that need to be merged are taken from the RAT.

The optimisation manual implies it is one of the latter 2 scenarios: 'The merging uop occurs after every partial register write' (i.e. a write to ax, al or ah, but not eax).

Gramnegative answered 27/1, 2021 at 7:17
"How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" shows that AL / AX aren't renamed separately from RAX in Haswell (or maybe IvB) or later, only AH.Scorpaenid
@PeterCordes in this answer, what I said was that there is one entry that al, eax and rax share, and ah has a separate entryGramnegative
I was talking about your list at the end of possible merge-uop implementations. You talked about the RAT tracking RAX separately from AX, AL, and AH. But HSW simplified that. Before that, mov al, ... did avoid a false dependency on the old value of RAX, so there was some mechanism that could track a separately renamed AL and AH, neither of them having a false dep on RAX. (IIRC, Intel's optimization manual mentions Sandybridge choosing not to rename AL when you're doing an RMW operation anyway, like inc al. But for write-only access, it will rename it separately, I think.)Scorpaenid
@PeterCordes if you read from eax then a previous write to ax/al needs to retire so that a ROB entry can be assigned. RAT will know this because when it goes to rename the eax read, it will see that the current al/ax/eax/rax has a width of 8 or 16 bits, so it stalls until the retire stage of the ax/al write instruction makes the RAT al/ax/eax/rax entry point to the accumulator RRF entry with a 32 bit width. When a write to ah retires it writes to the accumulator RRF register (there is only one) and states that it is now 32 bit in width as there is no instruction yet to retire before it.Gramnegative
And if you go to read eax/ax and the ah RAT entry does not have a width indicator of 32/16 bits respectively (i.e. it has a width of 8 bits), then it also stalls until it is eventually made to point to the RRF accumulator entry, which it eventually will be. For example, on a 64 bit RRF, when a write to ah or any a register retires, the RRF accumulator is always 64 bits, and you have to wait until the RAT is made to point to it, which happens automatically when there are no renames in flight before it due to a stallGramnegative
Because, when it retires and nothing else is in flight, that ROB entry number will still be in the RAT, and when a ROB entry retires and there is a RAT entry with that ROB number, it is automatically made to point to the RRF. The width indicator of the RAT entry is also updated to 64 bit in this process.Gramnegative
I will think a moment about how to optimise this.Gramnegative
On Sandybridge specifically, for mov al,1 / add eax, 1 we know(?) that an AL-merging uop (into RAX) gets inserted, without having to stall anything. An AH-merging uop seems to have to issue in a cycle by itself even on modern Intel. Intel's optimization manual documents both of those behaviours for Sandybridge specifically. SnB doesn't have an RRF. On Core2/Nehalem, it stalls the front-end for about 3 cycles to insert a merging uop, not indefinitely until RRF update. I don't know how Intel manages to create this behaviour with only 2 RAT entries per register, but that's the reality.Scorpaenid
@PeterCordes yes, that's right. The way the PRF scheme works allows a merging uop like the one I suggested to be possible, and there isn't a stall. I'm not sure about there being a merging uop on an RRF scheme, however; what was the proof that that happens, again? And also I'm not sure when or whether there would ever be just a stall on a PRF scheme.Gramnegative
The "3 cycle with merging uop" description is from Agner Fog's microarch PDF, an improvement in Core 2 / Nehalem vs. Pentium-M and earlier. He doesn't say how he measured it, but one experiment could be two long dep chains (e.g. imul latency), one in the shadow of the other, with a partial reg-stall in the later / shorter one. If it merges, you won't see an overall increase in cycles per loop iteration, but if it fully stalls until the result is in the RRF that would also have to wait for the other dep chain. (IDK if his "5-6 cycle" stall on older uarches is best-case or serializing.)Scorpaenid
