Does it cost significant resources for a modern CPU to keep flags updated?
Asked Answered
P

1

5

As I understand it, on a modern out of order CPU, one of the most expensive things is state, because that state has to be tracked in multiple versions, kept up-to-date across many instructions etc.

Some instruction sets like x86 and ARM make extensive use of flags, which were introduced when the cost model was not what what it is today, and the flags only cost a few logic gates. Things like every arithmetic instruction setting flags to detect zero, carry and overflow.

Are these particularly expensive to keep updated on a modern out of order implementation? Such that e.g. an ADD instruction updates the carry flag, and this must be tracked because although it will probably never be used, it is possible that some other instruction could use it N instructions later, with no fixed upper bound on N?

Are integer operations like addition and subtraction cheaper on instruction set architectures like MIPS that do not have these flags?

Premier answered 13/11, 2020 at 8:23 Comment(0)
M
6

Various aspects of this are not very publicly known, so I will try to separate definitely known things from reasonable guesses and conjecture.

An approach has been to extend the (physical) integer registers (whether they take the form of a physical register file [eg P4 and SandyBridge+] or of results-in-ROB [eg P3]) with the flags that were produced by the operation that also produced the associated integer result. That's only about the arithmetic flags (sometimes AFLAGS, not to be confused with EFLAGS), but I don't think the "weird flags" are the focus of this question. Interestingly there is a patent[1] that hints at storing more than just the 6 AFLAGS themselves, putting some "combination flags" in there as well, but who know whether that was really done - most sources say the registers are extended by 6 bits, but AFAIK we (the public) don't really know. Lumping the integer result and associated flags together is described in for example this patent[2], which is primarily about preventing a certain situation where the flags might accidentally no longer be backed by any physical register. Aside from such quirks, during normal operation it has the nice effect of only needing to allocate 1 register for an arithmetic operation, rather than a separate main-result and flags-result, so renaming is normally not made much worse by the existence of the flags. Additionally, either the register alias table needs at least one more slot to keep track of which integer register contains the latest flags, or a separate flag-renaming-state buffer keeps track of the latest speculative flag state ([2] suggests Intel chose to separate them, which may simplify the main RAT but they don't go into such details). More slots may be used[3] to efficiently implement instructions which only update a subset of the flags (NetBurst™ famously lacked this, resulting in the now-stale advice to favour add over inc). Similarly, the non-speculative architectural state (whether it would be part of the retirement register file or be separate-but-similar again is not clear) needs at least one such slot.

A separate issue is computing the flags in the first place. [1] suggests separating flag generation from the main ALU simplifies the design. It's not clear to what degree they would be separated: the main ALU has to compute the Adjust and Sign flags anyway, and having an adder output a carry out the top is not much to ask (less than recomputing it from nothing). The overflow flag only takes an extra XOR gate to combine the carry into the top bit with the carry out of the top bit. The Zero flag and Parity flag are not for free though (and they depend on the result, not on the calculation of the result), if there is partial separation it would make sense that those would be computed separately. Perhaps it really is all separate. In NetBurst™, flag calculation took an extra half-cycle (the ALU was double-pumped and staggered)[4], but whether that means all flags are computed separately or a subset of them (or even a superset as [1] hinted) is not clear - the flags result is treated as monolithic so latency tests cannot distinguish whether a flag is computed in the third half-cycle by the flags unit or just handed to the flags unit by the ALU. In any case, typical ALU operations could be executed back-to-back, even if dependent (meaning that the high half of the first operation and the low half of the second operation ran in parallel), the delayed computation of the flags did not stand in the way of that. As you might expect though, ADC and SBB were not so efficient on NetBurst, but there may be other reasons for that too (for some reason a lot of µops are involved).

Overall I would conclude that the existence of arithmetic flags costs significant engineering resources to prevent them from having a significant performance impact, but that effort is also effective, so a significant impact is avoided.

Mckenziemckeon answered 14/11, 2020 at 10:56 Comment(5)
Sign flag depends only on the result, not the calculation. It's just the MSB for most instructions. However, ZF is sometimes weird. For BSR/BSF (which Intel CPUs run as a single uop), ZF depends on the calculation (actually the input operand), not the output. AMD runs them as several uops, perhaps for that reason.Ranchero
Re: renaming CF separately from the SPAZO flag group: Skylake and later never have flag-merging uops, just reading the two parts as separate inputs if necessary (jbe or whatever). uops can have up to 3 inputs, so cmovbe is unfortunately 2 uops, unlike most other cmov instructions which are 1 uop. (2 integer inputs and 1 part of FLAGS). See @Bee's answer on What is a Partial Flag Stall?. So inc/dec is fully efficient even in an adc loop, unlike P6-family stalls, and earlier SnB merging uops which were still pretty cheap.Ranchero
I guess part of the question would be how much power it takes to run the FLAG-renaming logic. vs. the amount of extra instructions (and associated power to run them) that would be needed in an ISA without flags. In x86 specifically, or in a well-designed (pipelined RISC friendly) ISA with FLAGS, like PowerPC or AArch64. PowerPC as usual complicates everything by having 8 (IIRC) FLAGS slots in its condition register, allowing multiple flags results to be live at once, and allowing software pipelining of stuff using flags. (Some instructions like cmp take a flag-source or flag-dst arg)Ranchero
@PeterCordes power would be interesting to compare but I don't have that dataMckenziemckeon
Nor do I. But that (and die area) are where the performance cost lies (given enough engineering hours to find near-optimal solutions). You mention some area costs, like the extra 6 bits per PRF entry, so probably you could sneak in a mention that this must cost at least some power. That cuts into the power budget for turboing higher / more, and maybe slightly for sustained clocks at max TDP. (Although max-TDP at baseline clock frequency on x86 CPUs involves SIMD FMA units, so FLAGs renaming is probably even less significant there than when turboing on lighter-weight code.)Ranchero

© 2022 - 2024 — McMap. All rights reserved.