Why are conditionally executed instructions not present in later ARM instruction sets?

7

32

Naively, conditionally executed instructions seem like a great idea to me.

As I read more about ARM (and ARM-like) instruction sets (Thumb2, Unicore, AArch64) I find that they all lack the bits for conditional execution.

Why is conditional execution missing from each of these?

Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?

Waitabit answered 4/3, 2014 at 10:15 Comment(0)
37

The general claim is that modern systems have better branch predictors and compilers are much more advanced, so the cost of conditional execution in instruction encoding space is no longer justified.

This is from the ARMv8 Instruction Set Overview:

The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

And it continues

A very small set of “conditional data processing” instructions are provided. These instructions are unconditionally executed but use the condition flags as an extra input to the instruction. This set has been shown to be beneficial in situations where conditional branches predict poorly, or are otherwise inefficient.

Another paper titled Trading Conditional Execution for More Registers on ARM Processors claims:

... conditional execution takes up precious instruction space as conditions are encoded into a 4-bit condition code selector on every 32-bit ARM instruction. Besides, only small percentages of instructions are actually conditionalized in modern embedded applications, and conditional execution might not even lead to performance improvement on modern embedded processors.

Tow answered 4/3, 2014 at 10:54 Comment(3)
In addition, predication does not play well with out-of-order execution: it can require four data flow source operands (predicate, current value of destination register [needed if predicate is false], and two source register values) which must be checked for availability. AArch64's predicated instructions only require three sources (which is more likely to be supported by the OoO machinery [e.g., to support FMA] and is more friendly to cracking into 2-source µops [like Alpha 21264 did for CMOV]). – Dried
I couldn't even find a conditional branch to register or conditional return, and no conditional loads. – Vivacious
@Vivacious Conditional loads are particularly tricky because you cannot easily split them into multiple µops. – Sponge
13

It's somewhat misleading to say that conditional execution is not present in ARMv8. The issue is to understand why you don't want to execute some instructions. Perhaps in the early ARM days, the actual non-execution of instructions mattered (for power or whatever) but today the significance of this feature is that it allows you to avoid branches for small dumb jumps, for example code like a=(b>0? 1: 2). This sort of thing is more common than you might imagine --- conceptually it's things like MAX/MIN or ABS (though for some CPUs there may be instructions to do these particular tasks).
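
As a rough illustration of the "small dumb jumps" described above, here is a minimal C sketch of the three patterns mentioned (the `a=(b>0? 1: 2)` example, MAX, and ABS). The function names are made up for this example. Each function is a plain select between values that are cheap to compute, which is the shape a compiler can often lower to a conditional-select instruction instead of a branch, although whether it actually does so depends on the compiler and target.

```c
#include <stdint.h>

/* The example from the text: a = (b > 0 ? 1 : 2). */
int32_t one_or_two(int32_t b) {
    return b > 0 ? 1 : 2;
}

/* MAX as a select between two values that are already available. */
int32_t max_i32(int32_t x, int32_t y) {
    return x > y ? x : y;
}

/* ABS as a select between x and its negation. The negation is done in
 * unsigned arithmetic so that INT32_MIN does not overflow; the result
 * is returned as an unsigned value for the same reason. */
uint32_t abs_u32(int32_t x) {
    uint32_t ux = (uint32_t)x;
    return x < 0 ? 0u - ux : ux;
}
```

None of these needs a store, a call, or a load from a possibly invalid pointer on either path, which is why they are good candidates for branchless code.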

In ARMv8, while there are not general conditionally executed instructions there are a few instructions that perform the specific task I am describing, namely allowing you to avoid branching for short dumb jumps; CSEL is the most obvious example, though there are other cases (e.g. conditional setting of conditions) to handle other common patterns (in that case the pattern of C short-circuited expression evaluation).
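
To make the difference from classic conditional execution concrete, here is a small C model of what the A64 conditional-select family computes. The C function names and `count_positive` are illustrative only. The point of the sketch is that each operation always runs and always writes its destination; the condition only picks which of two values is written.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CSEL:  result = cond ? rn : rm */
uint64_t csel(bool cond, uint64_t rn, uint64_t rm)  { return cond ? rn : rm; }

/* CSINC: result = cond ? rn : rm + 1 */
uint64_t csinc(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : rm + 1; }

/* CSINV: result = cond ? rn : ~rm */
uint64_t csinv(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : ~rm; }

/* CSNEG: result = cond ? rn : -rm (two's complement negation) */
uint64_t csneg(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : 0 - rm; }

/* Hypothetical use: count the elements greater than zero without a
 * branch in the loop body. The count is kept when v[i] <= 0 and
 * incremented otherwise, which is the CSINC pattern. */
uint64_t count_positive(const int64_t *v, size_t n) {
    uint64_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count = csinc(v[i] <= 0, count, count);
    }
    return count;
}
```

In real A64 code the `cond` argument would come from the condition flags set by an earlier compare; the model passes it as a boolean to keep the sketch self-contained.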

IMHO what ARM has done here is what makes the most sense. They've extracted the feature of conditional execution that remains valuable on modern CPUs (avoid many branches) while changing the details of the implementation to match the micro-architecture of modern CPUs.

Sybilla answered 22/6, 2014 at 4:51 Comment(0)
10

One of the reasons is because of instruction encoding.

In Thumb, you cannot squeeze four more bits into the tight 16-bit space when there isn't even enough room for the 3 high bits of the register operands (one per operand), which is why most instructions are restricted to a subset of only 8 registers. Note that Thumb-2 has a separate IT(E) instruction that selects the conditions for up to the next 4 instructions. You can't store the condition in the instruction itself, though, for the reason stated above.

For AArch64 the number of registers has been doubled compared to 32-bit ARM, but again there are no spare bits for the 3 new high bits of the register fields. If you want to keep the old encoding, you must "borrow" them either from the narrow 12-bit immediate or from the 4-bit condition. 12 bits is already small compared to other RISC architectures such as MIPS, and shrinking it further would make everything worse, so removing the condition is the better choice. Because branch prediction has become more and more advanced, that won't be much of a problem. It also makes implementing out-of-order execution easier, because there is now one less thing to rename and care about.
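
The bit budget above can be sketched in C. The field positions are those of the A32 data-processing (register operand) format: condition in bits 31:28, Rn in 19:16, Rd in 15:12, Rm in 3:0. The helper names and the two budget functions are made up for this illustration, and the budget deliberately counts only the condition and three register fields, ignoring everything else in the encoding.

```c
#include <stdint.h>

enum {
    A32_COND_BITS = 4, /* condition field on every A32 instruction  */
    A32_REG_BITS  = 4, /* 16 registers -> 4 bits per register field */
    A64_REG_BITS  = 5, /* 32 registers -> 5 bits per register field */
    REG_FIELDS    = 3  /* three-operand instruction: Rd, Rn, Rm     */
};

/* Field extraction for an A32 data-processing instruction
 * with a register second operand. */
unsigned a32_cond(uint32_t insn) { return insn >> 28; }
unsigned a32_rn(uint32_t insn)   { return (insn >> 16) & 0xFu; }
unsigned a32_rd(uint32_t insn)   { return (insn >> 12) & 0xFu; }
unsigned a32_rm(uint32_t insn)   { return insn & 0xFu; }

/* Bits spent on condition plus register fields in A32: 4 + 3*4. */
unsigned a32_budget(void) { return A32_COND_BITS + REG_FIELDS * A32_REG_BITS; }

/* Bits spent on register fields alone with 32 registers: 3*5. */
unsigned a64_budget(void) { return REG_FIELDS * A64_REG_BITS; }
```

For example, `0xE0810002` is `ADD r0, r1, r2` with condition 14 (AL, always), while `0x00810002` is `ADDEQ r0, r1, r2` with condition 0 (EQ). The two budget functions return 16 and 15, so dropping the 4-bit condition more than pays for one extra bit in each of the three register fields.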

Armand answered 4/3, 2014 at 13:22 Comment(2)
AArch64 does include a nice selection of predicated instructions, like conditional increment-and-select which is more powerful than x86's CMOV. It's definitely still designed for efficient branchless code where that's appropriate. But they are only ALU instructions, not predicated-store or predicated-load that let you branchlessly do a conditional load or store from a pointer that might be invalid. – Kipkipling
Yes, that's better than conditional execution for every instruction. – Armand
6

Conditional execution is a good choice in implementation of many auxiliary or bit-twiddling routines, such as sorting, list or tree manipulation, number to string conversion, sqrt or long division. We could add UART drivers and extracting bit fields in routers. Those have a high branch to non-branch ratio with somewhat high unpredictability too.

However, once you get beyond the lowest level of services (or increase the abstraction level by using a higher-level language), the code looks completely different: the code blocks inside the different branches of a condition consist more of moving data and calling subroutines. Here the benefits of those extra 4 bits rapidly fade away. It's not only personal development but cultural: culturally, programming has grown from unstructured (Basic, Fortran, Assembler) towards structured. Different programming paradigms are also better supported by different instruction set architectures.

A technological compromise could have been the possibility to compress the five bit 'cond.S' field to four or three most frequently used combinations.


  • A paper on profile guided mode selection, giving power, cycle time, code size and instruction count benchmarks for popular SA-110 thumb/ARM compiled routines. Some routines are better in ARM mode and other do better in Thumb. It depends on the algorithm and ultimately the code/compiler.
Brundisium answered 4/3, 2014 at 12:24 Comment(2)
"It's not only personal development but cultural." - what? – Shawm
@OJFord: I believe this means your personal development pipeline versus popular ones. OOP tends to jump all over the place without inlines. IPA can lead to quite significant optimizations with newer languages. I think it is a fair point from someone whose first language is not English. – Manciple
3

On the old ARM v4, the conditional instructions only saved time if there was a high probability that they would end up getting executed, or if the probability was about 50%, then if there were just 2 to 4 of them in a row. If they weren't getting executed, then it was wasting cycles to have to fetch past them, versus the overhead of using a branch to get past them. If they were being executed, the branch would be fetched but not executed.
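
The trade-off described above can be put into a toy cycle model. The cycle counts are assumptions chosen to resemble a simple three-stage ARM core without branch prediction: every instruction in the block costs 1 cycle whether or not its condition passes, a branch that falls through costs 1 cycle, and a taken branch costs 3 cycles. The function names are made up, and real cores will differ.

```c
/* Assumed costs, in cycles (illustrative, not measured). */
enum {
    INSN_CYCLES             = 1, /* any instruction, executed or skipped */
    BRANCH_NOT_TAKEN_CYCLES = 1, /* branch fetched but falls through     */
    BRANCH_TAKEN_CYCLES     = 3  /* branch taken, pipeline refilled      */
};

/* n conditional instructions in a row: all are fetched and each one
 * costs a cycle, regardless of whether the condition passes. */
double cost_conditional(int n) {
    return (double)n * INSN_CYCLES;
}

/* Branch-over version: with probability p_exec the block runs (the
 * branch falls through, then n instructions execute); otherwise the
 * branch is taken and the block is skipped entirely. */
double cost_branch(int n, double p_exec) {
    double run  = BRANCH_NOT_TAKEN_CYCLES + (double)n * INSN_CYCLES;
    double skip = BRANCH_TAKEN_CYCLES;
    return p_exec * run + (1.0 - p_exec) * skip;
}
```

Under these assumptions, with `p_exec = 0.5` the two versions tie at 4 instructions (4.0 cycles each), the conditional version wins below that, and the branch wins above it. With `p_exec = 1.0` the conditional version is always one cycle cheaper. Both results are consistent with the rule of thumb in this answer.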

A minor nuisance is that when debugging, placing a break on a conditional instruction always resulted in taking a break on that instruction, regardless of the condition (unless there's some really smart debugger that my company didn't have).

Seena answered 4/3, 2014 at 22:28 Comment(0)
0

"Why are conditionally executed instructions not present ..." "Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?"

Wikipedia's article on "Predication - Disadvantages" provides a bit of info:

"Disadvantages
Predication's primary drawback is in increased encoding space. In typical implementations, every instruction reserves a bitfield for the predicate specifying under what conditions that instruction should have an effect. When available memory is limited, as on embedded devices, this space cost can be prohibitive. However, some architectures such as Thumb-2 are able to avoid this issue (see below). Other detriments are the following:

  • Predication complicates the hardware by adding levels of logic to critical paths and potentially degrades clock speed.
  • A predicated block includes cycles for all operations, so shorter paths may take longer and be penalized.

Predication is most effective when paths are balanced or when the longest path is the most frequently executed, but determining such a path is very difficult at compile time, even in the presence of profiling information.

...

In the ARM architecture, the original 32-bit instruction set provides a feature called conditional execution that allows most instructions to be predicated by one of 13 predicates that are based on some combination of the four condition codes set by the previous instruction. ARM's Thumb instruction set (1994) dropped conditional execution to reduce the size of instructions so they could fit in 16 bits, but its successor, Thumb-2 (2003) overcame this problem by using a special instruction which has no effect other than to supply predicates for the following four instructions. The 64-bit instruction set introduced in ARMv8-A (2011) replaced conditional execution with conditional selection instructions.".

In "Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools", by Joseph A. Fisher, Paolo Faraboschi, and Cliff Young, on page 172:

"... full predication complicates the hardware, the ISA, and the compiler. Unlike speculation, which favors deeper pipelines and faster clocks, predication adds levels of logic to critical paths and potentially degrades clock speed. Predicate operands use precious encoding bits in all instructions, and bypassing operations with predicate operands considerably complicates the forwarding logic. Predication's benefits for acyclic or "control-oriented" code have been the subject of lively academic and commercial debate, and the jury is still out on whether the benefits of predication justify the massive hardware cost to support full predication.

The argument between full predication and partial predication is even more subtle. Full predication is more expressive and allows the compiler to predicate blocks that contain any combination of operations. Partial predication requires aggressive speculation and embeds some intrinsic limitations (for example, it cannot predicate blocks containing call operations). In terms of implementation complexity, full predication has much higher demands on the instruction encodings and the microarchitecture, as described previously, whereas partial predication with select operations is a good match for most microarchitectures and datapaths and has no impact on complexity, area, or speed.

Predication in the Embedded Domain
In the embedded domain, it is difficult to justify the code size penalty of a large set of predicate registers. Full predication implies a "pay up front' philosophy, in which the cost of the predicate machinery needs to be paid regardless of how often it is used. For example, adding 6 predicate bits to address 64 predicates helped push the IPF encoding to 42 bits per operation—an approach that would be prohibitively expensive for an embedded processor. ...".

Cost, TDP, and Patents, even the technical skill level necessary to develop a competing product all come into play. In this case it was a cost benefit realized from updated coding techniques, what was thought to be wanted wasn't really used, or at least not effectively (for the cost of its implementation).

As explained in another answer, the ARM manual says little about the reason, less than the RISC-V manual does. Here is what ARM had to say on page 8 of the "ARMv8 Instruction Set Overview":

"3 A64 OVERVIEW
The A64 instruction set provides similar functionality to the A32 and T32 instruction sets in AArch32 or ARMv7. However just as the addition of 32-bit instructions to the T32 instruction set rationalized some of the ARM ISA behaviors, the A64 instruction set includes further rationalizations. The highlights of the new instruction set are as follows:

  • ...

  • Reduced conditionality. Fewer instructions can set the condition flags. Only conditional branches, and a handful of data processing instructions read the condition flags. Conditional or predicated execution is not provided, and there is no equivalent of T32’s IT instruction (see §3.2).

...

3.2 Conditional Instructions
The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

A very small set of “conditional data processing” instructions are provided. These instructions are unconditionally executed but use the condition flags as an extra input to the instruction. This set has been shown to be beneficial in situations where conditional branches predict poorly, or are otherwise inefficient."

Further information is provided in section "4.3 Condition Codes", but it doesn't rationalize how the decision was arrived at.

The designers of the RISC-V ISA (an unrelated recently-designed ISA) explain (http://riscv.org/spec/riscv-spec-v2.0.pdf on page 23) some of what goes into designing a processor:

"The conditional branches were designed to include arithmetic comparison operations between two registers (as also done in PA-RISC, Xtensa, and MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or to only compare one register against zero (Alpha, MIPS), or two registers only for equality (MIPS). This design was motivated by the observation that a combined compare-and-branch instruction fits into a regular pipeline, avoids additional condition code state or use of a temporary register, and reduces static code size and dynamic instruction fetch traffic.

...

Both conditional move and predicated instructions add complexity to out-of-order microarchitectures, adding an implicit third source operand due to the need to copy the original value of the destination architectural register into the renamed destination physical register if the predicate is false. Also, static compile-time decisions to use predication instead of branches can result in lower performance on inputs not included in the compiler training set, especially given that unpredictable branches are rare, and becoming rarer as branch prediction techniques improve.

We note that various microarchitectural techniques exist to dynamically convert unpredictable short forward branches into internally predicated code to avoid the cost of flushing pipelines on a branch mispredict [6, 10, 9] and have been implemented in commercial processors [17].

The simplest techniques just reduce the penalty of recovering from a mispredicted short forward branch by only flushing instructions in the branch shadow instead of the entire fetch pipeline, or by fetching instructions from both sides using wide instruction fetch or idle instruction fetch slots. More complex techniques for out-of-order cores add internal predicates on instructions in the branch shadow, with the internal predicate value written by the branch instruction, allowing the branch and following instructions to be executed speculatively and out-of-order with respect to other code [17].

[6] Timothy H. Heil and James E. Smith. Selective dual path execution. Technical report, University of Wisconsin - Madison, November 1996.

[9] Hyesoon Kim, Onur Mutlu, Jared Stark, and Yale N. Patt. Wish branches: Combining conditional branching and predication for adaptive predicated execution. In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 38, pages 43–54, 2005.

[10] A. Klauser, T. Austin, D. Grunwald, and B. Calder. Dynamic hammock predication for non-predicated instruction set architectures. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, PACT ’98, Washington, DC, USA, 1998.

[17] Balaram Sinharoy, R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. Van Norstrand, B. J. Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie, D. Q. Nguyen, B. Blaner, C. F. Marino, E. Retter, and P. Williams. IBM POWER7 multicore server processor. IBM Journal of Research and Development, 55(3):1–1, 2011.

Removing predicated instructions on 64-bit ARM freed four bits in the encoding of every instruction. This allowed adding one bit to each register field, thus doubling the number of registers.

In my opinion it is an error to omit elision ability in a Server Processor getting pinned to a Fabric, but tradeoffs are made. It is not a mistake (to have it, well implemented); it is expensive, but it's not a waste (the bits are smart and mind their own business). Conditionals were an easier/better choice.

It is like any CPU Extension, or adding a GPU; if you can make skillful use of your Tools then you're good to go, otherwise pack light.

Wikipedia Quote: "According to different benchmarks, TSX can provide around 40% faster applications execution in specific workloads, and 4–5 times more database transactions per second (TPS).".

It's 'costly' (for some situations) but important for the current style of programming, or more pessimistically a means to score far higher in synthetic Benchmarks.

Someday it will be as easy as Lego, and you will be able to ask 'it' to assemble itself and do your bidding; until then the Processor MUST support Programmer (and Compiler writer) laziness, hence the rarity of Programs that can run mostly on the GPU (but we are getting there).

That explains the removal of (great) features that are thought unwanted or were not implemented in a cost-effective and competitive manner.

Thus, TSX rules currently; but ARM CPUs need fancy Threads for their Fabric too.

URL References:

AMD: https://en.wikipedia.org/wiki/Advanced_Synchronization_Facility

Intel: https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions


Forkey answered 10/8, 2015 at 3:20 Comment(0)
-1

Just like the delay slot in MIPS was a trick (at the time), conditional execution in ARM is a trick (at the time), as is the PC being two instructions ahead. Now, down the road, how much effect do they have? Will ARM's branch predictor actually make that much difference, or is the real answer that they needed more bits in a 32-bit instruction word and, as with Thumb, the first and easiest thing to get rid of is the condition bits?

It is not too difficult to do some performance tests to see how good or bad the branch predictor really is. I tried it with unconditional branches on an ARM11; granted, that is an old architecture now, but it is still in wide use. It was difficult at best to get the branch prediction to show any improvement, and in no way, shape, or form could it compete with conditional execution. I have not repeated these experiments on anything in the Cortex-A family.

Metencephalon answered 4/3, 2014 at 16:16 Comment(3)
But there is the case that you can massage your code or data in a way that makes branch prediction better. You can design a better algorithm / code. You can help the compiler. You can sort your data so prediction doesn't face randomness. So all of these, compared to a design choice that is in the core of an ISA? I can sympathize with the designers. – Tow
I was doing this in assembly, so the compiler was not involved; I was giving the processor every opportunity. Basically a bunch of nops with the instruction under test. I first changed the alignment and the number of items in the loop to find the sweet spots for the conditional branch at the end and the entry point at the top of the loop (there were noticeable performance differences based on where those were), then adjusted where the instruction under test landed within the loop. I repeated a similar test with conditional execution. – Metencephalon
As Aki is perhaps saying, you don't have to look at too much compiled code to realize that compilers often cannot take advantage of conditional execution and have to branch anyway. 4 bits per instruction is very expensive real estate for something we may not be using. 3 bits per instruction instantly doubles the number of possible registers for three-operand instructions; I would much rather have twice as many registers than conditional execution. – Metencephalon
