Why is the "start small" algorithm for branch displacement not optimal?

Asked 20/1, 2016 at 21:40 Answered 22/1, 2016 at 6:19

Solved algorithm assembly optimization encoding

[question in bold at the bottom]

When an assembler generates a binary encoding it needs to decide whether to make each branch long or short, short being better, if possible. This part of the assembler is called the branch displacement optimization (BDO) algorithm. A typical approach is that the assembler makes all the branch encodings short (if they are less than some threshold), then iteratively increases any branches jump to longs that do not reach. This, of course, can cause other branches to be converted to long jumps. So, the assembler has to keep passing through the jump list until no more upsizing is required. This quadratic time approach would seem to be an optimal algorithm to me, but supposedly BDO is NP-complete and this approach is not actually optimal.

Randall Hyde provided a counter-example:

                  .386 
                  .model  flat, syscall
00000000          .code
00000000          _HLAMain        proc


00000000  E9 00000016          jmpLbl:         jmp     [near ptr] target
00000005 = 00000005            jmpSize         =       $-jmpLbl
00000005  00000016 [                           byte    32 - jmpSize*2
dup
(0)
            00
           ]
0000001B                       target:
0000001B                       _HLAMain        endp
                                                end

By adding the part in brackets "[near ptr]" and forcing a 5-byte encoding the binary actually ends up being shorter because the allocated array is smaller by double the amount of the jump size. Thus, by making a jump encoding shorter, the final code is actually longer.

This seems like an extremely pathological case to me, and not really relevant because the branch encodings are still smaller, its just this bizarre side-effect on a non-branch part of the program, that causes the binary to become larger. Since the branch encodings themselves are still smaller, I don't really consider that a valid counter-example to the "start small" algorithm.

Can I consider the start-small algorithm an optimal BDO algorithm or does there exist a realistic case in which it does not provide a minimal encoding size for all branches?

Stacy answered 20/1, 2016 at 21:40 Comment(12)

Do you want to considered aligned blocks as part of the problem? – Fiddlestick 20/1, 2016 at 21:56

@harold No. For the purpose of the question I am curious if I will be producing the minimum instruction codings with start-small. I consider alignment issues and stuff like that to be irrelevant. I mean if I end up generating extra 0-bytes as padding, I don't consider that a violation of optimality. I would rather have shorter jump instructions and a lot of padding, than vice versa. – Stacy 20/1, 2016 at 22:7

But the presence of alignment can affect the size of the jumps. Anyhow, in order to trigger the NP-completeness, you need "anomalous jumps" - jumps that first must be long and then can become short when other jump(s) turn out to be long. If you consider only jumps and boring instructions, then this cannot happen. – Fiddlestick 20/1, 2016 at 22:10

Having some alignment somewhere can mean that short jump going backwards across the "filler" could "make it" only after some jump before their target is expanded, so expanding some jump that doesn't need it allows other jumps to be small. It's not too crazy either, loops are often aligned. – Fiddlestick 20/1, 2016 at 22:14

@harold: It would be nifty if there was a way to ask the assembler to align by using longer encodings for previous instructions (e.g. 4B displacements in addressing modes or immediates, or unnecessary / repeated prefixes), instead of inserting NOPs. This would be most useful for aligning loops, and other cases where the NOPs would actually run. – 3d 20/1, 2016 at 23:36

Even if this problem is formally NP-complete, wouldn't the routine being optimized have to be unrealistically large for that to be a significant concern? – Gridiron 21/1, 2016 at 2:47

Off topic, but maybe interesting: The naive algorithm doesn't need to be quadratic. It's actually a least-fixed-point problem, and can be solved in linear time provided the number of times an instance is added to the candidate set is O(1). In this case, an instance is a short branch and the candidate set is the set of short branches which might have become long as the result of a change to another branch. The O(1) constraint is met because the range of effect of a change to a branch statement is limited to the range of a short branch, and only a finite number of branches are in that range. – Heteropterous 21/1, 2016 at 15:46

@rici: That sounds interesting, could you elaborate a bit further? Thanks :) – Justitia 24/1, 2016 at 11:56

@j_random_hacker: changing a short branch to a long branch can only affect a fixed maximum number of other short branches. The key is to identify these in O(1), which could be done, for example, by linking all the branches together in order, and also linking each branch to the earliest preceding branch within the short branch limit. Then when you change a branch from short to long, you just need to traverse the list of candidates and see which ones might be affected in order to add them back to the worklist. – Heteropterous 24/1, 2016 at 20:58

@rici: Thanks, I see how those two lists could be built in O(1) time per jump by walking two pointers forward. Your mention of "least fixed point problem" suggests this is an example of a general class of problems that can be solved the same way, but the Wikipedia page wasn't too helpful... Know of any helpful links in that direction? – Justitia 24/1, 2016 at 23:48

@j_random_hacker: The classic LFP algorithm is to start with an approximation which is known to be too small, and incrementally add to it until nothing changes (or start big, and subtract, which occasionally works out better). The key is the datastructures: if you can handle an addition in constant time (including identifying candidates for future addition), then the algorithm is linear time. (In another class of solutions, candidate identification is log time and the final algorithm is n log n.) See partition refinement, for example. – Heteropterous 26/1, 2016 at 16:39

@rici: Interesting, thanks! – Justitia 27/1, 2016 at 3:48

Here's a proof that, in the absence of the anomalous jumps mentioned by harold in the comments, the "start small" algorithm is optimal:

First, let's establish that "start small" always produces a feasible solution -- that is, one that doesn't contain any short encoding of a too-long jump. The algorithm essentially amounts to repeatedly asking the question "Is it feasible yet?" and lengthening some jump encoding if not, so clearly if it terminates, then the solution it produces must be feasible. Since each iteration lengthens some jump, and no jump is ever lengthened more than once, this algorithm must eventually terminate after at most nJumps iterations, so the solution must be feasible.

Now suppose to the contrary that the algorithm can produce a suboptimal solution X. Let Y be some optimal solution. We can represent a solution as the subset of jump instructions that are lengthened. We know that |X \ Y| >= 1 -- that is, that there is at least 1 instruction lengthening in X that is not also in Y -- because otherwise X would be a subset of Y, and since Y is optimal by assumption and X is known to be feasible, it would follow that X = Y, meaning that X would itself be an optimal solution, which would contradict our original assumption about X.

From among the instructions in X \ Y, choose i to be the one that was lengthened first by the "start small" algorithm, and let Z be the subset of Y (and of X) consisting of all instructions already lengthened by the algorithm prior to this time. Since the "start small" algorithm decided to lengthen i's encoding, it must have been the case that by that point in time (i.e., after lengthening all the instructions in Z), i's jump displacement was too big for a short encoding. (Note that while some of the lengthenings in Z may have pushed i's jump displacement past the critical point, this is by no means necessary -- maybe i's displacement was above the threshold from the beginning. All we can know, and all we need to know, is that i's jump displacement was above the threshold by the time Z had finished being processed.) But now look back at the optimal solution Y, and note that none of the other lengthenings in Y -- i.e., in Y \ Z -- are able to reduce i's jump displacement back down, so, since i's displacement is above the threshold but its encoding is not lengthened by Y, Y is not even feasible! An infeasible solution cannot be optimal, so the existence of such a non-lengthened instruction i in Y would contradict the assumption that Y is optimal -- meaning that no such i can exist.

Justitia answered 21/1, 2016 at 3:48 Comment(0)

j_random_hacker's argument that Start Small is optimal for the simplified case where there is no padding sounds reasonable. However, it's not very useful outside optimized-for-size functions. Real asm does have ALIGN directives, and it does make a difference.

Here's the simplest example I could construct of a case where Start Small doesn't give an optimal result (tested with NASM and YASM). Use jz near .target0 to force a long encoding, moving another_function: 32 bytes earlier and reducing padding within func.

func:
.target0:               ; anywhere nearby
    jz  .target0        ; (B0)  short encoding is easily possible
.target1:
   times 10 vpermilps xmm14, xmm15, [rdi+12345]
        ; A long B0 doesn't push this past a 32B boundary, so short or long B0 doesn't matter
ALIGN 32
.loop:
   times 12 xor r15d,r15d
    jz  .target1         ; (B1)  short encoding only possible if B0 is long
   times 18 xor r15d,r15d
    ret   ; A long B1 does push this just past a 32B boundary.

ALIGN 32
another_function:
    xor  eax,eax
    ret

If B0 is short, then B1 has to be long to reach target1.
If B0 is long, it pushes target1 closer to B1, allowing a short encoding to reach.

So at most one of B0 and B1 can have a short encoding, but it matters which one is short. A short B0 means 3 more bytes of alignment padding, with no saving in code-size. A long B0 allowing a short B1 does save total code size. In my example, I've illustrated the simplest way that can happen: by pushing the end of the code after B1 past the boundary of the next alignment. It could also affect other branches, e.g. requiring a long encoding for a branch to .loop.

Desired: B0 long, B1 short.
Start-Small result: B0 short, B1 long. (Their initial first-pass states.) Start-Small doesn't try lengthening B0 and shortening B1 to see if it reduces total padding, or just padding that gets executed (ideally weighted by trip count).

A 4-byte NOP before .loop, and 31 bytes of NOPs before another_func, so it starts at 0x400160 instead of the 0x400140 that we get from using jz near .target0 which leads to a short encoding for B1.

Note that a long encoding for B0 itself is not the only way to achieve a short encoding for B1. A longer-than-necessary encoding for any of the instructions before .target1 could also do the trick. (e.g. a 4B displacement or immediate, instead of a 1B. Or an unnecessary or repeated prefix.)

Unfortunately, no assembler I know of supports padding this way; only with nop. What methods can be used to efficiently extend instruction length on modern x86?

Often, there isn't even a jump over the long-NOP at the start of a loop, so more padding is potentially worse for performance (if multiple NOPs are needed, or the code runs on a CPU like Atom or Silvermont that is really slow with lots of prefixes, which got used because the assembler wasn't tuning for Silvermont).

Note that compiler output rarely has jumps between functions (usually just for tail-call optimization). x86 doesn't have a short encoding for call. Hand-written asm can do whatever it wants, but spaghetti code is (hopefully?) still uncommon on a large scale.

I think it's likely that the BDO problem can be broken into multiple independent sub-problems for most asm source files, usually each function being a separate problem. This means that even non-polynomial-complexity algorithms may be viable.

Some shortcuts to help break up the problem will help: e.g. detect when a long encoding is definitely needed, even if all intervening branches use a short encoding. This will allow breaking dependencies between sub-problems when the only thing connecting them was a tail-call between two distant functions.

I'm not sure where to actually start making an algorithm to find a globally optimal solution. If we're willing to consider expanding other instructions to move branch targets, the search space is quite huge. However, I think we only have to consider branches that cross alignment-padding.

The possible cases are:

padding before a branch target for backward branches
padding before a branch instruction for forward branches

Doing a good job with this might be easier if we embed some knowledge of microarchitectural optimization into the assembler: e.g. always try to have branch targets start near the start of 16B insn fetch blocks, and definitely not right at the end. An Intel uop cache line can only cache uops from within one 32B block, so 32B boundaries are important for the uop cache. L1 I$ line size is 64B, and the page size is 4kiB. (The assembler won't know which code is hot and which is cold, though. Having hot code span two pages might be worse than slightly larger code-size.)

Having a multi-uop instruction at the beginning of an instruction decode group is also much better than having it anywhere else, for Intel and AMD. (Less so for Intel CPUs with a uop cache). Figuring out which path the CPU will take through the code most of the time, and where the instruction decode boundaries will be, is probably well beyond what an assembler can manage.

3d answered 22/1, 2016 at 6:19 Comment(2)

Another case where you can quickly decide whether a jump's encoding should be expanded or not is when, even if all other jumps between itself and its target address were expanded, its displacement would still be small enough to allow a small encoding. – Justitia 24/1, 2016 at 11:55

Update: Intel's recommended workaround for the the performance potholes caused by their microcode fix for the JCC erratum is to lengthen instructions instead of using NOPs, because it can involve padding inside inner-most loops. So finally assemblers like GAS know how to lengthen instructions, but they still don't do it for align directives :/ – 3d 20/5, 2023 at 1:16

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags