j_random_hacker's argument that Start Small is optimal for the simplified case where there is no padding sounds reasonable. However, it's not very useful outside optimized-for-size functions. Real asm does have ALIGN
directives, and it does make a difference.
Here's the simplest example I could construct of a case where Start Small doesn't give an optimal result (tested with NASM and YASM). Use jz near .target0
to force a long encoding, moving another_function:
32 bytes earlier and reducing padding within func
.
func:
.target0: ; anywhere nearby
jz .target0 ; (B0) short encoding is easily possible
.target1:
times 10 vpermilps xmm14, xmm15, [rdi+12345]
; A long B0 doesn't push this past a 32B boundary, so short or long B0 doesn't matter
ALIGN 32
.loop:
times 12 xor r15d,r15d
jz .target1 ; (B1) short encoding only possible if B0 is long
times 18 xor r15d,r15d
ret ; A long B1 does push this just past a 32B boundary.
ALIGN 32
another_function:
xor eax,eax
ret
If B0 is short, then B1 has to be long to reach target1.
If B0 is long, it pushes target1 closer to B1, allowing a short encoding to reach.
So at most one of B0 and B1 can have a short encoding, but it matters which one is short. A short B0 means 3 more bytes of alignment padding, with no saving in code-size. A long B0 allowing a short B1 does save total code size. In my example, I've illustrated the simplest way that can happen: by pushing the end of the code after B1 past the boundary of the next alignment. It could also affect other branches, e.g. requiring a long encoding for a branch to .loop
.
- Desired: B0 long, B1 short.
Start-Small result: B0 short, B1 long. (Their initial first-pass states.) Start-Small doesn't try lengthening B0 and shortening B1 to see if it reduces total padding, or just padding that gets executed (ideally weighted by trip count).
A 4-byte NOP before .loop
, and 31 bytes of NOPs before another_func
, so it starts at 0x400160
instead of the 0x400140
that we get from using jz near .target0
which leads to a short encoding for B1.
Note that a long encoding for B0 itself is not the only way to achieve a short encoding for B1. A longer-than-necessary encoding for any of the instructions before .target1
could also do the trick. (e.g. a 4B displacement or immediate, instead of a 1B. Or an unnecessary or repeated prefix.)
Unfortunately, no assembler I know of supports padding this way; only with nop
. What methods can be used to efficiently extend instruction length on modern x86?
Often, there isn't even a jump over the long-NOP
at the start of a loop, so more padding is potentially worse for performance (if multiple NOPs are needed, or the code runs on a CPU like Atom or Silvermont that is really slow with lots of prefixes, which got used because the assembler wasn't tuning for Silvermont).
Note that compiler output rarely has jumps between functions (usually just for tail-call optimization). x86 doesn't have a short encoding for call
. Hand-written asm can do whatever it wants, but spaghetti code is (hopefully?) still uncommon on a large scale.
I think it's likely that the BDO problem can be broken into multiple independent sub-problems for most asm source files, usually each function being a separate problem. This means that even non-polynomial-complexity algorithms may be viable.
Some shortcuts to help break up the problem will help: e.g. detect when a long encoding is definitely needed, even if all intervening branches use a short encoding. This will allow breaking dependencies between sub-problems when the only thing connecting them was a tail-call between two distant functions.
I'm not sure where to actually start making an algorithm to find a globally optimal solution. If we're willing to consider expanding other instructions to move branch targets, the search space is quite huge. However, I think we only have to consider branches that cross alignment-padding.
The possible cases are:
- padding before a branch target for backward branches
- padding before a branch instruction for forward branches
Doing a good job with this might be easier if we embed some knowledge of microarchitectural optimization into the assembler: e.g. always try to have branch targets start near the start of 16B insn fetch blocks, and definitely not right at the end. An Intel uop cache line can only cache uops from within one 32B block, so 32B boundaries are important for the uop cache. L1 I$ line size is 64B, and the page size is 4kiB. (The assembler won't know which code is hot and which is cold, though. Having hot code span two pages might be worse than slightly larger code-size.)
Having a multi-uop instruction at the beginning of an instruction decode group is also much better than having it anywhere else, for Intel and AMD. (Less so for Intel CPUs with a uop cache). Figuring out which path the CPU will take through the code most of the time, and where the instruction decode boundaries will be, is probably well beyond what an assembler can manage.