How can I reconcile short conditional jumps with branch target alignments with `.align` in Delphi assembler?

I’m using Delphi 10.2 Tokyo, for both 32-bit and 64-bit targets, to write some functions entirely in assembly.

If I don’t use .align, the compiler correctly encodes short conditional jumps (2-byte instructions consisting of a 1-byte opcode, e.g. 074h for je, and a 1-byte signed relative offset in the range -080h to +07Fh). But as soon as I add a single .align, even one as small as .align 4, every conditional jump that is located before the .align and whose destination is located after it becomes a 6-byte instruction instead of the 2-byte one it should be. Only the instructions that are located after the .align remain correctly encoded as 2-byte short.

The Delphi assembler doesn’t accept the ‘short’ prefix.

How can I reconcile short conditional jumps with branch target alignments with .align in Delphi assembler?

Here is a sample procedure – please note that there is an .align in the middle.

    procedure Test; assembler;
    label
      label1, label2, label3;
    asm
      mov     al, 1
      cmp     al, 2
      je      label1
      je      label2
      je      label3
    label1:
      mov     al, 3
      cmp     al, 4
      je      label1
      je      label2
      je      label3
      mov     al, 5
      .align 4
    label2:
      cmp     al, 6
      je      label1
      je      label2
      je      label3
      mov     al, 7
      cmp     al, 8
      je      label1
      je      label2
      je      label3
    label3:
    end;

Here is how it is encoded: conditional jumps located before the align that point to label2 and label3 (after the align) are encoded as 6-byte instructions (this is a 64-bit CPU target):

    0041C354 B001          mov al,$01      //   mov     al, 1
    0041C356 3C02          cmp al,$02      //   cmp     al, 2
    0041C358 740C          jz $0041c366    //   je      label1
    0041C35A 0F841C000000  jz $0041c37c    //   je      label2
    0041C360 0F8426000000  jz $0041c38c    //   je      label3
    0041C366 B003          mov al,$03      // label1: mov al, 3
    0041C368 3C04          cmp al,$04      //   cmp     al, 4
    0041C36A 74FA          jz $0041c366    //   je      label1
    0041C36C 0F840A000000  jz $0041c37c    //   je      label2
    0041C372 0F8414000000  jz $0041c38c    //   je      label3
    0041C378 B005          mov al,$05      //   mov     al, 5
    0041C37A 8BC0          mov eax,eax     // <-- a 2-byte dummy instruction, inserted by ".align 4" (almost a 2-byte NOP)
    0041C37C 3C06          cmp al,$06      // label2: cmp al, 6
    0041C37E 74E6          jz $0041c366    //   je      label1
    0041C380 74FA          jz $0041c37c    //   je      label2
    0041C382 7408          jz $0041c38c    //   je      label3
    0041C384 B007          mov al,$07      //   mov     al, 7
    0041C386 3C08          cmp al,$08      //   cmp     al, 8
    0041C388 74DC          jz $0041c366    //   je      label1
    0041C38A 74F0          jz $0041c37c    //   je      label2
    0041C38C C3            ret             // label3:
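
As a sanity check on the listing above, the displacement bytes can be recomputed from the addresses: a jump's displacement is measured from the end of the jump instruction. The following is a minimal Python sketch, not anything Delphi provides; the helper name `encode_je` and its `force_near` flag are made up for illustration.

```python
import struct

def encode_je(addr, target, force_near=False):
    """Encode `je target` for an instruction placed at `addr`.

    The displacement is relative to the address of the NEXT instruction,
    i.e. addr + 2 for the short form and addr + 6 for the near form.
    """
    rel8 = target - (addr + 2)
    if not force_near and -128 <= rel8 <= 127:
        return bytes([0x74]) + struct.pack("<b", rel8)      # 74 cb
    rel32 = target - (addr + 6)
    return bytes([0x0F, 0x84]) + struct.pack("<i", rel32)   # 0F 84 cd

# First `je label1` and first `je label2` from the listing.
print(encode_je(0x0041C358, 0x0041C366).hex().upper())
print(encode_je(0x0041C35A, 0x0041C37C, force_near=True).hex().upper())
# Same target address, but letting the helper pick the short form.
print(encode_je(0x0041C35A, 0x0041C37C).hex().upper())
```

The first two calls reproduce the bytes shown in the listing. The third shows that, with the target address held fixed, a 1-byte displacement would have been enough for `je label2`; in a real layout the target address would also move once the jumps before it shrink.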

But if I remove the .align, all the instructions have the correct size, just 2 bytes as before:

    0041C354 B001          mov al,$01      //   mov     al, 1
    0041C356 3C02          cmp al,$02      //   cmp     al, 2
    0041C358 7404          jz $0041c35e    //   je      label1
    0041C35A 740E          jz $0041c36a    //   je      label2
    0041C35C 741C          jz $0041c37a    //   je      label3
    0041C35E B003          mov al,$03      // label1: mov al, 3
    0041C360 3C04          cmp al,$04      //   cmp     al, 4
    0041C362 74FA          jz $0041c35e    //   je      label1
    0041C364 7404          jz $0041c36a    //   je      label2
    0041C366 7412          jz $0041c37a    //   je      label3
    0041C368 B005          mov al,$05      //   mov     al, 5
    0041C36A 3C06          cmp al,$06      // (.align 4 removed) label2: cmp al, 6
    0041C36C 74F0          jz $0041c35e    //   je      label1
    0041C36E 74FA          jz $0041c36a    //   je      label2
    0041C370 7408          jz $0041c37a    //   je      label3
    0041C372 B007          mov al,$07      //   mov     al, 7
    0041C374 3C08          cmp al,$08      //   cmp     al, 8
    0041C376 74E6          jz $0041c35e    //   je      label1
    0041C378 74F0          jz $0041c36a    //   je      label2
    0041C37A C3            ret             //   je      label3
                                           // label3:

Back to conditional jumps instructions: how can I reconcile short conditional jumps with branch target alignments with .align in Delphi assembler?

I acknowledge that the benefit of aligning branch targets on processors like Skylake and later is slim, and I understand that I can just refrain from using .align, which would also reduce the code size. But I want to know how I can use the Delphi assembler to generate short jumps together with .align. This problem occurs on the 32-bit target as well, not only on the 64-bit one.

Railing answered 14/7, 2017 at 21:49 Comment(15)
Relaxation interacts with alignment in an annoying way, so I can see how this might happen in a simple implementation, but if that's why this happens there's probably no fix (except by changing the assembler)Joellajoelle
Inconsiderate usage of decor..., which amount to about 900kb - think of mobile. Perhaps you can substitute them with some charts, with a limited number of colors they should come up small. Or with some "lorem ipsum" if it's that you feel that your post is too short for your taste.Deeprooted
@Sertac Akyz - thank you for pointing that out, I'm sorry about that.Railing
I see that only forward branches get the long addresses. As @harold said, probably nothing you can do about it. Changing the assembler is not so easy, but possible.Rolan
@RudyVelthuis - at least we will know about this issue and would not use .align at all with that assembler.Railing
Terminology: Intel's manual calls jcc rel8 a "short" jump, and jcc rel32 a "near" jump. Both of them are near jumps, as opposed to a far jump to a different code segment. So "short" means "near with compact encoding". The online HTML versions get messy after the first page of the table :(Beefburger
@PeterCordes - thank you, I've just messed up with the terms, of course I've meant short, not near (since near was only relevant for 16-bit code, not for 32-bit or 64-bit code).Railing
It sounds like your assembler is way overcautious about jumping across .align directives. Maybe it's a 1-pass assembler that can't go back and prove that the branch distances are all short? The labels are local, right, so it can't be worried about the linker needing to fill in a different address. Otherwise that would be a problem all the time, not just with .align.Beefburger
@PeterCordes It is a multi-pass assembler, since it correctly chooses short and long conditional jumps where appropriate. Probably there is either a bug, or they think that if a programmer has ever used .align, size is no longer an issue, so they use the large versions. Delphi is a mega-ultra-fast compiler and they might have sacrificed code quality to keep compilation fast. Maybe they did not consider that short jumps are a part of the branch prediction mechanism.Railing
I do use .align with that assembler. That you get a few long forward branches shouldn't matter a lot. Most branches that matter (e.g. in loops) are backward anyway, and there it works.Rolan
@PeterCordes You wrote (quote): "Terminology: Intel's manual calls jcc rel8 a "short" jump, and jcc rel32 a "near" jump. Both of them are near jumps, as opposed to a far jump to a different code segment". Did you mean that all near jumps (both "short", 1-byte relative and longer (2-byte and 4-byte relative)) take benefit of branch prediction, as opposing to a "far" jump to a different segment?Railing
@PeterCordes There is, however, a note in the <agner.org/optimize/microarchitecture.pdf> see “Close jumps on PMMX” (quote): “If two control transfer instructions are so close together that they differ only in bits 0-1 of the address, then we have the problem of a shared BTB entry.[…] There are various ways to solve this problem:[...]2. Change a short jump to a near jump (with 4 bytes displacement) so that the end of the instruction is moved further down[...] 3. Put in some instruction between the two control transfer instructions.”Railing
@PeterCordes Therefore, maybe Delphi compiler decides, that if it is asked to „Put in some instruction between the two control transfer instructions” by the „.align” assembly directive, it will also „Change a short jump to a near jump (with 4 bytes displacement)”?Railing
@MaximMasiutin: Yes, both kinds of near jumps (rel8 and rel32) are common in real programs, and prediction works for them. (And rel16 in 16-bit code). I don't know how far jmp executes; it's irrelevant for performance because they're basically never used. (Except on WOW64, apparently, where 32-bit DLLs call into 64-bit code instead of having the kernel support an alternate 32-bit sysenter ABI like Linux does.) I'd guess that far jumps aren't predicted, but it's also possible that the CPU optimistically assumes that there's no call-gate or whatever.Beefburger
@MaximMasiutin: Did you test your hypothesis about tuning for P5 branch prediction? I can think of a few ways: 1) check if you still get rel32 when there already are other instructions between forward jumps across a .align. 2) Check if it chooses to use rel32 for any other cases when it doesn't have to, like backward, or not across a .align. I think you'll find that it's just forward branches across .align that get rel32 when they don't need it, which definitely doesn't sound intentional the way GCC's similar rep ret tuning for AMD K10 was.Beefburger

Unless your assembler has an option to do better branch-displacement optimization (which might take repeated passes), you're probably out of luck. (Of course you could manually do all the alignment yourself, but that has to be re-done every time you change anything.)

Or you could use a different assembler. But as I expected, that's highly undesirable because you lose access to Delphi-specific stuff like object layout for things declared outside of the asm. (Thanks @Rudy for the comment.)

It's possible that you could write some of your function in Delphi assembler and do as much as possible of the Delphi-specific stuff there. Then write the critical loop in another assembler such as NASM, and paste a hexdump of its machine-code output into a db pseudo-instruction in the middle of your Delphi assembly.

This could work ok if the start of every function is at least as aligned as anything inside a function, but you'd probably end up wasting instructions or putting constants into registers for use by the NASM part, which would probably be worse than just having longer branches.


Only the instructions that are located after the .align remain correctly encoded as 2-byte short

That isn't quite accurate. The first je label1 looks ok, and it's before the .align.

It looks like any branch that goes forward across a not-yet-evaluated .align directive leaves room for a rel32, and the assembler never comes back and fixes it. Every other case seems fine: backward branches across a .align, and forward branches that don't cross a .align.
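
That guess can be checked against the listing in the question with a rough Python model. This is only a sketch of the observed behaviour, not Delphi's real algorithm: the item encoding and the names `PROGRAM`, `guessed_je_sizes` and `label_addresses` are invented here, and it assumes every branch that doesn't cross a later .align fits in rel8 (true for this sample). The final `je label3` from the source does not appear in the disassembly, so it is left out of the model too.

```python
# Item kinds: ("op", size_in_bytes), ("je", target_label),
#             ("label", name), ("align", boundary).
PROGRAM = [
    ("op", 2), ("op", 2),                                   # mov al,1 / cmp al,2
    ("je", "label1"), ("je", "label2"), ("je", "label3"),
    ("label", "label1"),
    ("op", 2), ("op", 2),                                   # mov al,3 / cmp al,4
    ("je", "label1"), ("je", "label2"), ("je", "label3"),
    ("op", 2),                                              # mov al,5
    ("align", 4),
    ("label", "label2"),
    ("op", 2),                                              # cmp al,6
    ("je", "label1"), ("je", "label2"), ("je", "label3"),
    ("op", 2), ("op", 2),                                   # mov al,7 / cmp al,8
    ("je", "label1"), ("je", "label2"),
    ("label", "label3"),
    ("op", 1),                                              # ret
]

def guessed_je_sizes(program):
    """6 bytes for a forward je with an .align between it and its target, else 2."""
    index_of = {arg: i for i, (kind, arg) in enumerate(program) if kind == "label"}
    sizes = {}
    for i, (kind, arg) in enumerate(program):
        if kind == "je":
            # The slice is empty for backward branches, so they stay short.
            between = program[i + 1:index_of[arg]]
            crosses_align = any(k == "align" for k, _ in between)
            sizes[i] = 6 if crosses_align else 2
    return sizes

def label_addresses(program, base, je_sizes):
    """Lay the items out from `base` and return the address of each label."""
    addr, labels = base, {}
    for i, (kind, arg) in enumerate(program):
        if kind == "label":
            labels[arg] = addr
        elif kind == "align":
            addr += -addr % arg          # padding up to the next multiple of arg
        elif kind == "je":
            addr += je_sizes[i]
        else:
            addr += arg
    return labels

LABELS = label_addresses(PROGRAM, 0x0041C354, guessed_je_sizes(PROGRAM))
for name, addr in LABELS.items():
    print(f"{name}: ${addr:08X}")
```

With the start address from the question, the model puts label1, label2 and label3 at $0041C366, $0041C37C and $0041C38C, the same addresses as in the disassembly, which is at least consistent with the guess.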


Branch-displacement optimization is not an easy problem, especially when there are .align directives. This appears to be a really sub-optimal implementation, though.

Related: Why is the "start small" algorithm for branch displacement not optimal? for more about the algorithms assemblers use for branch-displacement optimization. Even good assemblers probably don't make optimal choices, especially when there are .align directives.
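
For illustration, here is a minimal Python sketch of a "start small" pass over a toy instruction list that includes an alignment directive. The item encoding and the names `relax` and `PROGRAM` are invented for this example, and it says nothing about how Delphi or any particular assembler is actually implemented.

```python
# Item kinds: ("code", size_in_bytes), ("je", target_label),
#             ("label", name), ("align", boundary).
def relax(program, base=0):
    """Start every je at 2 bytes (rel8); grow to 6 bytes (rel32) only when needed."""
    sizes = {i: 2 for i, (kind, _) in enumerate(program) if kind == "je"}
    while True:
        # Lay everything out with the current size choices.
        addr, labels, at = base, {}, {}
        for i, (kind, arg) in enumerate(program):
            at[i] = addr
            if kind == "label":
                labels[arg] = addr
            elif kind == "align":
                addr += -addr % arg      # padding depends on everything before it
            elif kind == "je":
                addr += sizes[i]
            else:
                addr += arg
        # Grow any short branch whose target is out of rel8 range.
        grown = False
        for i, size in list(sizes.items()):
            rel8 = labels[program[i][1]] - (at[i] + 2)
            if size == 2 and not -128 <= rel8 <= 127:
                sizes[i] = 6
                grown = True
        if not grown:                    # sizes only ever grow, so this terminates
            return sizes, labels

PROGRAM = [
    ("je", "near"), ("je", "far"),
    ("code", 2),
    ("align", 4),
    ("label", "near"),
    ("code", 130),                       # stand-in for a run of instructions
    ("label", "far"),
    ("code", 1),
]

SIZES, LABELS = relax(PROGRAM)
print(SIZES, LABELS)
```

Here only `je far` grows to 6 bytes, and both labels move when it does. Because a size change shifts every later address and can change the amount of padding an alignment directive needs, the layout has to be recomputed after each change, which is where the extra passes come from.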

Beefburger answered 15/7, 2017 at 19:47 Comment(7)
Did I understand correctly that only the instructions that effectively modify 32-bit registers under 64-bit mode also clear the highest bits 63-32, thus the instruction mov eax,eax inserted here by Delphi assembler would not clear the highest bits?Railing
@MaximMasiutin: No, mov eax,eax zero-extends into rax. That's really bad behaviour for .align! There are no special-cases where writing a register with operand-size=32 doesn't actually zero the upper 32.Beefburger
@MaximMasiutin: That's why 0x90 NOP needs its own insn set ref entry, because xchg eax,eax is no longer a NOP in x86-64. (The entry for xchg doesn't mention that 0x90 no longer even technically encodes xchg eax,eax in long mode, but it's definitely still a NOP.)Beefburger
@PeterCordes: Delphi can use object files produced with other assemblers, like TASM, MASM or NASM (and probably FASM etc. too, never tried). So one could use another assembler, but that would mean doing without many of the features that make the Delphi assembler so useful, e.g. directives that know the VMT index of virtual methods, that know the size of structs and offset of struct members, that can import stuff from other Delphi modules, that can access private runtime routines, etc., i.e. doing without many Delphi-specific things. If this is the only problem...Rolan
... that is not enough to make me switch to, say, NASM. I did one day translate one of my built-in assembler-heavy sources to NASM, but it was tedious. Types, structs etc. had to be re-declared, VMTOFFSET and other directives were missing, extern identifiers had to be re-declared, etc.Rolan
@PeterCordes: .ALIGN inserts code that is specific for the bitness, so it would probably not insert mov eax,eax in 64 bit code (if that were a different size or had some side effects). It would use something different (if necessary, a second NOP) and harmless.Rolan
@RudyVelthuis: Thanks for the update. Maxim seems to be claiming that the asm in the question is for a 64-bit target, but it seems unlikely that nobody had noticed .align clobbering RAX before now. Easy to test, though. mov rax, -1 / nop / .align something / ret with more or fewer NOPs to test all padding widths. Anyway, updated the NASM suggestion since in some cases it might be possible to write the inner loop part in NASM after setting up registers in Delphi. But it still probably sucks more than a couple longer branches most of the time.Beefburger
