printf gets stuck in an infinite loop with AL = 10 on x86-64 Linux with older gcc

Asked 5/5, 2020 at 20:26 Answered 6/5, 2020 at 0:33

Solved gcc assembly x86-64 calling-convention abi

Very simple introductory assembly code.
It seems to compile fine with gcc -o prog1 prog1.s, but ./prog1 just skips a line and shows nothing, as if waiting for input the code doesn't ask for. What's wrong?
I'm using gcc (Debian 4.7.2-5) 4.7.2 on 64-bit gNewSense running in VMware. Code:

/*
int nums[] = {10, -21, -30, 45};
int main() {
  int i, *p;
  for (i = 0, p = nums; i != 4; i++, p++)
    printf("%d\n", *p);
  return 0;
}
*/

.data
nums:  .int  10, -21, -30, 45
Sf:  .string "%d\n"    # format string for printf

.text
.globl  main
main:

/********************************************************/
/* keep this section here and don't touch it - prologue !!! */
  pushq   %rbp
  movq    %rsp, %rbp
  subq    $16, %rsp
  movq    %rbx, -8(%rbp)
  movq    %r12, -16(%rbp)
/********************************************************/

  movl  $0, %ebx  /* ebx = 0; */
  movq  $nums, %r12  /* r12 = &nums */

L1:
  cmpl  $4, %ebx  /* if (ebx == 4) ? */
  je  L2          /* goto L2 */

  movl  (%r12), %eax    /* eax = *r12 */

/*************************************************************/
/* this section prints the value of %eax (clobbers %eax)  */
  movq    $Sf, %rdi    /* first parameter  (pointer) */
  movl    %eax, %esi   /* second parameter (integer) */
  call  printf       /* calls the library function */
/*************************************************************/

  addl  $1, %ebx  /* ebx += 1; */
  addq  $4, %r12  /* r12 += 4; */
  jmp  L1         /* goto L1; */

L2:  
/***************************************************************/
/* keep this section here and don't touch it - epilogue!!!!    */
  movq  $0, %rax  /* rax = 0  (return value) */
  movq  -8(%rbp), %rbx
  movq  -16(%rbp), %r12
  leave
  ret      
/***************************************************************/

Drier answered 5/5, 2020 at 20:26 Comment(19)

It would make things a great deal easier if you could translate the comments to English and explain what sort of output you expect (I suppose the same output as the C program you listed above). – Tobi 5/5, 2020 at 20:36

For me, it works like the C code in your comment does. Are you sure you're compiling and running what you think you are? – Freeway 5/5, 2020 at 20:38

@Tobi You edited right. The Portuguese comments are basic explanations/don't change this. – Drier 5/5, 2020 at 20:45

@JosephSible-ReinstateMonica Yes I am, as the commands indicate. – Drier 5/5, 2020 at 20:47

@Drier You should double-check that. As it stands, your problem isn't reproducible. – Freeway 5/5, 2020 at 20:49

@JosephSible-ReinstateMonica I've been cross-checking that for hours. That's how I finally gave up and came here. – Drier 5/5, 2020 at 20:52

If you compile and run your C code, does it work the way you expect? If not, then that points to some problem with your system. – Freeway 5/5, 2020 at 21:0

Yes, I ran C on it the entire month; this problem only happened just now with assembly. – Drier 5/5, 2020 at 21:2

You should zero %al before call printf as you don't use any SSE registers for arguments. Still, that is unlikely to cause this problem. You could try running the program through strace or of course use a debugger. – Tobitobiah 5/5, 2020 at 21:21

@Tobitobiah after gcc -Wall -g prog1.s, gdb a.out, layout next, run + ^C: 0x00007ffff7a9e1d0 <printf+64> jmpq *%rax highlighted. In regular terminal: Program received signal SIGINT, Interrupt. 0x00007ffff7a9e1d0 in printf () from /lib/x86_64-linux-gnu/libc.so.6 Now what? – Drier 5/5, 2020 at 22:29

That is very interesting. What is p/a $rax? If that points back to itself for whatever reason, then it would be an endless loop. – Tobitobiah 5/5, 2020 at 22:33

An infinite loop is precisely what I suspect. Sorry, I don't know what you mean by p/a, but %rax is where the '0' return value of the main function is stored. If $rax refers to the memory address associated with it, I SUPPOSE it's the one mentioned above. Btw, I ran other, slightly different assembly code and it's all good with the new one. – Drier 5/5, 2020 at 22:50

I meant in gdb, when you are stopped at the jmpq, do a p/a $rax to see the value. – Tobitobiah 5/5, 2020 at 22:53

Program received signal SIGINT, Interrupt. 0x00007ffff7a9e1d0 in printf () from /lib/x86_64-linux-gnu/libc.so.6 (gdb) p/a $rax p/a $rax $1 = 0x7ffff7a9e1ca <printf+58> – Drier 5/5, 2020 at 23:2

Ahha yeah, that's pointing to just before the jmp so it's an endless loop. Very strange. – Tobitobiah 5/5, 2020 at 23:6

Yeah... and just rolled smooth and peachy in onlineGDB right now. Guess we have a OS or VM stranger thing here. Not my thing at the moment, but thank you very much for the inputs anyhow. Learned some indirectly. – Drier 5/5, 2020 at 23:37

Wait, I just tried it in a gNewSense 4 VM, and I can reproduce the problem there. I may just be able to figure this out after all. – Freeway 5/5, 2020 at 23:40

@joseph Was about to redirect the answer but, yeah great. – Drier 5/5, 2020 at 23:41

@Tobitobiah was right about needing to zero %al. Do that and it works. Full answer and explanation coming shortly. – Freeway 6/5, 2020 at 0:3

tl;dr: do xorl %eax, %eax before call printf.

printf is a varargs function. Here's what the System V AMD64 ABI has to say about varargs functions:

For calls that may call functions that use varargs or stdargs (prototype-less calls or calls to functions containing ellipsis (. . . ) in the declaration) %al¹⁸ is used as hidden argument to specify the number of vector registers used. The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.

You broke that rule. The first time your code calls printf, %al is 10 (the low byte of %eax, which still holds the first element of nums), which is more than the upper bound of 8. On your gNewSense system, here's a disassembly of the beginning of printf:

printf:
   sub    $0xd8,%rsp
   movzbl %al,%eax                # rax = al;
   mov    %rdx,0x30(%rsp)
   lea    0x0(,%rax,4),%rdx       # rdx = rax * 4;
   lea    after_movaps(%rip),%rax # rax = &&after_movaps;
   mov    %rsi,0x28(%rsp)
   mov    %rcx,0x38(%rsp)
   mov    %rdi,%rsi
   sub    %rdx,%rax               # rax -= rdx;
   lea    0xcf(%rsp),%rdx
   mov    %r8,0x40(%rsp)
   mov    %r9,0x48(%rsp)
   jmpq   *%rax                   # goto *rax;
   movaps %xmm7,-0xf(%rdx)
   movaps %xmm6,-0x1f(%rdx)
   movaps %xmm5,-0x2f(%rdx)
   movaps %xmm4,-0x3f(%rdx)
   movaps %xmm3,-0x4f(%rdx)
   movaps %xmm2,-0x5f(%rdx)
   movaps %xmm1,-0x6f(%rdx)
   movaps %xmm0,-0x7f(%rdx)
after_movaps:
   # nothing past here is relevant for your problem

A quasi-C translation of the important bits is goto *(&&after_movaps - al * 4); (see Labels as Values). For efficiency, gcc and/or glibc didn't want to save more vector registers than you used, and also didn't want to do a bunch of conditional branches. Each instruction that saves a vector register is 4 bytes long, so the code takes the address just past the last register-saving instruction, subtracts al * 4 bytes, and jumps there. That executes just enough of the instructions. Since your al was 10, more than the maximum of 8, it jumped too far back and landed before the jump instruction it had just taken, creating an infinite loop.

As for why it's not reproducible on modern systems, here's a disassembly of the beginning of their printf:

printf:
   sub    $0xd8,%rsp
   mov    %rdi,%r10
   mov    %rsi,0x28(%rsp)
   mov    %rdx,0x30(%rsp)
   mov    %rcx,0x38(%rsp)
   mov    %r8,0x40(%rsp)
   mov    %r9,0x48(%rsp)
   test   %al,%al          # if(!al)
   je     after_movaps     # goto after_movaps;
   movaps %xmm0,0x50(%rsp)
   movaps %xmm1,0x60(%rsp)
   movaps %xmm2,0x70(%rsp)
   movaps %xmm3,0x80(%rsp)
   movaps %xmm4,0x90(%rsp)
   movaps %xmm5,0xa0(%rsp)
   movaps %xmm6,0xb0(%rsp)
   movaps %xmm7,0xc0(%rsp)
after_movaps:
   # nothing past here is relevant for your problem

A quasi-C translation of the important bits is if(!al) goto after_movaps;. Why did this change? ~~My guess is Spectre. The mitigations for Spectre make indirect jumps really slow, so it's no longer worth doing that trick.~~ Or not; see comments. Instead, they do a much simpler check: if any vector registers are in use, save them all. With this code, your bad value of al isn't a disaster, since it just means the vector registers get copied unnecessarily.

Freeway answered 6/5, 2020 at 0:33 Comment(11)

The mitigations for Spectre make indirect jumps really slow - only slow if you armor them with lfence or something, which GCC doesn't do in general by default. I think this change predated Spectre; probably just because indirect branches are harder to predict, and FP printf is rare enough that dumping extra registers when you have one FP arg doesn't have much cost. (Especially on modern CPUs with good OoO exec and large store buffers.) Interesting discovery; I didn't know gcc variadic code-gen ever did anything other than check AL!=0. – Sanalda 6/5, 2020 at 5:18

Another effect of this is that a bogus AL can't crash by jumping too far. So it's more robust against buggy hand-written code. IDK if that was any motivation at all. It also saves instructions in the no-FP fast path, just test %al,%al / jz instead of multiple ALU instructions to calculate a jump target. Seems like a good change to me regardless of Spectre. – Sanalda 6/5, 2020 at 5:21

The TL;DR line worked indeed. An interesting follow-up: in the slightly different program onlinegdb.com/r1Yd5py9I, when the value to be printed is greater than 8 (by adding 5 to any of the summed values), it fails with an invalid operation instead of an infinite loop this time. I wonder why. – Drier 6/5, 2020 at 5:34

@Drier Since the problem is it's jumping wildly, with values other than 10, it's probably ending up jumping to halfway inside of some instruction that doesn't happen to be some other valid instruction, and is thus getting SIGILL Illegal Instruction. – Freeway 6/5, 2020 at 5:36

So, to wrap it up, we have a gNewSense issue here? Because in onlineGDB and in my colleagues/teacher Fedora it works just fine. – Drier 6/5, 2020 at 5:40

@Drier No, it's not an issue with gNewSense. It was an issue with your code. Your code broke one of the rules of the ABI, and it just so happens that newer systems are more lenient about the rule you broke than older ones are (i.e., on newer systems it's just slightly slower instead of completely broken). – Freeway 6/5, 2020 at 5:42

@Ajna: It's not rare for buggy asm code to work by accident / happen to work. Other ABI violations like modifying a call-preserved register also often don't cause a problem with simple callers, but will break other code. Throwing code at the wall and seeing what sticks works even less well in asm than in other languages. Don't depend on trial and error. (Although it can find things that definitely don't work, e.g. like here where it breaks on one test system.) – Sanalda 6/5, 2020 at 10:39

@PeterCordes I don't believe code from a top 3 national and top 1 private computer science college would be throwing code at the wall or depending on trial and error, but ok, noted. – Drier 6/5, 2020 at 14:13

@Ajna: Is that where the ABI-violating code in the question was from? You didn't say that until now, but I guess that explains why you kept thinking it must be a bug in gNewSense even after the bug in that code was explained. Bugs do happen by accident even when you know what you're doing and just forget something. For the same reasons intentional trial and error is unsafe, it's easy to miss such bugs when testing on systems where it happens to work. Often a good idea to start with or compare against C compiler output; compilers don't make mistakes in following the calling convention. – Sanalda 6/5, 2020 at 14:31

@PeterCordes fun fact: the code in the exercise following this one has a new 'movl $0, %eax' right before call printf :P – Drier 9/5, 2020 at 23:35

@JosephSible-ReinstateMonica: Re: efficiency advantages of the test/jz way over the computed-jump way: I wrote a big footnote about that in an answer to Why does printf still work with RAX lower than the number of FP args in XMM registers?. Not exactly a duplicate, but the answer basically has to explain the same details. – Sanalda 24/4, 2022 at 4:4
