As a general rule, you cannot expect two different compilers to generate the same assembly code for the same input, even if they have the same version number; they could have any number of extra "patches" to their code generation. As long as the observable behavior is the same, anything goes.
You should also know that GCC, in its default -O0 mode, generates intentionally bad code. It's tuned for ease of debugging and speed of compilation, not for either clarity or efficiency of the generated code. It is often easier to understand the code generated by gcc -O1 than the code generated by gcc -O0.
You should also know that the main function often needs to do extra setup and teardown that other functions do not need to do. The instruction leal 4(%esp),%ecx is part of that extra setup. If you only want to understand the machine code corresponding to the code you wrote, and not the nitty-gritty details of the ABI, name your test function something other than main.
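A minimal sketch of such a test file follows. The file name, the function name call_add, and the choice of constants are made up for illustration; the noinline attribute is a GCC/Clang extension used here only so that the call stays visible when optimizing. The assembly you get will still depend on your compiler and its defaults.

```c
/* test.c (hypothetical file name).
 * Possible commands for comparing the generated assembly:
 *   gcc -m32 -O0 -S test.c -o test-O0.s
 *   gcc -m32 -O1 -S test.c -o test-O1.s
 *   gcc -m32 -O1 -fno-pie -S test.c -o test-nopie.s
 */

/* noinline keeps the call from being folded into its caller */
__attribute__((noinline))
int add(int a, int b)
{
    return a + b;
}

/* Deliberately not named main, so the output contains only an
 * ordinary function prologue and epilogue */
int call_add(void)
{
    return add(2, 3);
}
```

Because the file includes no headers, the -S commands should work even on a system without the 32-bit C library installed.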
(As pointed out in the comments, that setup code is not as tightly tuned as it could be, but it doesn't normally matter, because it's only executed once in the lifetime of the program.)
Now, to answer the question that was literally asked: call __x86.get_pc_thunk.ax appears because your compiler defaults to generating "position-independent" executables. Position-independent means the operating system can load the program's machine code at any address in (virtual) memory and it'll still work. This allows things like address space layout randomization, but to make it work, you have to take special steps to set up a "global pointer" at the beginning of every function that accesses global variables or calls another function (with some exceptions). It's actually easier to explain the code that's generated if you turn optimization on:
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
This is all just setting up main's stack frame and saving registers that need to be saved. You can ignore it.
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
The special function __x86.get_pc_thunk.bx loads its return address -- which is the address of the addl instruction that immediately follows -- into the EBX register. Then we add to that address the value of the magic constant _GLOBAL_OFFSET_TABLE_, which, in position-independent code, is the difference between the address of the instruction that uses _GLOBAL_OFFSET_TABLE_ and the address of the global offset table. Thus, EBX now points to the global offset table.
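The arithmetic in that paragraph can be modelled in plain C. This is only a sketch: the function names are invented, every address is treated as an offset from a made-up load base, and it glosses over exactly which byte of the addl the linker measures from.

```c
#include <stdint.h>

/* Link time: the constant that replaces $_GLOBAL_OFFSET_TABLE_ in the addl.
 * Both arguments are offsets from the start of the program image. */
uint32_t got_immediate(uint32_t addl_offset, uint32_t got_offset)
{
    return got_offset - addl_offset;
}

/* Run time: the value EBX holds after the call/addl pair, for a given
 * load address chosen by the operating system. */
uint32_t ebx_after_setup(uint32_t load_base, uint32_t addl_offset,
                         uint32_t immediate)
{
    /* call __x86.get_pc_thunk.bx: the thunk copies its return address,
     * which is the run-time address of the addl, into EBX */
    uint32_t ebx = load_base + addl_offset;

    /* addl $_GLOBAL_OFFSET_TABLE_, %ebx */
    ebx += immediate;
    return ebx;
}
```

For any load_base, ebx_after_setup returns load_base + got_offset, which is why one link-time constant keeps working wherever the program ends up in memory.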
call add@PLT
Now we call add@PLT, which means call add, but jump through the "procedure linkage table" to do it. The PLT takes care of the possibility that add is defined in a shared library rather than the main executable. The code in the PLT uses the global offset table and assumes that you have already set EBX to point to it, before calling an @PLT symbol. That's why main has to set up EBX even though nothing appears to use it.
If you had instead written something like
extern int number;
int main(void) { return number; }
then you would see a direct use of the GOT, something like
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
movl number@GOT(%ebx), %eax
movl (%eax), %eax
We load up EBX with the address of the GOT, then we can load the address of the global variable number from the GOT, and then we actually dereference the address to get the value of number.
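In C terms, those two loads amount to a double indirection. The sketch below uses a one-entry array as a stand-in for the GOT; the array, the function name, and the initial value 42 are all invented for illustration, and in a real program the dynamic loader fills in the table.

```c
int number = 42;  /* stands in for the extern variable; the value is arbitrary */

/* A one-slot model of the GOT: each slot holds the address of a global */
static int *const model_got[] = { &number };

int load_number_via_got(void)
{
    int *const *ebx = model_got;  /* thunk call + addl: EBX = address of the GOT */
    int *eax = ebx[0];            /* movl number@GOT(%ebx), %eax */
    return *eax;                  /* movl (%eax), %eax */
}
```

The code never mentions the address of number directly; it only knows which slot of the table to look in.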
If you compile 64-bit code instead, you'll see something different and much simpler:
movl number(%rip), %eax
Instead of all this mucking around with the GOT, we can just load number from a fixed offset from the program counter. PC-relative addressing for data was added along with the 64-bit extensions to the x86 architecture. Similarly, your original program, in 64-bit position-independent mode, will just say
call add@PLT
without setting up EBX first. The call still has to go through the PLT, but the PLT uses PC-relative addressing itself and doesn't need any help from its caller.
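As a rough mental model, a call through the PLT behaves like a call through a function pointer stored in a table slot. The sketch below is an analogy only: the names are invented, the slot is initialized statically rather than by the dynamic linker, and lazy binding is ignored. A real PLT entry is a small piece of machine code that jumps through the slot rather than calling it.

```c
/* Stands in for the real add, which might live in a shared library */
static int add_impl(int a, int b)
{
    return a + b;
}

/* Models the table slot that holds the run-time address of add */
static int (*got_slot_for_add)(int, int) = add_impl;

/* Models the PLT entry: fetch the address from the slot and transfer
 * control to it */
int add_via_plt(int a, int b)
{
    return got_slot_for_add(a, b);
}
```

The caller's code stays the same no matter where add ends up; only the contents of the slot change.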
The only difference between __x86.get_pc_thunk.bx and __x86.get_pc_thunk.ax is which register they store their return address in: EBX for .bx, EAX for .ax. I have also seen GCC generate .cx and .dx variants. It's just a matter of which register it wants to use for the global pointer -- it must be EBX if there are going to be calls through the PLT, but if there aren't any then it can use any register, so it tries to pick one that isn't needed for anything else.
Why does it call a function to get the return address? Older compilers would do this instead:
call 1f
1: pop %ebx
but that screws up return-address prediction, so nowadays the compiler goes to a little extra trouble to make sure every call is paired with a ret.
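To see why the pairing matters, here is a toy model of a return-address predictor: a small stack that every call pushes onto and every ret pops from. It is a sketch under simplifying assumptions. The depth, the addresses, and all the names are invented, and real predictors are more elaborate than this.

```c
#include <stddef.h>
#include <stdint.h>

#define RAS_DEPTH 16

typedef struct {
    uint32_t entries[RAS_DEPTH];
    size_t top;             /* number of valid entries */
    unsigned mispredicts;
} ReturnStack;

/* A call pushes its return address onto the predictor stack */
static void on_call(ReturnStack *ras, uint32_t return_address)
{
    if (ras->top < RAS_DEPTH)
        ras->entries[ras->top++] = return_address;
}

/* A ret pops the predicted target and compares it with the real one */
static void on_ret(ReturnStack *ras, uint32_t actual_target)
{
    if (ras->top == 0 || ras->entries[--ras->top] != actual_target)
        ras->mispredicts++;
}

/* Function f is called from 0x1000 and uses the thunk to get its PC */
unsigned mispredicts_with_thunk(void)
{
    ReturnStack ras = {0};
    on_call(&ras, 0x1005);  /* caller: call f */
    on_call(&ras, 0x2005);  /* f: call __x86.get_pc_thunk.bx */
    on_ret(&ras, 0x2005);   /* thunk: ret */
    on_ret(&ras, 0x1005);   /* f: ret */
    return ras.mispredicts;
}

/* Same function f, but using the older call 1f / pop %ebx sequence */
unsigned mispredicts_with_call_pop(void)
{
    ReturnStack ras = {0};
    on_call(&ras, 0x1005);  /* caller: call f */
    on_call(&ras, 0x2005);  /* f: call 1f (the pop has no matching ret) */
    on_ret(&ras, 0x1005);   /* f: ret, but the predictor expects 0x2005 */
    return ras.mispredicts;
}
```

In this model the thunk version finishes with no mispredicted returns, while the call/pop version mispredicts the return from f and leaves a stale entry behind for the returns that follow.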
-fno-pie
to the command line. If that works, I will explain. – Rickettsialea
is part of gcc's very clunky code to align esp by 16. It does this only in main in 32-bit code, or if it needs more than 16-byte alignment for locals in other functions. gcc copies the return address to just above the frame pointer, and does something over-complicated with ecx for no reason, saving/restoring a pointer to 4 bytes above the %esp on entry. LEA is not complicated – Improvisator