Understanding stack alignment

Asked 8/2, 2018 at 11:3 Answered 8/2, 2018 at 11:22

Solved assembly x86-64 memory-alignment calling-convention abi

I'm reading Intel manual about Stack Frames. It was noted that

The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary.

I don't quite understand what it means. Does it mean that rsp should point to the address that is always aligned on 16?

I tried to experiment with it and wrote very simple program:

section .text
    global _start

_start:
    push byte 0xFF

    ;SYS_exit syscall

I ran it with gdb and noted that before executing the push instruction rsp = 0x7fffffffdcf0. And it was really aligned on 16. x/1xg $rsp returned 0x0000000000000001.

Now, after pushing the content of rsp became 0x7fffffffdce8. Is it a violation of the alignment requirements?

And what I also noticed x/1xg $rsp returned 0xffffffffffffffff. It means we set 1 to the next 8 bytes, not just one specified in the push instruction. Why? I expected the output of x/1xg $rsp after pushing to be 0x00000000000000FF (we pushed just one byte).

Kegan answered 8/2, 2018 at 11:3 Comment(6)

From the description of push in Intel's manual: "Operand size. The D flag in the current code-segment descriptor determines the default operand size; it may be overridden by instruction prefixes (66H or REX.W). The operand size (16, 32, or 64 bits) determines the amount by which the stack pointer is decremented (2, 4 or 8). If the source operand is an immediate of size less than the operand size, a sign-extended value is pushed on the stack." Hence why you get 0xffffffffffffffff pushed onto the stack. – Marielamariele 8/2, 2018 at 11:7

@Marielamariele It explains the value of the stack. Thanks. But why was it sign extented to quad-word, not to just word. I mean why do we have 0xFFFFFFFFFFFFFFFF, but not 0x000000000000FFFF? – Kegan 8/2, 2018 at 11:10

Antario: Because it's a 64-bit program, and hence the operand size mentioned in the description I pasted is 64 bits, unless you override it somehow. – Marielamariele 8/2, 2018 at 11:40

Upvoted because you read the docs and tried stuff yourself with gdb. – Trefor 8/2, 2018 at 12:51

Just a minor nitpick: the document linked is not the Intel's manual. It's the SYS V ABI, it's a conjoined effort. There is a version where Intel added the conventions for the bnd registers but still, Intel is no the main author of the document (and hopefully, it never will). – Lebeau 8/2, 2018 at 13:57

That's not an Intel manual. The x86-64 System V ABI is maintained by a collaboration of developers, not published by Intel. See #18134312. – Menial 10/2, 2018 at 2:33

rsp % 16 == 0 at _start - that's the OS entry point. It's not a function (there's no return address on the stack, instead RSP points at argc). Unlike functions, RSP is aligned by 16 on entry to _start, as specified by the x86-64 System V ABI.

From _start, you're ready to call a function right away, without having to adjust the stack, because the stack should be aligned before call. call itself will add 8B of return address, and you can expect the rsp % 16 == 8 upon entry, one more push away from 16-byte alignment. That's guaranteed upon entry to any function¹.

Upon app entry, you can trust the kernel to give you 16-byte RSP alignment, or you could align the stack manually with and rsp, -16 before calling any other code conforming to ABI. (Or if you plan to use C runtime lib, then the entry point of your app code should be main, and let libc's crt startup code code run as _start. main is a normal function like any other, so RSP & 0xF == 0x8 on entry to it when it's eventually called.)

Footnote 1: Unless you build with special options that change the ABI, like -mpreferred-stack-boundary=3 instead of the default 4. But that would make it unsafe to call functions in any code compiled without that. For example glibc scanf Segmentation faults when called from a function that doesn't align RSP

Now, after pushing the content of rsp became 0x7fffffffdce8. Is it a violation of the alignment requirements?

Yes, if you would at that point call some more complex function like for example printf with non trivial arguments (so it would use SSE instruction for implementation), it will highly likely segfault.

About push byte 0xFF:

That's not legal instruction in 64b mode (not even in 16 and 32 bit modes) (not legal in the sense of byte operand target size, byte immediate as source value is legal, but operand size can be only 16, 32 or 64 bits), so the NASM will guess the target size (any from legal ones, naturally picking qword in 64b mode), and use the guessed target size with the imm8 from source.

BTW use -w+all option to make the NASM emit (sort of weird, but at least you can investigate) warning in such case:

warning: signed byte value exceeds bounds

For example legit push word 0xFF would push only two bytes to stack, of word value 0x00FF.

How to align the stack: if you already know initial alignment, just adjust as needed before calling some ABI requiring subroutine (in common 64b code that is usually as simple as either not pushing anything, or doing one more redundant push, like push rbp).

If you are not sure about alignment, use some spare register to store original rsp (often rbp is used, so it also functions as stack frame pointer), and then and rsp,-16 to clear the bottom bits.

Keep in mind, when creating your own ABI conforming subroutines, that stack was aligned before call, so it is -8B upon entry. Again simple push rbp is often enough to resolve several issues at the same time, preserving rbp value (so mov rbp, rsp is possible "for free") and aligning stack for rest of subroutine.

EDIT: about encoding, source size, and immediate size...

Unfortunately I'm not 100% sure about how exactly this is supposed to be defined in NASM, but I think actually the push definition is so complex, that it breaks NASM syntax a bit (exhausting the current syntax to a point where you can't specify whether you mean operand size, or source immediate size, so it is silently assumed the size specifier is operand size mainly and affects immediate in certain cases).

By using push byte 0xFF the NASM will take the byte part ALSO as "operand size", not just as immediate size. And byte is not legal operand size for push, so NASM will instead choose qword as by default in 64b mode. Then it will also consider the byte as immediate size, and sign-extend the 0xFF to qword. I.e. this looks to me as a bit of undefined behaviour. NASM creators probably don't expect you to specify immediate size, because the NASM optimizes for size, so when you do push word -1, it will assemble that as "push word operand imm8". You can override that the other way, to make sure you get imm16 by push strict word -1.

See the machine code produced by the various combinations (in 64b mode) (some of them speaking strictly are worth at least of warning, or even error, like "strict qword" producing only imm32, not imm64 (as imm64 opcode does not exist of course) ... not even mentioning that the dword variants are effectively qword operand sizes, you can't use 32b operand size in 64b mode):

 6 00000000 6AFF                            push    -1
 7 00000002 6AFF                            push    strict byte 0xFF
 8          ******************       warning: signed byte value exceeds bounds
 9 00000004 6AFF                            push    byte 0xFF
10          ******************       warning: signed byte value exceeds bounds
11 00000006 6AFF                            push    strict byte -1
12 00000008 6AFF                            push    byte -1
13 0000000A 6668FF00                        push    strict word 0xFF
14 0000000E 6668FF00                        push    word 0xFF
15 00000012 6668FFFF                        push    strict word -1
16 00000016 666AFF                          push    word -1
17 00000019 68FF000000                      push    strict dword 0xFF
18 0000001E 68FF000000                      push    dword 0xFF
19 00000023 68FFFFFFFF                      push    strict dword -1
20 00000028 6AFF                            push    dword -1
21 0000002A 68FF000000                      push    strict qword 0xFF
22 0000002F 68FF000000                      push    qword 0xFF
23 00000034 68FFFFFFFF                      push    strict qword -1
24 00000039 6AFF                            push    qword -1

Anyway, I guess not too many people are bothered by this, as in 64b mode you usually want qword push (rsp -= 8) with immediate encoded in shortest possible way, so you just write push -1 and let the NASM handle the imm8 optimization itself, expecting rsp to change by -8 of course. And in other case, they probably expect you to know legal operand sizes, and not to use byte at all.

If you think this is not acceptable, I would raise this on the NASM forum/bugzilla/somewhere, how it is supposed to work exactly. As far as I'm personally concerned, the current behaviour is "good enough" for me (makes both sense, plus I give quick look to listing file from time to time to verify there's no nasty surprise in the machine code bytes and it landed as expected). That said, I mostly code size intros, so I know about every byte produced and it's purpose. If the NASM would suddenly produce imm16 instead of expected imm8, I would see it on the binary size and investigate.

Waldowaldon answered 8/2, 2018 at 11:22 Comment(21)

"so the NASM will guess the target size (any from legal ones, naturally picking qword in 64b mode), and sign-extend the value." Why would NASM do any sign-extension here? The encoding 6A ib (push imm8) is perfectly valid in 64-bit mode. There will be sign-extension, but there's no need for the assembler to do it. – Marielamariele 8/2, 2018 at 11:43

@Marielamariele yup, I wrote that before checking the Intel docs myself, so this part of my answer is misleading, you are correct. Only the target operand size is guessed, immediate is used as-is. – Waldowaldon 8/2, 2018 at 11:45

@Waldowaldon before checking the Intel docs myself Can you please point out the part of the docs? – Kegan 8/2, 2018 at 13:25

@Kegan "The operand size (16, 32, or 64 bits) determines the amount by which the stack pointer is decremented (2, 4 or 8). If the source operand is an immediate of size less than the operand size, a sign-extended value is pushed on the stack." - in that shortened web version. Not sure where it is in original Intel docs, but I guess the wording will be same, so using text search over Intel's pdfs should work. (by Intel docs I mean the instruction reference guide, I see you are calling "System V ABI" as "Intel manual", while that has not much to do with Intel, except being target platf.) – Waldowaldon 8/2, 2018 at 13:26

@Waldowaldon Kind of off-topic but anyway. Do you have any idea about that alignment requirements? Historical reason or performance... or something else? – Kegan 8/2, 2018 at 15:0

@Kegan performance of course, some SSE/SSE2 instructions need memory operands aligned, and as C and C++ do use stack heavily for local variables storage, it's simpler to avoid and rsp,... aligning everywhere by simply already produce code which works with stack in aligned way. Keep in mind this is super easy for compiler to keep track of it and handle it +- "for free", only rarely the unaligned code would have significantly fewer prologue/epilogue instructions. Of course inside code using SSE for FP and vectorization is then pure win. But this makes x86-64 a bit harder for humans. – Waldowaldon 8/2, 2018 at 16:53

@Kegan but honestly, with current C++ compilers you must do something extraordinary (billions of iterations) to really need write anything in assembly. Most of the time it is enough to do some small hand-holding for the compiler and keep the source in C++, as long as we are not talking about some crazy demanding stuff like AAA games, or CFD in formula 1 team, the 5-15% perf loss is affordable almost everywhere else, even things like autonomous cars have now serious HW performance, as long as the algorithm a resolutions of data are reasonable, the last 5-15% are negligible. – Waldowaldon 8/2, 2018 at 16:56

@Kegan Especially as you gain on the front of robustness/correctness and maintenance. And the SYS V ABI is relatively fresh, so it is not encumbered by any historical legacy, quite contrary, the ABI is completely different from 32b x86, designed specifically to incorporate all the experience from the two decades of 32b stack calling convention usage. – Waldowaldon 8/2, 2018 at 17:0

@Waldowaldon I am a bit confused by what was said in the intel instructions reference. If the source operand is an immediate of size less than the operand size, a sign-extended value is pushed on the stack. Sign extended to what size? 64-bit (because it's 64-bit program)? Also like @Marielamariele said 6A ib is a valid 64-bit instruction, but in the intel reference said that For instructions in which imm8 is combined with a word or doubleword operand, the immediate value is signextended to form a word or doubleword. So imm8 in my push byte 0xFF should be extended to the word size... Whatswrong? – Kegan 9/2, 2018 at 11:54

@Kegan is the addendum to answer understandable? Basically I'm admitting I don't know myself, and in my own usage I rather go by checking the resulting machine code to verify it works as expected (and in 64b mode I don't want to use other than 64b operand size any way, so this quirk(?) of NASM syntax is not bothering me). – Waldowaldon 9/2, 2018 at 12:55

@Ped7g: You don't need to manually align the stack on process entry: rsp is 16-byte aligned at _start, and the ABI guarantees that. (It sounds like your first paragraph is saying you expect rsp+8 to be aligned at _start, but _start isn't a function that's called` by anything. The process-entry state is different from the function-call state. So if the kernel follows the ABI (which it does), then the OP's code would violate the ABI if it called any (non-private) function after misaligning the stack by doing a push.) – Menial 10/2, 2018 at 2:41

@St.Antario: The immediate is sign-extended to the operand-size. Of course it's not sign-extended only to word when the operand-size is qword. push byte 0xff is weird thing to write, BTW. byte there would normally specify the operand-size for the instruction, not the width of the immediate. Use push strict byte 0xff to specify the width of the immediate in the machine encoding. (Although apparently NASM and YASM interpret push byte 0xff that way, too, because byte operand-size for push isn't available.) e.g. push word 1 is 66 6A 01, push strict word 1 is 66 68 01 00 – Menial 10/2, 2018 at 2:47

@PeterCordes Now I unederstood the mechanic. If we have 0x66 instruction prefix in 64b mode it means the operand size is overriden to 16 and the value will be size extended to match the operand size. I tried to experiment with it to get the instruction like 6668FFFFFFFF but did not succeed. Is the instruction possible on x86-64? – Kegan 12/2, 2018 at 14:55

@Kegan In x86-64 the 66 68 FF FF is push word 0xFFFF (valid instruction) .. the additional two FF FF will be interpreted as next opcode, which is invalid combination. in NASM you can get that opcode by using push strict word -1, see the listing line in answer starting with 15 00000012 ... ... so I'm not sure whether you didn't understand what that listing represents, or what do you mean by 6668FFFFFFFF .. you can't tell CPU how long the opcode is, the opcode has it's desired length, defined by the opcode itself, i.e. 6668 is push imm16 in 64b mode, so +2B more to read. – Waldowaldon 12/2, 2018 at 15:7

@Waldowaldon Not really. I meant the push imm32 (68FFFFFFFF) instruction with operand size prefix (66). In total 6668FFFFFFFF. It's not specified in the reference (at least I did not find) that operand size cannot be overriden for instructions like push imm32. All they said is

Operand size. The D flag in the current code-segment descriptor determines the default operand size; it may be overridden by instruction prefixes (66H or REX.W)

So 6668FFFFFFFF seemed valid to me.. – Kegan 12/2, 2018 at 15:9

@St.Antario: there's no such thing. See again the listing, I used there all combinations I could thought of, should be quite complete for x86-64 ... with the catch, that the dword push instructions are 64 bit in effect. There's no way to push/pop 32 bit value on stack in 64b mode. Only 64 or 16 bits. (not counting workarounds like mov eax,[rsp] add rsp,4). ..... EDIT: the 66 prefix is valid, but the effect is "16 bit", not "32 bit", in case of push imm. And it is about operand size, not imm size (that depends on opcode 68 vs 6A). – Waldowaldon 12/2, 2018 at 15:13

@Waldowaldon There's no way to push/pop 32 bit value on stack in 64b mode Yes, I understand that. They will be sign-extended to the operand size (in instructions like push imm32). But I'm asking about a bit different things... Is it possible to add operand-size prefix to all instructions? Even to push imm32. I already saw 6668FFFF in the listing you posted (push imm16 with operand size prefix) – Kegan 12/2, 2018 at 15:15

@Kegan Hmm.. maybe other way.... 68 push opcode in 64b mode is by default reading 4 bytes (imm32), and will push sign-extended 8 bytes (64b) to stack, and adjust rsp by -8. If you will prefix it with 66, it will read only two bytes, and store two bytes into stack, adjusting rsp by -2. There's no push imm opcode in 64b to adjust rsp by -4, or to use imm64 as source value, i.e. you can use as source value imm8/16/32 (6A/66 68/68) and operand size 16 or 64 (66 prefix or no prefix). – Waldowaldon 12/2, 2018 at 15:24

@St.Antario: So you want to decrement RSP by 8 and store 0x00000000FFFFFFFF to [rsp]? I'd probably use mov eax, -1 / push rax. 6 bytes, 2 uops. That way you get a 32-bit immediate zero-extended to 64 bits for free (instead of sign-extended 32 or an actual 64-bit immediate if you'd used 64-bit operand size). – Menial 12/2, 2018 at 16:12

@Waldowaldon I'm completely confused now. You said 68 push opcode in 64b mode is by default reading 4 bytes. But in NASM docs, section 3.7 they provided the following example for 16-bit mode 66 68 21 00 00 00 which also seems to read 4 bytes, but in 16-bit mode. – Kegan 13/2, 2018 at 8:45

@Kegan yes, but that's a) prefixed with 66. In 64b mode that will make it 16 bit (2 bytes immediate). b) in 16-bit mode, that's like completely different CPU. There are basically three different CPUs in modern x86, 16/32/64 bit, majority of instruction opcodes overlap, but quite some are different. Like in this case 66 68 means something completely different in each mode. (push imm32/imm16/imm16 for 16/32/64 modes). Or consider mov ax,bx encoding (in 16b mode)... in 32 bit mode that opcode will execute mov eax,ebx instead) – Waldowaldon 13/2, 2018 at 8:53

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags