What does "aligning the stack" mean in assembly?

How does stack alignment work in x64 assembly? When do you need to align the stack before a function call, and how much do you need to subtract?

I don't understand the purpose of it. I know there are other posts about this, but they weren't clear enough for me. For example:

extern foo
global bar

section .text
bar:
  ;some code...
  sub  rsp, 8     ; Why 8 (I saw this in some posts)? Can it be another value? Why do we need to subtract?
  call foo        ; Do we need to align the stack every time we call a function?
  add  rsp, 8
  ;some code...
  ret

Magnetohydrodynamics answered 7/11, 2020 at 15:0 Comment(5)

Maybe you should start by learning about "memory alignment". – Guenon 7/11, 2020 at 15:3

Perhaps you should explain in more detail what you read and what wasn't clear. Otherwise you are likely to just get votes to close as duplicate, or people rewriting the same things you have already read. – Euphoria 7/11, 2020 at 15:6

The AMD64 System V ABI (and the Microsoft 64-bit ABI) require such alignment. At the point an ABI-compliant function is called with call, the stack must be aligned on a 16-byte boundary. When the first instruction of the function is reached, the stack is misaligned by 8 bytes because the call instruction pushed the return address. You can get the stack back onto a 16-byte boundary by subtracting 8 from RSP (the stack pointer) or by pushing a 64-bit register. – Introgression 7/11, 2020 at 15:7

@MichaelPetch Thanks, so "stack aligned" means that the address in rsp is a multiple of 16? – Magnetohydrodynamics 7/11, 2020 at 15:36

Yes, if the value in RSP is evenly divisible by 16 the stack is aligned on a 16 byte boundary. – Introgression 7/11, 2020 at 15:48

Addressing is generally byte-based. A unique address points at a byte (which can be the first byte in a word or doubleword, etc, but referenced to that address).

With any numbering system the least significant digit holds the value base to the power 0 (the number 1). The next holds base to the power 1, the next base to the power 2. In decimal this is the ones column, the tens column, the hundreds column. In binary: ones, twos, fours... Alignment means evenly divisible by some power of the base, which also means the least significant digits are zeros.

You are always "aligned" on a byte boundary, but a 16 bit boundary in binary means the least significant bit of the address is zero, 32 bit aligned means two zeros, and so on.

0x1234 aligned on both a 16 and 32 bit boundary but not 64 bit
0x1235 not aligned (byte alignment really isn't a thing)
0x1236 aligned on a 16 bit boundary
0x1230 four zeros so 16, 32, 64, 128 BITS not bytes. 2,4,8,16 bytes.
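The "low bits are zero" rule can be sketched in C as a mask test. This is only an illustration, not something from the answer itself: the function name is_aligned is made up, and it assumes the alignment is a power of two given in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when addr is a multiple of align_bytes.
   align_bytes must be a power of two (1, 2, 4, 8, 16, ...). */
bool is_aligned(uint64_t addr, uint64_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    /* align_bytes - 1 is a mask covering exactly the low bits
       that have to be zero for the address to be aligned. */
    return (addr & (align_bytes - 1)) == 0;
}
```

With the addresses listed above, is_aligned(0x1234, 4) is true but is_aligned(0x1234, 8) is false, and 0x1230 passes for 2, 4, 8 and 16 bytes.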

The why is performance. All memories have a fixed width, as do data buses. You can't magically add or remove wires in the logic once it is implemented; there is a physical limit. You can choose not to use all of them as part of the design, but you can't add any.

So while the x86 buses are wider, let's say you had a 32 bit wide data bus as well as a 32 bit wide memory (think cache, but also DRAM, although in general we don't access DRAM directly).

If I want to save the 16 bits 0xAABB to address 0x1001 in a little endian machine then 0x1001 will get 0xBB and 0x1002 will get 0xAA. If I had a 32 bit data bus and a 32 bit memory on the far side of it then I could move those 16 bits if I designed the bus for this, by writing 0xXXAABBXX to address 0x1000 with a byte lane mask of 0b0110 telling the memory controller to use the 32 bits of memory associated with the BYTE based address 0x1000, and the byte lane mask on the bus telling the controller only save the middle two bytes, the outer two are don't cares.

The memory is generally a fixed width, so all transactions must be full width: it would read the 32 bits, modify the 16 in the middle with 0xAABB, and write the 32 bits back. This is of course inefficient. Even worse would be to write 0xAABB to 0x1003. That would be two bus transactions, one for 0xBBXXXXXX at address 0x1000 and one for 0xXXXXXXAA at address 0x1004. That is a lot of extra cycles, both on the bus and in the read-modify-writes on the memory.
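To make the two cases above concrete, here is a rough C model of the hypothetical 32 bit bus. It is a sketch under stated assumptions: the function names are invented, the bus width is a power of two no larger than 32 bytes, and bit i of the lane mask stands for the byte at offset i within the bus word, which matches the 0b0110 mask used above.

```c
#include <assert.h>
#include <stdint.h>

/* How many bus-width words an access of `size` bytes starting at
   `addr` touches. One word means one transaction in this toy model. */
unsigned bus_words_touched(uint64_t addr, uint64_t size, uint64_t bus_bytes)
{
    assert(size > 0);
    assert(bus_bytes != 0 && (bus_bytes & (bus_bytes - 1)) == 0);
    uint64_t first = addr / bus_bytes;
    uint64_t last = (addr + size - 1) / bus_bytes;
    return (unsigned)(last - first + 1);
}

/* Byte lane mask for the first word touched: bit i is set when the
   byte at offset i inside that word is written. */
unsigned first_word_lane_mask(uint64_t addr, uint64_t size, uint64_t bus_bytes)
{
    assert(bus_bytes != 0 && bus_bytes <= 32);
    assert((bus_bytes & (bus_bytes - 1)) == 0);
    uint64_t offset = addr % bus_bytes;
    unsigned mask = 0;
    for (uint64_t lane = offset; lane < bus_bytes && lane < offset + size; lane++)
        mask |= 1u << lane;
    return mask;
}
```

A 2 byte write at 0x1001 touches one word with lanes 1 and 2 enabled (mask 0x6, i.e. 0b0110), while the same write at 0x1003 touches two words and only lane 3 of the first one.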

Now the stack alignment rules are not going to prevent read-modify-writes on writes. But for larger transfers there is an opportunity for a performance gain. For example, if the bus and the memory were both 32 bits wide and you did a 64 bit transfer to address 0x1000, that can, depending on the bus design, look like a single transfer with a length of two: the bus handshake happens once, then the data moves on two back-to-back clocks, rather than one handshake per bus width of data. And since the memory is 32 bits wide, that is two full-width writes into the SRAM in the cache with no read-modify-write. Pretty clean; you want to avoid the read-modify-writes.

Now do this for a while as things evolve and the hardware and the tools desire a stack alignment.

Depending on the instruction set (clearly you are asking about x86 here), as a programmer you can sometimes choose to, say, push a byte on the stack and then adjust the stack pointer to realign it. Or, if you are making room for local variables and the stack pointer is general purpose enough to do math on, you can simply subtract: sub sp,#8 makes the same room as pushing two 32 bit items on the stack.

If the rule is say 32 bit alignment and you push a byte, then you need to adjust the stack pointer by 3 to make the total change in the stack pointer a multiple of 4 bytes (32 bits).

How you know how much is you simply count it up. If it is 16 byte alignment and you push 4 then you need to push 12 more or adjust the stack pointer by 12 more.
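The "count it up" arithmetic can be written down as a small C helper. This is just a sketch: stack_padding is a made-up name, and it assumes the stack pointer was aligned before you started pushing and that the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extra bytes to push (or subtract from the stack
   pointer) so that bytes_pushed + result is a multiple of align_bytes. */
uint64_t stack_padding(uint64_t bytes_pushed, uint64_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    uint64_t remainder = bytes_pushed & (align_bytes - 1);
    return remainder == 0 ? 0 : align_bytes - remainder;
}
```

For the two cases above, stack_padding(1, 4) gives 3 and stack_padding(4, 16) gives 12.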

The key here is that if everyone agrees to keep the stack aligned then you don't actually have to look at the lower bits of the stack pointer, you just keep track of what you are pushing and popping before calling something else.

If the stack is shared with interrupt handlers (not really the case on current x86 running an operating system, but still possible, and possible in many other use cases for general purpose processors), I have not seen this rule applied there. You will see the compiler do a less-than-aligned push or pop and then adjust with other pushes or pops, or with subtraction or addition. If an interrupt occurred between those, the handler would see an unaligned stack.

Some architectures will fault on unaligned accesses, a further reason for keeping the stack aligned.

If your code is not messing with the stack then you don't need to mess with the stack (pointer). Only if you use the stack in your code by allocating space on the stack (pushes or math on the stack pointer), do you need to care and you need to know what the convention of the compiler you are linking this code with and conform to that. If this is all assembly language and no compiler then you decide the convention yourself and basically do whatever you want within the limitations of the processor itself.

As for your title question: it has nothing to do with assembly at all, nor machine code. It has to do with your code and what it does. The assembly language is simply a language in which you convey how much you want to adjust the stack pointer. The instruction doesn't care or know about any such things; it takes the constant provided and uses it against the register. Assembly is one of the few languages, if not the only one, that allows you to do math on the stack pointer register, so there is that connection. But alignment and assembly are not related.

Enactment answered 7/11, 2020 at 16:0 Comment(0)

When do you need to align the stack before a function call and ....?

You need to align the stack when the function you're calling expects an aligned stack.

Functions that were written in other languages (e.g. C), and functions that are written in assembly but are designed to be called from other languages, will comply with some kind of calling convention (which includes much more than just stack alignment: how parameters are passed, where parameters are, things like the "red zone", etc); and for 64-bit 80x86 the two common calling conventions expect the stack to be aligned to a 16-byte boundary.

In a "pure assembly" project where you're calling functions that were written in assembly for assembly callers; the programmer is free to do whatever they like (e.g. whatever is best for performance) without caring about the limitations/restrictions of other languages that reduce performance (calling conventions). In this case you may never need to align the stack at all (but if you're dealing with AVX-512 a function might want the stack aligned to 64 bytes, and if you're dealing with AVX2 a function might want the stack aligned to 32 bytes, and ..).

... and how much do you need to subtract?

If you don't know if the stack was aligned enough; then aligning the stack is typically done with AND (e.g. maybe and rsp,0xFFFFFFFFFFFFFFF0 to align the stack to a 16-byte boundary). This also means that you need to store the old stack pointer somewhere so that you can restore it; which often means 4 more instructions (push rbp, mov rbp,rsp before the alignment, then mov rsp,rbp and pop rbp to restore things later).
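The effect of the AND can be modelled in C on a plain integer. This is only a sketch of the arithmetic, not real stack manipulation: align_stack_down is a hypothetical name and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "and rsp, -align_bytes": clear the low bits so
   the value is rounded DOWN to a multiple of align_bytes. Rounding down
   is the safe direction because the stack grows toward lower addresses. */
uint64_t align_stack_down(uint64_t sp, uint64_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    /* For align_bytes == 16 the mask is 0xFFFFFFFFFFFFFFF0, i.e. -16. */
    return sp & ~(align_bytes - 1);
}
```

The result can be anywhere from 0 to align_bytes - 1 bytes below the original value, and that amount is not known when the code is assembled. That is why the old stack pointer has to be saved somewhere before the AND.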

However; if you know that your caller aligned the stack for you (and that functions you call want the same or less alignment), then you can calculate how much extra to subtract by keeping track of how much you pushed on the stack. For example, if the stack was aligned to 32 bytes by your caller, and you push four 64-bit (8 byte) values on the stack and a call instruction will push another 64-bit value (return address); then it'd be a total of 5*8 = 40 bytes; so you'd know you need to subtract another 8 bytes to make the total 48 bytes if you want to align to 16 bytes, or subtract another 24 bytes to make the total 64 bytes if you want to align to 32 bytes. This also avoids the need to save the original stack pointer (you can add whatever you subtracted later) so it can save 4 instructions.
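The 5*8 = 40 example can be checked with a small C simulation. Treat it as a sketch: both function names are invented, every push is assumed to be 8 bytes, and it follows this example's accounting, where the return address is counted and alignment is measured at the callee's first instruction. The System V and Microsoft x64 ABIs instead measure alignment just before the call, as described in the comments on the question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest amount to subtract so that
   (pushes * 8) + 8 bytes of return address + result is a multiple
   of align_bytes. Assumes the stack pointer started on such a boundary. */
uint64_t pad_for_call(unsigned pushes, uint64_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    uint64_t total = (uint64_t)8 * pushes + 8;
    uint64_t remainder = total & (align_bytes - 1);
    return remainder == 0 ? 0 : align_bytes - remainder;
}

/* Stack pointer value the callee sees at its first instruction. */
uint64_t sp_at_callee_entry(uint64_t sp_on_entry, unsigned pushes, uint64_t pad)
{
    uint64_t sp = sp_on_entry;
    sp -= (uint64_t)8 * pushes; /* the 8-byte pushes           */
    sp -= pad;                  /* sub rsp, pad                */
    sp -= 8;                    /* return address from call    */
    return sp;
}
```

Starting from 0x7000 (a multiple of 32) with four pushes, pad_for_call(4, 16) is 8 and pad_for_call(4, 32) is 24, and the simulated callee sees 0x6FD0 and 0x6FC0 respectively.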

Of course (for "pure assembly") you'd look at the requirements of all the functions you call and pick the worst case and align the stack to that once (and avoid aligning the stack multiple times differently, once for each function you call); and you might say "my function requires the stack to be aligned to whatever the worst case is for the functions I call" to ensure that you can calculate how much to subtract (and avoid the more expensive "AND with ..." approach). However (for "pure assembly") this places the burden on your caller (who may place the burden on their caller, who may....) so it can make performance worse (all of the ancestors in the call chain have to do extra work so that you can do slightly less work). In other words; for "pure assembly"; achieving the highest efficiency/performance takes a lot of work (to determine if/when the stack should be aligned and by how much, and to minimize the expense of ensuring the stack is aligned where necessary).

This is also part of why compilers put the alignment in their calling conventions - a required "unlikely to be optimal most of the time" standard alignment makes it easier for the compiler.

Raby answered 7/11, 2020 at 16:5 Comment(4)

If only a few functions ever need more than 8-byte stack alignment, you might still choose to only maintain 8-byte stack alignment and have those functions use and rsp, -32 or whatever when they want aligned local arrays. (They'd also need to set up RBP as a frame pointer, or do something else to make it possible to restore the old RSP, though.) Taking the largest alignment you ever want and maintaining that throughout your whole program could end up being more expensive, especially if those more-aligned functions aren't called very often. – Auroora 8/11, 2020 at 3:43

The choice of both Windows x64 and x86-64 System V to maintain 16-byte stack alignment is pretty good, and allows aligned spill/reload of XMM registers, and more efficient auto-vectorization of legacy-SSE loops over local arrays or single objects (see "Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?"), at a cost of just 1 dummy push/pop (or sub/add) and at most 8 bytes of wasted space per stack frame. Some of the benefit is specific to compilers, not hand-written code, e.g. maintaining alignof(long double), but 16B is nice. – Auroora 8/11, 2020 at 3:46

@PeterCordes: From my perspective; the existence of an ABI is an admission of failure - an ugly consequence of ancient work-arounds for "not enough memory to have compiler + whole program in memory at the same time" combined with 50+ years of tools failing to modernize. It's why my answer separates "pure assembly" (free from the failures of compilers) from cases where you do have to put up with an ABI. – Raby 8/11, 2020 at 5:22

Sure, that's a good way of looking at things. From that PoV, there should be a way for each function in a shared library to indicate its calling convention, and static libraries should have LTO bytecode, not machine code. But nevertheless, in "pure asm" you were still talking about how much stack alignment to maintain across calls, even if you fully customize the calling convention on a per-function basis. Maintaining a large alignment everywhere may suck. I only brought up ABIs to discuss their choice of 16B being maybe good for pure asm (not just for their more rigid constraints). – Auroora 8/11, 2020 at 5:28