Is the Microsoft Stack always aligned to 16-bytes?

In Assembly Language for x86 Processors, Seventh Edition, by Kip Irvine, page 211 says the following under 5.53 The x86 Calling Convention, which addresses the Microsoft x64 calling convention:

  1. When calling a subroutine, the stack pointer (RSP) must be aligned on a 16-byte boundary (a multiple of 16). The CALL instruction pushes an 8-byte return address on the stack, so the calling program must subtract 8 from the stack pointer, in addition to the 32 it already subtracts for the shadow space.

It goes on to show some assembly with a sub rsp, 8 right before the sub rsp, 20h (for the 32 bytes of shadow space).

Is this a safe convention, though? Is the Microsoft stack guaranteed to be aligned on 16 bytes before the CALL instruction? Or is the book wrong in assuming that:

  1. the stack was aligned to 16 bytes prior to the CALL,
  2. the CALL pushed an 8-byte return address onto the stack, and
  3. an additional sub rsp, 8 is required to get back to 16-byte alignment?
Mcabee answered 2/10, 2018 at 19:36 Comment(5)
Yes, the stack pointer (RSP) must be 16-byte aligned before a call to any external API. In your own asm code you can of course violate this (C/C++ won't let you). It is simply a convention, and every API expects and relies on it. And of course you don't need exactly sub rsp, 8; sub rsp, 78h and many other values work too. - Lazaretto
You are mixing up a requirement on code with the machine's capabilities. Code is required to align the stack to 16 bytes before a call, but machine code can easily break that rule by doing something non-conforming, such as pushing a word (2 bytes = 16 bits) onto the stack. Such code will very likely trigger a fault when the called code relies on the alignment assumption and RSP is wrong. But the CPU itself will not prevent you from calling other code with a misaligned RSP. - Nipissing
@Nipissing No, I'm asking about meeting the requirements of the x64 ABI. Is it safe to blindly adjust the stack by growing it 8 bytes for 16-byte alignment after every call? - Mcabee
sub rsp, 8 is exactly as safe as assuming that the code calling you behaves as required (i.e. safe, because code that breaks the rule has a bug and should be fixed). If the caller fails to fulfill that requirement, then sub rsp, 8 will also fail to realign RSP, and the functions you call next may fail because of it. It may take a while before you actually hit a function that depends on that alignment (for example, for aligned vectorized memory access), so you may often get away with calling functions with a misaligned RSP, but that is just a bug that hasn't shown itself yet, not correct code. - Nipissing
Related: Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment? The same design reasons apply for MS. - Heuristic

I'm asking about meeting the requirements of the x64 ABI. Is it safe to blindly adjust the stack by growing it 8 bytes for a 16-byte alignment after every call?

Yes, that's the whole point of the ABI requiring / guaranteeing 16-byte alignment before a call.
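
The arithmetic behind that guarantee can be checked with a small model. This is only a sketch in Python, not assembly: the starting address and the helper name is_aligned are assumptions chosen for illustration, and the subtractions stand in for CALL, sub rsp, 8 and sub rsp, 20h.

```python
def is_aligned(rsp, boundary=16):
    """Return True if rsp is a multiple of boundary."""
    return rsp % boundary == 0

# Hypothetical RSP value in the caller just before CALL (any multiple of 16 works).
rsp_before_call = 0x14FE00

# CALL pushes the 8-byte return address, so the callee starts 8 bytes off.
rsp_on_entry = rsp_before_call - 8

# The book's prologue: sub rsp, 8 followed by sub rsp, 20h (32 bytes of shadow space).
rsp_before_next_call = rsp_on_entry - 8 - 0x20

print(is_aligned(rsp_before_call))       # True
print(is_aligned(rsp_on_entry))          # False
print(is_aligned(rsp_before_next_call))  # True
```

A single sub rsp, 28h has the same effect, as does any adjustment that is an odd multiple of 8, such as the 78h mentioned in the comments.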


You can do whatever you want inside a function. For example, you could do three 16-bit pushes and then sub rsp, (24 - 3*2) to regain 16-byte stack alignment after entry to a function.

Or movq xmm0, rsp and then use rsp as an extra scratch register to get 16 total integer regs, until you restore it before making another call or ret (see footnote 1).

There's no requirement that RSP be 16-byte aligned after every instruction, only at function call boundaries. This is why they're called "calling conventions", not "coding standards".
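
To make the arithmetic of the push example concrete, here is a small Python model (a sketch, not assembly). The entry address and the variable names are assumptions for illustration; the only fact it relies on is that RSP is 8 bytes past a 16-byte boundary on function entry.

```python
# Hypothetical RSP on function entry: 8 bytes below a 16-byte boundary,
# because the caller was aligned and CALL pushed an 8-byte return address.
rsp = 0x14FE00 - 8
trace = [rsp % 16]

# Three 16-bit pushes, 2 bytes each: RSP is misaligned mid-function, which is allowed.
for _ in range(3):
    rsp -= 2
    trace.append(rsp % 16)

# sub rsp, (24 - 3*2): the total adjustment since entry is now 24 bytes.
rsp -= 24 - 3 * 2
trace.append(rsp % 16)

print(trace)  # [8, 6, 4, 2, 0]
```

The intermediate values 6, 4 and 2 are harmless as long as no call is made while RSP holds them; only the final 0 matters at the next call boundary.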

This is a similar concept to rbx being call-preserved. It doesn't matter if you save/restore it on the stack, in xmm0, in static storage, if you negate it and then negate it back again, or if you don't touch it at all. All that matters is that it has the same value when you return to the caller as it did when your function was called.


Footnote 1: Works as long as you don't have any async callbacks / SEH handlers that could possibly run on the user-space stack. This is not really guaranteed to be safe, but may work as a hack.

The question "Is it valid to write below ESP?" is related: as Ped7g points out, if something can asynchronously use space below the stack pointer, it will probably break if RSP isn't pointing to stack memory at all.

I've seen an example of a 32-bit Avisynth video filter (I think) that used this to get 8 tmp regs (when no MMX was available), with big warning comments in the code to debug first before using this trick.

Heuristic answered 2/10, 2018 at 19:59 Comment(4)
There was a recent question about accessing rsp-8 on Windows, and the conclusion I saw was that it is not formally safe (there's only a very minor mention in the MS docs suggesting the area below RSP is volatile), so you shouldn't use RSP as a GPR on Windows, although technically it seems to be safe with the current OS implementation... Wait, you were one of those answering it, I think, so you should know. Hm, then I'm somewhat confused by that example (which is IMO too condensed anyway for somebody not following your train of thought; from a mere movq xmm0, rsp it will be difficult to see). - Nipissing
@Ped7g: I've seen an example of a 32-bit video filter that used ESP as a temporary. I added more of a footnote about how potentially unsafe that is. And yes, good point that if anything would clobber space below the stack, it breaks if RSP isn't pointing to stack space at all. - Heuristic
I once tried to use ESP as a GPR to (allegedly) optimize some code, but it didn't improve the speed. Possible conclusion: fitting into L1 cache is more important than using another register. - Brander
@zx485: It's a pretty niche use case. But if you bottleneck on uop throughput and you can get rid of some spill/reloads that way, it could be a speedup. I think the case I remember seeing was a pixel-processing or audio DSP loop without MMX, so regs were at a premium. Not fitting in L1i cache with massive unrolling is certainly. - Heuristic
