Windows: avoid pushing full x86 context on stack

I have implemented PARLANSE, a language under MS Windows that uses cactus stacks to implement parallel programs. The stack chunks are allocated on a per-function basis and are just the right size to handle local variables, expression temp pushes/pops, and calls to libraries (including stack space for the library routines to work in). Such stack frames can be as small as 32 bytes in practice and often are.
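
Roughly, in C terms, the per-function chunk allocation looks like the sketch below (illustrative only; the type and field names are hypothetical, not PARLANSE's actual runtime):

    #include <stdlib.h>

    typedef struct StackChunk {
        struct StackChunk *parent;   /* caller's chunk: the "cactus" link */
        char data[];                 /* locals + temps + library slop */
    } StackChunk;

    /* Called in a function prologue; 'bytes' is computed per function at
       compile time and can be as small as 32. Error handling elided. */
    static StackChunk *alloc_chunk(StackChunk *parent, size_t bytes)
    {
        StackChunk *c = malloc(sizeof *c + bytes);
        c->parent = parent;          /* frames form a tree, not a single line */
        return c;
    }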

This all works great unless the code does something stupid and causes a hardware trap... at which point Windows appears to insist on pushing the entire x86 machine context "on the stack". This is some 500+ bytes if you include the FP/MMX/etc. registers, which it does. Naturally, a 500 byte push on a 32 byte stack smashes things it should not. (The hardware pushes a few words on a trap, but not the entire context).

[EDIT 11/27/2012: See this for measured details on the ridiculous amount of stack Windows actually pushes]

Can I get Windows to store the exception context block someplace else (e.g., to a location specific to a thread)? Then the software could take the exception hit on the thread and process it without overflowing my small stack frames.

I don't think this is possible, but I thought I'd ask a much larger audience. Is there an OS standard call/interface that can cause this to happen?

It would be trivial to do in the OS if I could con MS into letting my process optionally define a context storage location, "contextp", initialized by default to preserve the current legacy behavior. The interrupt/trap vector code would then change from:

  hardwareint:  push  context            ; legacy: dump the full context at ESP
                mov   contextp, esp

... with ...

  hardwareint:  mov   <somereg>, contextp
                test  <somereg>, <somereg>  ; has the app registered a buffer?
                jnz   $2
                push  context               ; no: legacy behavior
                mov   contextp, esp
                jmp   $1
         $2:    store context @ <somereg>   ; yes: spill the context there instead
         $1:    equ   *

with the obvious changes required to save somereg, etc.
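
In user-mode terms, the interface I'm wishing for might look like this C sketch (entirely hypothetical; no such Windows API exists, and this stub only shows the intended shape):

    #include <windows.h>

    /* Hypothetical API: ask the OS to spill trap contexts into a per-thread
       buffer instead of pushing them onto the current stack. */
    static __declspec(thread) CONTEXT trap_context_buffer;
    static __declspec(thread) CONTEXT *contextp;   /* NULL = legacy behavior */

    BOOL SetThreadTrapContextBuffer(CONTEXT *buf)
    {
        contextp = buf;     /* the real thing would have to tell the kernel */
        return TRUE;
    }

    void init_grain_thread(void)
    {
        SetThreadTrapContextBuffer(&trap_context_buffer);
    }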

[What I do now: check the generated code for each function. If it has a chance of causing a trap (e.g., divide by zero), or we are debugging (possible bad pointer deref, etc.), add enough space to the stack frame for the FP context. Stack frames now end up ~500-1000 bytes in size, and programs can't recurse as far, which is sometimes a real problem for the applications we are writing. So we have a workable solution, but it complicates debugging.]
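
In compiler terms, that workaround amounts to roughly this C sketch (names and the slop constant are illustrative, not the actual PARLANSE internals):

    #include <stddef.h>

    enum { TRAP_CONTEXT_SLOP = 1024 };   /* room for Windows' context push */

    typedef struct {
        size_t locals_bytes, temps_bytes;
        int    may_trap;                 /* FP ops, possible bad derefs, ... */
    } FuncInfo;

    static size_t frame_bytes(const FuncInfo *f, int debugging)
    {
        size_t n = f->locals_bytes + f->temps_bytes;
        if (f->may_trap || debugging)
            n += TRAP_CONTEXT_SLOP;      /* pay only where a trap can land */
        return n;
    }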

EDIT Aug 25: I've managed to get this story to a Microsoft internal engineer who apparently has the authority to find out who in MS might actually care. There might be faint hope for a solution.

EDIT Sept 14: An MS Kernel Group Architect has heard the story and is sympathetic. He said MS will consider a solution (like the one proposed), but it is unlikely to arrive in a service pack; it might have to wait for the next version of Windows. (Sigh... I might grow old...)

EDIT: Sept 13, 2010 (1 year later). No action on Microsoft's part. My latest nightmare: does taking a trap in a 32-bit process on Windows x64 push the entire x64 context on the stack before the interrupt handler fakes pushing a 32-bit context? That would be even larger (twice as many integer registers, each twice as wide, and twice as many SSE registers?).

EDIT: February 25, 2012: (1.5 years have gone by...) No reaction on Microsoft's part. I guess they just don't care about my kind of parallelism. I think this is a disservice to the community; the "big stack model" used by MS under normal circumstances limits the number of parallel computations one can have alive at any one instant by eating vast amounts of VM. The PARLANSE model lets an application have a million live "grains" in various states of running/waiting; this really occurs in some of our applications, where a 100-million-node graph is processed "in parallel". The PARLANSE scheme can do this with about 1Gb of RAM, which is pretty manageable. If you tried that with MS's 1Mb "big stacks" you'd need 10^12 bytes of VM just for the stack space, and I'm pretty sure Windows won't let you manage a million threads.

EDIT: April 29, 2014: (4 years have gone by). I guess MS just doesn't read SO. I've done enough engineering on PARLANSE that we only pay the price of large stack frames during debugging or when there are FP operations going on, so we've managed to find very practical ways to live with this. MS has continued to disappoint; the amount of stuff pushed on the stack by various versions of Windows seems to vary considerably, and egregiously above and beyond the need for just the hardware context. There's some hint that some of this variability is caused by non-MS products (e.g., antivirus) sticking their nose into the exception handling chain; why can't they do that from outside my address space? Anyway, we handle all this by simply adding a large slop factor for FP/debug traps, and waiting for the inevitable MS system in the field that exceeds that amount.

Elegist answered 15/6, 2009 at 5:4 Comment(6)
If you patch ntdll.dll in memory, the changes will only be seen in the current process (copy-on-write). I would assume that a direct address is used, not the IAT, but you could overwrite the first few bytes of the handler with a JMP to your own code and return to ring 3. Windows might have some security in place to prevent this kind of thing, but it's worth a shot. – Ambition
Now, that's a thought. You're suggesting the target of the IDT is in ntdll.dll and that I can step on it? How do I figure out where the IDT points, or is that a published entry point in ntdll.dll? Where do I find out more about the structure of ntdll.dll? To echo a phrase I just heard, "This will keep me busy awhile. Thanks"! – Elegist
Oops... I wrote IDT; I mean the interrupt vector, or whatever the x86 architecture calls it these days. (I have the x86 manuals, so this is a rhetorical statement :-) – Elegist
How about this... Before instructions that may cause an exception, you set xSP to point to a location that has enough space for all that on-stack exception data (the CPU/FPU state and what not), and after that instruction you restore xSP. If there's no exception, the overhead is small. If there is, you wouldn't even notice the overhead. – Rumba
@Alex: Not a bad idea, if all the interrupts are purely synchronous with respect to some code event. For this language, I also start and stop threads asynchronously to ensure some degree of computational fairness... so sometimes such a push can be caused from outside. I might give that up to get more manageable stack frames. – Elegist
@Alex: One of the problems is reporting illegal memory accesses (bad pointer dereferences; people can make that mistake in PARLANSE). If I want to be able to report that, and unwind stack frames to give a backtrace, the stack frames have to be "undamaged" by the trap. But that means all stack frames have to have space to take the trap... ick. [We actually have a compiler option for this, which we use during code testing, where we can sort of afford the space. Production code can really nest millions of calls deep, and we can't afford it there.] – Elegist

Basically you would need to re-implement many interrupt handlers, i.e. hook yourself into the Interrupt Descriptor Table (IDT). The problem is that you would also need to re-implement a kernel-mode -> user-mode callback (for SEH this callback resides in ntdll.dll and is named KiUserExceptionDispatcher; it triggers all the SEH logic). The point is that the rest of the system relies on SEH working the way it does right now, and your solution would break things because you would be doing it system-wide. Maybe you could check which process you are in at the time of the interrupt. However, the overall concept is prone to errors and very badly affects system stability imho.
These are actually rootkit-like techniques.

Edit:
Some more details: the reason you would need to re-implement interrupt handlers is that exceptions (e.g. divide by zero) are essentially software interrupts, and those always go through the IDT. When the exception has been thrown, the kernel collects the context and signals the exception back to user mode (through the aforementioned KiUserExceptionDispatcher in ntdll). You'd need to interfere at this point, and therefore you would also need to provide a mechanism to get back to user mode. (There is a function in ntdll which is used as the entry point from kernel mode; I don't remember the name, but it's something with KiUserACP.....)
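
To make the ordering concrete, here is a minimal vectored-exception-handler sketch in C (standard Win32; it only demonstrates that by the time any user-mode handler runs, the kernel has already written the full CONTEXT):

    #include <windows.h>

    static LONG CALLBACK trap_handler(EXCEPTION_POINTERS *ep)
    {
        /* ep->ContextRecord already points at the complete saved context;
           the stack damage described in the question happened before
           this code ever gets control. */
        CONTEXT *ctx = ep->ContextRecord;
        (void)ctx;
        return EXCEPTION_CONTINUE_SEARCH;   /* pass it along the chain */
    }

    void install_handler(void)
    {
        AddVectoredExceptionHandler(1 /* call us first */, trap_handler);
    }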

Discharge answered 17/6, 2009 at 12:19 Comment(4)
Yeah, that's pretty radical. I'm not sure I want to go around patching the OS. – Elegist
Yes, but there is no other way to achieve what you want, because the whole process of exception handling is triggered from kernel mode. – Discharge
I was hoping MS was smart enough to understand the kind of problem I'm having (after all, aren't they providing the foundations for the future in Windows :-), so that all I had to do was use the right API. Sounds like No Such Luck. – Elegist
So is the IDT visible/changeable by a mere user process? How? – Elegist

Consider decoupling the parameter/local stack from the real one. Use another register (e.g. EBP) as the effective stack pointer, and leave the ESP-based stack the way Windows wants it.

You can't use PUSH/POP anymore. You'd have to use a SUB/MOV/MOV/MOV combo instead of PUSH. But hey, it beats patching the OS.
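
In C terms, the decoupled stack is just explicit memory plus a dedicated pointer (a sketch; the real thing would pin `top` in a register such as EBP):

    typedef struct { char *top; } SoftStack;    /* grows downward, like ESP */

    static void soft_push(SoftStack *s, int v)  /* replaces PUSH */
    {
        s->top -= sizeof v;
        *(int *)s->top = v;
    }

    static int soft_pop(SoftStack *s)           /* replaces POP */
    {
        int v = *(int *)s->top;
        s->top += sizeof v;
        return v;
    }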

Tchad answered 15/6, 2009 at 5:4 Comment(12)
Yes, that would technically work. It sure gives up a lot in code density. The scheme I have works, at the price of making stack frames too big when there are floating point ops around, and/or when the program might trap on an illegal memory reference and I want to provide a good backtrace. We currently compile in two modes: a) production mode, with minimal stack frames (sometimes as small as 32 bytes), but no ability to recover from a machine trap other than "program died @xxx", and b) debug mode, which adds an egregious amount (1500 bytes) to each stack frame, giving enough slop for MS. – Elegist
I thought you were out to optimize for speed at the expense of memory. – Tchad
Limiting the instruction set you use (especially basic, highly optimized instructions like push and pop) by simulating their effect with multiple instructions is not going to get you speed. You are right, I don't actually mind code density, as I think the processors are astonishingly good at fetching instructions. But the compromise we have made means we don't sacrifice the ability to use any part of the instruction set; it just means we are cross-ways with MS's thoughtless stack management. (I've offered a really simple solution in my question, but I doubt MS will ever do it.) – Elegist
Even much more prominent software vendors like Parallels are publicly complaining MS won't let them into the kernel. That said, does your model allow for recoverable CPU-level exceptions? In other words, what are the costs of stack space clobbering by the kernel - just the inability to get a good crash dump? Also, on x86_64 there's a bunch of extra registers; just sayin'. :) Also, implement a register-based calling convention - this will reduce the need for PUSH considerably. – Tchad
Additionally, think of this. The need for a valid ESP-based stack stems from the way x86 processes interrupts, including hardware ones. Anything above ESP is fair game, since an interrupt can come at any time. When you move parameters and save registers on an artificial stack, you don't need the stack pointer to be consistent all the time. And static offsets from the frame pointer can be calculated at compile time. In other words, the case for PUSH/POP is not as urgent as it is with the real stack, the one that interrupts come in on. – Tchad
At this point, with a lot of careful engineering, yes, just the inability to get a good crash dump, and not even that if we compile in debug mode. Anything above ESP... you mean below? I understand your solution and why an artificial stack doesn't suffer from Microsoft's abuse :-} It is unfortunate, but on x64 I'll probably have to use MS's calling convention to provide interoperability, and therefore continue with the ESP problem. Agreed, lots less need for push and pop. – Elegist
Mental models vary on whether the stack grows up or down :) According to Raymond Chen, one should be able to code-switch between both. – Tchad
Intel doesn't seem to be listening to Raymond Chen. – Elegist
I didn't really answer your question about "recoverable CPU-level exceptions". Yes, by making our stack frames artificially big when we are doing floating point, we have enough stack space for MS to do its dirty deed on an FP trap and come out unscathed on the other side. So PARLANSE has (FP) DivisionByZero and (FP) Overflow exceptions that applications can catch and use in the expected kind of way. This works because the compiler can easily tell whether a PARLANSE code block is doing floating point, and add the necessary stack slop for that code. – Elegist
Yossi Kreinin of Proper Fixation fame writes the same as I do: at some point, rolling your own context management becomes justified. – Tchad
Nice link. We're focused on efficient fine-grain SMP parallelism for the big nonnumeric computations we do. PARLANSE might turn out to be good at waiting, too. Internally, it can already build millions of events and have millions of PARLANSE grains wait on them, efficiently. Externally, life is harder, as we multiplex PARLANSE grains on top of OS threads, and so we'd run out of steam at the OS thread limit... it just means we need more state saving. – Elegist
And that's the kind of system Yossi is writing about. His point is: you can extend the logical parallelism way beyond the OS-imposed reasonable thread count, at the cost of having to roll your own context management, which may or may not be stack-based. – Tchad

If Windows uses the x86 hardware mechanism to implement its trap code, you need ring 0 access (via a driver or an API) to change which gate is used for traps.

An x86 gate points to one of:

  • an interrupt address (code segment + offset pointer), which is called while the whole register context, including the return address, is pushed on the current stack (= current ESP), or
  • a task descriptor, which switches to another task (which can be looked upon as a hardware-supported thread). All relevant data is pushed onto the stack (ESP) of that task instead.

You of course want the latter. I would look at how Wine implemented it; that might prove more effective than asking Google.

My guess is that you unfortunately need to implement a driver to get it working on x86, and according to Wikipedia it is impossible for drivers to change it on the IA64 platform. The second-best option might be to interleave space in your stacks, so that a context push from a trap always fits.

Alpheus answered 15/6, 2009 at 12:44 Comment(5)
I can look at Wine, but I'm not sure what I'll learn regarding Windows. First, Wine runs under Linux; there's no specific reason to believe its OS calls can be used for Windows. Secondly, there's no specific reason to believe that Windows will let me take control of the hardware interrupt gate or task descriptor. (But miracles might occur; I'll go look... are you telling me that I can get access thru a standard MS API? Which one? Or are you suggesting I build a driver and cheat?) – Elegist
Your assumption that the complete context is pushed for an int handler is wrong. The only things that are guaranteed to lie on the stack are: errorCode (optional), eip, code segment selector, eflags, esp, and stack segment selector (in this order). You cannot change this behaviour because it's hard-wired in the CPU. – Salesin
Right, the hardware has to push some context. And this modest amount is fine; I can always include that in the padding required for my stack frames. There are machine instructions for storing the FP context; carefully done, it can be stored in any large-enough buffer, including on the stack. But the hardware isn't pushing the FP context on my stack. Windows seems to be doing it. From my point of view, it doesn't matter whether hardware or Windows does it, if it gets pushed and my stack frame is small. What does matter is whether I can get Windows to not push the FP context. – Elegist
Well, as I said, you can change what is pushed additionally by re-implementing the respective interrupt handlers; the rest cannot be changed. Of course, Windows will need to save the complete context by itself, otherwise it wouldn't be possible for a usermode exception handler to retrieve the thread context (and possibly modify it and have it applied on the next thread schedule). – Salesin
Quick comment -- while Wine can be compiled for Windows (supposedly), IIRC it runs completely in user mode, so I don't think looking at its code would help. – Ambition

I ran out of space in the comment box...

Anyways, I'm not sure where the vector points; I was basing the comment off of SDD's answer and the mention of "KiUserExceptionDispatcher"... except that upon further searching (http://www.nynaeve.net/?p=201) it looks like at this point it might be too late.

SIDT can be executed in ring 3... this will reveal the location of the interrupt table, and you may be able to load the segment and at least read the contents of the table. With any luck you can then read the entry for (for example) vector 0 (divide by zero), and read the contents of the handler.
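
For example (a 32-bit MSVC sketch; SIDT itself is unprivileged on this era of hardware, but the table it reveals lives in kernel address space, so actually reading the entries from ring 3 is another matter):

    #include <stdio.h>

    #pragma pack(push, 1)
    typedef struct {
        unsigned short limit;    /* size of the IDT minus 1 */
        unsigned long  base;     /* linear address of the IDT */
    } IDTR32;
    #pragma pack(pop)

    int main(void)
    {
        IDTR32 idtr;
        __asm { sidt idtr }      /* store the IDT register (32-bit MSVC) */
        printf("IDT base=%08lx limit=%04x\n", idtr.base, idtr.limit);
        return 0;
    }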

At this point I'd try to match hex bytes to identify the system file the code belongs to, but there may be a better way to determine which file it is (it's not necessarily a DLL; it could be win32k.sys, or it could be dynamically generated, who knows). I don't know if there's a way to dump the physical memory layout from user mode.

If all else fails, you could either set up a kernel-mode debugger or emulate Windows (Bochs), where you can view the interrupt tables and memory layout directly. Then you could trace until the point the CONTEXT is pushed, and look for an opportunity to gain control before that happens.

Ambition answered 31/8, 2009 at 22:49 Comment(1)
I really, really don't want to patch the kernel code. I just want MS to let me ask to put the context into a buffer I provide, rather than jamming it down my current stack's throat. – Elegist

Windows exception handling is called SEH. IIRC you can disable it, but the runtime of the language you are using might not like it.

Damick answered 15/6, 2009 at 8:34 Comment(3)
I know about SEH, and we set that up to point to our exception trap handler. How does one disable it, and where does a hardware trap go then? The runtime of the language I'm using is completely under my control. Much of the parallel language runtime is implemented in C, but the software cleverly switches stacks from the cactus-style stack to a standard MS "big" stack when running such code; I could switch exception handlers, too, if it would solve my stack overflow problem. – Elegist
If you disable SEH, your app crashes on a divide-by-zero. And if you could somehow disable exceptions, what would you expect the CPU to do on a divide-by-zero... triple-fault? – Ambition
I didn't disable SEH; I merely set it to point to my handler. By the time my handler gets control, Windows has already pushed the full context onto the stack. – Elegist
