x86 reserved EFLAGS bit 1 == 0: how can this happen?
I'm using the Win32 API to stop/start/inspect/change thread state. Generally works pretty well. Sometimes it fails, and I'm trying to track down the cause.

I have one thread that is forcing context switches on other threads by:

thread stop
fetch processor state into windows context block
read thread registers from windows context block to my own context block
write thread registers from another context block into windows context block
restart thread

This works remarkably well... but very rarely, context switches seem to fail. (Symptom: my multithreaded system blows sky high, executing in strange places with strange register content.)
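The register-copying part of the steps listed above (steps 3 and 4) can be sketched in portable C. This is only an illustration under stated assumptions: `RegBlock` is a hypothetical stand-in for both the Windows context block and the scheduler's own context block (the real Win32 `CONTEXT` has many more fields), and `context_swap` is a made-up name. The stop, fetch, and restart steps are OS calls and are left out.

```c
#include <stdint.h>

/* Hypothetical register block; NOT the real Win32 CONTEXT layout. */
typedef struct {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp, esp, eip, eflags;
} RegBlock;

/* Steps 3 and 4 of the list: save the suspended thread's registers into
   `outgoing`, then load the registers to run next from `incoming`. */
static void context_swap(RegBlock *os_ctx, RegBlock *outgoing,
                         const RegBlock *incoming)
{
    *outgoing = *os_ctx;    /* step 3: OS block -> own block   */
    *os_ctx   = *incoming;  /* step 4: other block -> OS block */
}
```

After the call, the OS-side block holds the incoming state and the outgoing state is preserved for a later switch back.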

The context control is accomplished by:

if ((suspend_count=SuspendThread(WindowsThreadHandle))==(DWORD)-1)  // returns a DWORD; (DWORD)-1 signals failure
   { printf("TimeSlicer Suspend Thread failure");
      ...
   }
...
Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL | CONTEXT_FLOATING_POINT);
if (!GetThreadContext(WindowsThreadHandle,&Context))
   {   printf("Context fetch failure");
       ...
   }

ContextSwap(&Context); // does the context swap

if (ResumeThread(WindowsThreadHandle)==(DWORD)-1)  // returns a DWORD; (DWORD)-1 signals failure
   {  printf("Thread resume failure");
        ...
   }

None of the print statements ever get executed. I conclude that Windows thinks the context operations all happened reliably.
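A note on checking these return values: `SuspendThread` and `ResumeThread` return a `DWORD`, which is unsigned, and report failure as `(DWORD)-1`. A `< 0` comparison applied directly to such a value is never true, so a failure test written that way stays silent even when the call fails. A minimal portable sketch of the difference, where `DWORD_like` and both helper names are hypothetical stand-ins rather than Win32 declarations:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t DWORD_like;             /* stand-in for the Win32 DWORD */
#define FAILURE_VALUE ((DWORD_like)-1)   /* documented failure return value */

/* An unsigned value is never negative, so this cannot detect failure. */
static bool failed_by_sign(DWORD_like rc)
{
    return rc < 0;
}

/* Comparing against (DWORD)-1 detects the documented failure value. */
static bool failed_by_sentinel(DWORD_like rc)
{
    return rc == FAILURE_VALUE;
}
```

Some compilers warn that the first comparison is always false, which is a useful hint in itself.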

Oh, yes, I do know when a thread being stopped is not computing [e.g., in a system function] and won't attempt to stop/context switch it. I know this because each thread that does anything other-than-computing sets a thread specific "don't touch me" flag, while it is doing other-than-computing. (Device driver programmers will recognize this as the equivalent of "interrupt disable" instructions).

So, I wondered about the reliability of the content of the context block. I added a variety of sanity tests on various register values pulled out of the context block; you can actually decide that ESP is OK (within bounds of the stack area defined in the TIB), PC is in the program that I expect or in a system call, etc. No surprises here.
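Sanity tests of this kind might look roughly like the sketch below. The function names are hypothetical, and the bounds are assumed to come from the TIB's `StackLimit` (lowest address) and `StackBase` (highest address) fields; treating the valid range as half-open is also an assumption.

```c
#include <stdint.h>
#include <stdbool.h>

/* x86 stacks grow downward, so a live ESP is assumed to lie in
   [stack_limit, stack_base). */
static bool esp_in_stack(uint32_t esp, uint32_t stack_limit, uint32_t stack_base)
{
    return esp >= stack_limit && esp < stack_base;
}

/* code_start/code_end describe one hypothetical code region,
   e.g. the program image or a system DLL. */
static bool eip_in_region(uint32_t eip, uint32_t code_start, uint32_t code_end)
{
    return eip >= code_start && eip < code_end;
}
```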

I decided to check that the condition code bits (EFLAGS) were being properly read out; if this were wrong, it would cause a switched task to take a "wrong branch" when its state was restored. So I added the following code to verify that the purported EFLAGS register contains only bits that look like EFLAGS according to the Intel reference manual (http://en.wikipedia.org/wiki/FLAGS_register).

   mov        eax, Context.EFlags[ebx]  ; ebx points to Windows Context block
   mov        ecx, eax                ; check that we seem to have flag bits
   and        ecx, 0FFFEF32Ah         ; where we expect constant flag bits to be
   cmp        ecx, 000000202h         ; expected state of constant flag bits
   je         @f
   breakpoint                         ; trap if unexpected flag bit status
@@:

On my Win 7 AMD Phenom II X6 1090T (hex core), it traps occasionally with a breakpoint, with ECX = 0200h. Fails same way on my Win 7 Intel i7 system. I would ignore this, except it hints the EFLAGS aren't being stored correctly, as I suspected.

According to my reading of the Intel (and also the AMD) reference manuals, bit 1 is reserved and always has the value "1". Not what I see here.
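For reference, the assembly check above can be restated in C with the mask built from named bits. The macro and function names are invented for this sketch; the bit positions follow the standard EFLAGS layout, and the mask and expected value are the ones used in the assembly (0FFFEF32Ah and 0202h).

```c
#include <stdint.h>
#include <stdbool.h>

/* Bits the check treats as constant (names are illustrative only). */
#define FL_RESERVED1   (1u << 1)     /* reserved, documented as always 1 */
#define FL_RESERVED3   (1u << 3)     /* reserved, expected 0 */
#define FL_RESERVED5   (1u << 5)     /* reserved, expected 0 */
#define FL_TF          (1u << 8)     /* trap flag, expected 0 */
#define FL_IF          (1u << 9)     /* interrupt flag, expected 1 */
#define FL_IOPL        (3u << 12)    /* I/O privilege level, expected 0 */
#define FL_NT          (1u << 14)    /* nested task, expected 0 */
#define FL_RESERVED15  (1u << 15)    /* reserved, expected 0 */
#define FL_UPPER       0xFFFE0000u   /* bits 17..31, expected 0 */

#define EFLAGS_CONST_MASK \
    (FL_RESERVED1 | FL_RESERVED3 | FL_RESERVED5 | FL_TF | FL_IF | \
     FL_IOPL | FL_NT | FL_RESERVED15 | FL_UPPER)
#define EFLAGS_CONST_EXPECTED (FL_RESERVED1 | FL_IF)

/* True when all "constant" bits have the value the check expects. */
static bool eflags_looks_sane(uint32_t eflags)
{
    return (eflags & EFLAGS_CONST_MASK) == EFLAGS_CONST_EXPECTED;
}
```

Bit 16 (RF) and the arithmetic flags are deliberately outside the mask, as in the assembly version.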

Obviously, MS fills the context block by doing complicated things on a thread stop. I expect them to store the state accurately. This bit isn't stored correctly. If they don't store this bit correctly, what else don't they store?

Any explanations for why the value of this bit could/should be zero sometimes?

EDIT: My code dumps the registers and the stack on catching a breakpoint.

The stack area contains the context block as a local variable. Both EAX, and the value in the stack at the proper offset for EFLAGS in the context block contain the value 0244h. So the value in the context block really is wrong.
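Decoding 0244h bit by bit suggests it is otherwise a plausible flags word: PF, ZF and IF are set, and only the reserved bit 1 is missing. A small decoder sketch (the function name and the "R1" label for bit 1 are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Writes the names of the set flags into buf, separated by spaces.
   buflen must be at least 1. */
static void decode_eflags(uint32_t eflags, char *buf, size_t buflen)
{
    static const struct { uint32_t bit; const char *name; } flags[] = {
        {1u << 0, "CF"}, {1u << 1, "R1"}, {1u << 2, "PF"}, {1u << 4, "AF"},
        {1u << 6, "ZF"}, {1u << 7, "SF"}, {1u << 8, "TF"}, {1u << 9, "IF"},
        {1u << 10, "DF"}, {1u << 11, "OF"},
    };
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if (eflags & flags[i].bit) {
            if (buf[0] != '\0')
                strncat(buf, " ", buflen - strlen(buf) - 1);
            strncat(buf, flags[i].name, buflen - strlen(buf) - 1);
        }
    }
}
```

With this sketch, 0244h decodes to "PF ZF IF", while the expected form 0246h decodes to "R1 PF ZF IF".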

EDIT2: I changed the mask and comparison values to

and        ecx, 0FFFEF328h         ; was 0FFFEF32Ah, where we expect flag bits to be
cmp        ecx, 000000200h   

This seems to run reliably with no complaints. Apparently Win7 doesn't do bit 1 of eflags right, and it appears not to matter.
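The strict and relaxed checks differ only in bit 1, which a short C sketch can make explicit. The macro and function names are hypothetical; the constants are the ones from the two assembly versions.

```c
#include <stdint.h>
#include <stdbool.h>

#define STRICT_MASK    0xFFFEF32Au   /* original mask, includes bit 1 */
#define STRICT_VALUE   0x00000202u   /* bit 1 and IF expected set     */
#define RELAXED_MASK   0xFFFEF328u   /* EDIT2 mask, bit 1 ignored     */
#define RELAXED_VALUE  0x00000200u   /* only IF expected set          */

static bool strict_ok(uint32_t f)  { return (f & STRICT_MASK)  == STRICT_VALUE; }
static bool relaxed_ok(uint32_t f) { return (f & RELAXED_MASK) == RELAXED_VALUE; }
```

The relaxed form accepts both 0244h and 0246h but still rejects a word with, say, the trap flag set.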

Still interested in an explanation, but apparently this is not the source of my occasional context switch crash.

Juliajulian answered 1/4, 2014 at 5:44 Comment(13)
+1 just for err.. 'courage and bravery'.Gert
Check if CONTEXT_CONTROL (bit 0) is set in the ContextFlags field.Weld
@MartinJames: this is actually the runtime system for a parallel programming language. The context switching provides a modicum of "fairness" to the many logical execution grains being multiplexed by the grain scheduler on top of the 1-per-CPU threads available to it, courtesy of the MS OS. Yes, this is pushing the boundary of sane :-} See this question for other crazy MS behaviours I have encountered building parallel execution engines: #9462059Juliajulian
@IgorSkochinsky: You forced :-} me to drop the context switching code into the question. Yes, the CONTEXT_CONTROL flag is there. Thanks for thinking out of the box. More of that, please!Juliajulian
Are you reinventing fibers BTW?Weld
No. I don't think you can timeslice a fiber (well, using this technique I suppose one could, but MS doesn't offer this). Fibers (AFAIK) can't be stolen to be run on an idle CPU, which our parallel language implementation does. Fibers can't wait for other fibers. Fibers can't abort one another. Fibers still have to live with the "big stack" design of Windows (see https://mcmap.net/q/299021/-how-does-a-stackless-language-work). I think I'm doing something fibers really can't do.Juliajulian
Downvoter: you could have the courtesy to explain why you downvoted.Juliajulian
If you use the kernel debugger and use the .thread command to set the register context to the thread in question, does the r command's register dump agree with what you expect or what you're getting from GetThreadContext()?Freefloating
@MichaelBurr: I've never used the kernel debugger; I stay out of the kernel or anything resembling that. So I can't answer your question. As a thought experiment, I'd expect it to work just fine, especially statistically. The failure rate that I see is something like 1 in 10,000 context switch events. A more interesting experiment (maybe the one you intended?) would be to do that experiment when I hit the breakpoint, and see if I get nonsense behavior there. How do I find out about the kernel debugger?Juliajulian
@IraBaxter: The Debugging Tools for Windows are now part of the SDK install (you can select the debuggers specifically - you don't have to install the whole SDK): msdn.microsoft.com/en-us/library/windows/hardware/ff551063.aspx You might be able to get the information about the thread's register state using one of the standard debuggers (cdb, ntsd or windbg) instead of needing the kernel debugger kd (windbg can be used as a kernel debugger, too). The Debugging Tools come with a great help file, debugger.chm, that is well worth reading.Freefloating
Russinovich's "Inside Windows" books have great info on how to use the debugging tools for digging for system level information. His sysinternals site also has a livekd tool to let you perform some limited kernel debugging on a 'live system' without having to set up a serial, USB, or Firewire link between a host and target as you normally do for kernel debugging. Another alternative is to use a VMware guest as the kernel debugging target: msdn.microsoft.com/en-us/library/windows/hardware/ff538143.aspxFreefloating
Do you get the same behavior on actual x86 hardware? I've definitely seen emulators take liberties with various register flags.Coppola
No clue what the actual hardware does; that's hard for me to get to, and the problem is the bit comes up wrong only once in a great while. You need a statistical test. Yes, I've seen MS, trying to emulate an OS, take all kinds of liberties. (Don't get me started on how "SuspendThread" can say "Oops, I didn't do it" for a running thread.)Juliajulian

Microsoft has a long history of squirreling away a few bits in places that aren't really used. Raymond Chen has given plenty of examples, e.g. using the low bit(s) of a pointer to data that is aligned to more than one byte, where those bits are always zero.

In this case, Windows might have needed to store some of its thread context in an existing CONTEXT structure, and decided to use an otherwise unused bit in EFLAGS. You couldn't do anything with that bit anyway, and Windows will get that bit back when you call SetThreadContext.
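For readers unfamiliar with the pointer trick mentioned above, here is a generic sketch of it. It illustrates the general technique only and is not code from Windows; the function names are made up, and it assumes the pointee is at least 2-byte aligned so that bit 0 of its address is always zero.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stores a one-bit flag in the otherwise unused low bit of the address. */
static uintptr_t tag_pointer(void *p, bool flag)
{
    return (uintptr_t)p | (flag ? 1u : 0u);
}

/* Recovers the original pointer by clearing the tag bit. */
static void *untag_pointer(uintptr_t tagged)
{
    return (void *)(tagged & ~(uintptr_t)1);
}

/* Reads the stored flag back. */
static bool pointer_tag(uintptr_t tagged)
{
    return (tagged & 1u) != 0;
}
```

The tagged value must never be dereferenced directly; whoever holds it has to strip the tag first, which is the same kind of contract the answer speculates about for the spare EFLAGS bit.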

Renaerenaissance answered 2/4, 2014 at 15:36 Comment(2)
So, here's a fragile design idea. Store an undocumented bit critical to the correct operation of threads, in an unused bit in the EFLAGS register in a CONTEXT block. Assume that the user program won't change it. (MS doesn't make that assumption for any other register in the CONTEXT block, or indeed for any of the bits in EFLAGS changed by the processor: Z,O,P,N,DIR, ... I know because I change those a lot and things seem to work fine). Now the user changes that critical bit; the critical function must now fail in an undocumented way. If I had a programmer that did that, I'd have him shot.Juliajulian
... Lots of people use lower bits of non-byte aligned pointers for things. That's at least obvious (nonzero lower bits) and it's OK if documented. This seems pretty different. [Don't get me wrong; I appreciate your feedback].Juliajulian