When is a divide by zero not a divide by zero? A puzzle in the debugger (static variable issues)
Asked Answered
F

2

15

I'm very confused and I think my debugger is lying to me. I have the following loop in my code:

MyClass::UploadFile(CString strFile)
{
  ...
  static DWORD dwLockWaitTime = EngKey::GetDWORD(DNENG_SERVER_UPLOAD_LOCK_WAIT_TIME, DNENG_SERVER_UPLOAD_LOCK_WAIT_TIME_DEFAULT);
  static DWORD dwLockPollInterval = EngKey::GetDWORD(DNENG_SERVER_UPLOAD_LOCK_POLL_INTERVAL, DNENG_SERVER_UPLOAD_LOCK_POLL_INTERVAL_DEFAULT);

  LONGLONG llReturnedOffset(0LL);
  BOOL bLocked(FALSE);
  for (DWORD sanity = 0; (sanity == 0 || status == RESUMABLE_FILE_LOCKED) && sanity < (dwLockWaitTime / dwLockPollInterval); sanity++) 
    {
      ...

This loop has been executed hundreds of times during the course of my program and the two static variables are not changed anywhere in the code, they're written to just once when they're statically initialized and read from in the loop conditions and in one other place. Since they're user settings which are read from the Windows registry they almost always have the constant values of dwLockWaitTime = 60 and dwLockPollInterval = 5. So the loop is always doing 60 / 5.

Very rarely, I get a crash dump which shows that this line of code has thrown a division by zero error. I've checked what WinDbg says and it shows:

FAULTING_IP: 
procname!CServerAgent::ResumableUpload+54a [serveragent.cpp @ 725]
00000001`3f72d74a f73570151c00    div     eax,dword ptr [proc!dwLockPollInterval (00000001`3f8eecc0)]

EXCEPTION_RECORD:  ffffffffffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 000000013f72d74a (proc!CServerAgent::ResumableUpload+0x000000000000054a)
   ExceptionCode: c0000094 (Integer divide-by-zero)
  ExceptionFlags: 00000000
NumberParameters: 0

ERROR_CODE: (NTSTATUS) 0xc0000094 - {EXCEPTION}  Integer division by zero.

I've checked the assembler code and it shows that the crash occurred on this div instruction.

00000001`3f72d744 8b0572151c00    mov     eax,dword ptr [dwLockWaitTime (00000001`3f8eecbc)]
00000001`3f72d74a f73570151c00    div     eax,dword ptr [dwLockPollInterval (00000001`3f8eecc0)]

So as you can see the value at 000000013f8eecbc was moved into eax and then eax was divided by the value at 000000013f8eecc0.

What is at those two values you ask?

0:048> dd 00000001`3f8eecbc
00000001`3f8eecbc  0000003c 00000005 00000001 00000000
00000001`3f8eeccc  00000000 00000002 00000000 00000000
00000001`3f8eecdc  00000000 7fffffff a9ad25cf 7fffffff
00000001`3f8eecec  a9ad25cf 00000000 00000000 00000000
00000001`3f8eecfc  00000000 00000000 00000000 00000000
00000001`3f8eed0c  00000000 00000000 00000000 00000000
00000001`3f8eed1c  00000000 00000000 00000000 00000000
00000001`3f8eed2c  00000000 00000000 00000000 00000000
0:048> dd 000000013f8eecc0
00000001`3f8eecc0  00000005 00000001 00000000 00000000
00000001`3f8eecd0  00000002 00000000 00000000 00000000
00000001`3f8eece0  7fffffff a9ad25cf 7fffffff a9ad25cf
00000001`3f8eecf0  00000000 00000000 00000000 00000000
00000001`3f8eed00  00000000 00000000 00000000 00000000
00000001`3f8eed10  00000000 00000000 00000000 00000000
00000001`3f8eed20  00000000 00000000 00000000 00000000
00000001`3f8eed30  00000000 00000000 00000000 00000000

The constants 60 and 5 exactly as I'd expect. So where's the divide by zero??? Is my debugger lying? Surely the divide by zero has been thrown by the hardware so it can't have made a mistake about that? And if it was a divide by zero in a different place in my code what are the odds that the debugger would show the instruction pointer in exactly this place? I confess, I'm stumped..

Fairlie answered 22/1, 2015 at 19:20 Comment(28)
The fact that you see the expected value in the minidump does not prove that it had the expected value when the instruction executed. Pretty classic behavior of a threading race for example, a minidump is not an instantaneous snapshot that's recorded at the exact time the exception was thrown. Soft RAM errors are another source of these kind of unexplainable mishaps. So if you are sure that threading is not a factor then best advice to give to the customer is to replace his RAM modules.Kaseykasha
If you call the loop as part of some global constructor which could get called then it could end up dividing the two variables before they're initialized to their needed value (and probably default-initialized to 0 by the executable image loader prior to your compiler's bootstrap code is run prior to main() which would actually set their needed value).Decolorant
So two points regarding that line of thinking. If what's in the dump (a full dump by the way rather than mini) is not correct, it would have to be out of date relative to what was executed rather than the other way around. That means that if those two values were corrupted, something would have had to have put them back to the expected values afterwards...Fairlie
Oh and second point.. I've had this from three different machines now. So I don't think it's a RAM problem.Fairlie
As for the global constructor thing, that's not the case. This is part of an upload file function. It's only invoked when a user happens to upload a file, so it's not a static order type problem..Fairlie
Does the code ever re-read the registry settings and reset the static variables?Heiney
No, they're just statically initialized the first time the function is called.Fairlie
could EngKey::GetDWORD() return 0 to you? if so maybe you could code it like `static DWORD dwLockWaitTime = EngKey::GetDWORD(...) ? 60 : EngKey::GetDWORD(...)Word
@Fairlie - So the loop is always doing 60 / 5. This may not answer your question, but why not precompute this value and store in a variable, and then use that variable in the loop?Wholewheat
@Fairlie - Also, If you haven't done so already, I think it's time to think about logging these (important) values (dwLockWaitTime, dwLockPollInterval) instead of trying to figure out what they are from a minidump.Wholewheat
@Word - Yes in theory GetDWORD could return 0 if the user sets a crazy value in the registry (this is a very specific setting so we don't tell many of them about it in the first place) But the debugger seems to show that it's been set to it's default value.Fairlie
@Wholewheat - Both good points, I might well consider doing this if I get desperate. However, right now I'm reluctant to roll out a new client to customers that won't actually fix their problem. So I'm banging my head against it (and inviting all of you to join me) :-) (In actual fact the values are logged but the customers never have logging on when this occurs because it's very rare)Fairlie
@Fairlie - The keyword static has several meanings. Depending on where this code is situatated can make all the difference. Is this code snippet part of a class member function? If so, is your program multithreaded, and if so, can this function can be called by more than one thread?Wholewheat
@Wholewheat - Yes it is a member function, and yes it can be called by at least 4 threads at once.. It should still only get initialized once though right?Fairlie
@Fairlie - What compiler are you using? I think you've got a problem, given your description.Wholewheat
It's visual studio 2013... can it be the compiler though? The assembler looks ok..Fairlie
@Fairlie - The issue that I am referring to is that your function is not thread-safe in versions of the C++ specification below C++11. Static variables still had initialization issues wrt threads. The C++11 was supposed to fix this, but don't know if VS 2013 adheres to it or not.Wholewheat
Ah.. that sounds interesting. I'll look into it thanks. So in theory then.. if 2 threads initialized those statics at the same time, one could have got divide by zero and then another could have put the correct values in place to confound me.. :-)Fairlie
@Fairlie - I am certain that your code is not thread-safe using VS 2010 and below, and possibly with early versions of VS 2013. Make sure you get the latest service pack for VS 2013, since it is supposed to adhere to the new "static variable initialization within a function" rules for C++ 11.Wholewheat
@Fairlie - Yes, the static variables could get all out of whack on multiple threads. That's the bottom line. But again, this is solved by C++ 11 -- now the question is whether VS 2013 still has issues with this.Wholewheat
@Fairlie - Please see this for further information: #8102625Wholewheat
Ahh.. This type of initialization is called a "magic static" in Microsoft's feature list for Visual Studio and it's listed as not implemented in VS2013: msdn.microsoft.com/en-us/library/hh567368.aspxFairlie
But it is listed as fixed in VS2015 preview: visualstudio.com/en-us/news/vs2015-preview-vs.aspxFairlie
@Wholewheat If you want to turn your comments into an answer, I'll happily mark it as the correct one.Fairlie
If the values never change, why not mark them const? Future readers of the code will thank you, and the compiler will be able to complain about more obvious attempts to change them.Fontanel
@AlanStokes - Quite right. If I'd have written in in the first place they would be const ;-)Fairlie
@Fairlie - ok, I placed my comments as an answer. It looks like you'll have to synchronize the variables for VS 2013.Wholewheat
@HansPassant - with regard to the dump not containing the expected value: I would expect a procdump -e 1 (dump on first chance exception) to contain the expected value or wouldn't it? If it would, OP could use that in the future to get better dumps.Hu
W
10

Since the code is part of a member function, and you're calling this function from multiple threads, the static variables are not thread-safe if using a compiler that does not conform to C++ 11 standards. Thus you may get data races when initializing those two static variables.

For a C++ 11 standard conforming compiler, static variables are now guaranteed to be initialized by the first thread, while subsequent threads wait until the static is initialized.

For Visual Studio 2010 and below, static local variables are not guaranteed to be thread safe, since these compilers conform to the C++ 03 and C++ 98 standard.

For Visual Studio 2013, I am not sure of the level of C++ 11 support in terms of static local initialization. Therefore, for Visual Studio 2013, you may have to use proper synchronization to ensure that static local variables are initialized correctly.

For Visual Studio 2015, this item has been addressed and proper static local initialization is fully implemented, so the code you currently have should work correctly for VS 2015 and above.


Edit: For Visual Studio 2013, static local thread-safe initialization is not implemented ("Magic Statics"), as described here.

Therefore, we can cautiously verify that the reason for the original problem is the static-local initialization issue and threading. So the solution (if you want to stick with VS 2013) is to use proper synchronization, or redesign your application so that static variables are no longer needed.

Wholewheat answered 22/1, 2015 at 21:3 Comment(3)
Do you own Studio Studio 2010 and could you write an SSCCE and post the binary somewhere?Ending
I use VS 2008 and VS 2013. I skipped over VS 2010. However, since these compilers adhere to C++ 03 standards (and below), there is no guarantee of static locals being initialized by multiple threads in a thread-safe manner. Given that Microsoft admits having this finally fixed in Visual Studio 2015, we can assume that any code in VS 2010 that uses static local variables, and have this code called from multiple threads, is not thread safe.Wholewheat
I have no doubts that you're right, I'd just like to debug it myself and perhaps teach others debugging it in more detail, e.g. by looking at the assembly code. I guess it could be done with the Express Edition, so I'll look for a copy of that.Ending
K
7

The problem may be related to multithreading.

  1. A thread enters the function
  2. Checks the hidden "is_initialized" static variable to see if initialization has already been performed
  3. The var is 0, so it sets the variable to 1 and proceeds reading the registry
  4. At this point another thread enters the function
  5. The second thread sees the variables as already initialized and skips the initialization code
  6. The division is performed when the denominator is still 0 (the first thread is still reading the registry)
  7. The program crashes, but in the meanwhile the first thread completes execution, setting the variables that you see in the dump.
  8. You lose sleep thinking how the impossible happened
Kiangsu answered 22/1, 2015 at 20:57 Comment(3)
Do I understand correctly that it can happen on a multi processor CPU only and cannot happen on a single core CPU due to thread scheduling? If so, could setting the affinity of the program to a single CPU prove that?Ending
@ThomasW: reading from the registry is a potentially very long operation. It's not impossible that during it even a single-core CPU will terminate the timeslice for the first thread (20ms IIRC on windows) and will schedule the second thread triggering the problem.Kiangsu
Yes, the problem will still occur, but on a single CPU, wouldn't the exception stop the process so that the debugger shows the correct value (0 in this case)?Ending

© 2022 - 2024 — McMap. All rights reserved.