ManualResetEvent.WaitOne() doesn't return if Reset() is called immediately after Set()
Asked Answered
F

2

7

I have a problem in a production service which contains a "watchdog" timer used to check whether the main processing job has become frozen (this is related to a COM interop problem which unfortunately can't be reproduced in test).

Here's how it currently works:

  • During processing, the main thread resets a ManualResetEvent, processes a single item (this shouldn't take long), then sets the event. It then continues to process any remaining items.
  • Every 5 minutes, the watchdog calls WaitOne(TimeSpan.FromMinutes(5)) on this event. If the result is false, the service is restarted.
  • Sometimes, during normal operation, the service is being restarted by this watchdog even though processing takes nowhere near 5 minutes.

The cause appears to be that when multiple items await processing, the time between the Set() after the first item is processed, and the Reset() before the second item is processed is too brief, and WaitOne() doesn't appear to recognise that the event has been set.

My understanding of WaitOne() is that the blocked thread is guaranteed to receive a signal when Set() is called, but I assume I'm missing something important.

Note that if I allow a context switch by calling Thread.Sleep(0) after calling Set(), WaitOne() never fails.

Included below is a sample which produces the same behaviour as my production code. WaitOne() sometimes waits for 5 seconds and fails, even though Set() is being called every 800 milliseconds.

private static ManualResetEvent _handle;

private static void Main(string[] args)
{
    _handle = new ManualResetEvent(true);

    ((Action) PeriodicWait).BeginInvoke(null, null);
    ((Action) PeriodicSignal).BeginInvoke(null, null);

    Console.ReadLine();
}

private static void PeriodicWait()
{
    Stopwatch stopwatch = new Stopwatch();

    while (true)
    {
        stopwatch.Restart();
        bool result = _handle.WaitOne(5000, false);
        stopwatch.Stop();
        Console.WriteLine("After WaitOne: {0}. Waited for {1}ms", result ? "success" : "failure",
                            stopwatch.ElapsedMilliseconds);
        SpinWait.SpinUntil(() => false, 1000);
    }
}

private static void PeriodicSignal()
{
    while (true)
    {
        _handle.Reset();
        Console.WriteLine("After Reset");
        SpinWait.SpinUntil(() => false, 800);
        _handle.Set();
        // Uncommenting either of the lines below prevents the problem
        //Console.WriteLine("After Set");
        //Thread.Sleep(0);
    }
}

output of the above code


The Question

While I understand that calling Set() closely followed by Reset() doesn't guarantee that all blocked threads will resume, is it also not guaranteed that any waiting threads will be released?

Fussbudget answered 20/3, 2013 at 0:45 Comment(12)
Thread.Sleep(0) doesn't necessarily cause a context switch--it will only switch to same-priority threads. Thread.Sleep(1) will switch to different priority threads, if any are waiting. See bluebytesoftware.com/blog/…Roberson
@PeterRitchie, that's why I said "allow a context switch"! In this case, it definitely makes a difference!Denise
The docs also say "If the Reset method is called before all the threads have resumed execution, the remaining threads once again block" msdn.microsoft.com/en-us/library/ksb7zs2x(v=vs.95).aspx. It also details that autoreset events arent' guarenteed to unblock a thread for roughly the same conditions (msdn.microsoft.com/en-us/library/…)Roberson
My typical usage of manual reset events is one thread calls Set and another thread calls Reset. calling Set then immediately calling Reset doesn't sound like a good idea based on the documentation.Roberson
@PeterRitchie, but isn't it reasonable to expect that at least one thread will be released? When reading the documentation, I took it to mean that I couldn't guarantee all threads would unblock, not that I couldn't guarantee any...Denise
no, the is fundamentally works in quantums. It can't steal time from a user thread to schedule another thread. It allocates and schedules threads at the quantum frequency. If you toggle a signal within that quantum it may never "see" it.Roberson
This has come up before. It seems as though the OS state machine is broken. Set makes the thread/s ready but does not seem to remove them from the MRE waiting queue until they actually become running. If a reset occurs while a thread is still ready, and so not yet running, it is set back to 'waiting' again. I do not think that this behaviour is sane, but that's the way it is.Fatalism
IMHO, calling set on an MRE should make all currently waiting threads ready and remove them from the MRE waiting queue so that a subsequent reset cannot change their state back from ready to waiting. I cannot think of any reason for the OS/MRE behaving in the way it does. It's just silly.Fatalism
@PeterRitchie - is this correct? If a signal makes another thread ready, why can the thread not be made running during the call, assuming that a core is available for it at its current priority, (maybe plus a temporary priority boost). If a currently running thread needs to be preempted during the call, why not? What has the 'quantum' got to do with it? It would seem to me that not doing this defeats the primary reason for using a premptive scheduler in the first place, (high IO performance).Fatalism
@MartinJames It can't be done during the call because that's not an OS thread, that's a user thread. The application created that to do it's work, not the OS's. Windows is a scheduled pre-emptive operating system--when thread get time is done on a schedule. I second the recommendation on "Concurrent Programming on Windows", you'll gain a much better understanding on what's really happening.Roberson
@PeterRitchie - well, the set call should exit via the scheduler, like any normal system call that can change the state of threads. What could be the justification for deferring the full implementation of any such call until later?Fatalism
@PeterRitchie, just for future reference, Sleep(1) is no longer required in OS versions after Server 2003.Denise
J
14

No, this is fundamentally broken code. There are only reasonable odds that the WaitOne() will complete when you keep the MRE set for such a short amount of time. Windows favors releasing a thread that's blocked on an event. But this will drastically fail when the thread isn't waiting. Or the scheduler picks another thread instead, one that runs with a higher priority and also got unblocked. Could be a kernel thread for example. MRE doesn't keep a "memory" of having been signaled and not yet waited on.

Neither Sleep(0) or Sleep(1) are good enough to guarantee that the wait is going to complete, there's no reasonable upper bound on how often the waiting thread could be bypassed by the scheduler. Although you probably ought to shut down the program when it takes longer than 10 seconds ;)

You'll need to do this differently. A simple way is to rely on the worker to eventually set the event. So reset it before you start waiting:

private static void PeriodicWait() {
    Stopwatch stopwatch = new Stopwatch();

    while (true) {
        stopwatch.Restart();
        _handle.Reset();
        bool result = _handle.WaitOne(5000);
        stopwatch.Stop();
        Console.WriteLine("After WaitOne: {0}. Waited for {1}ms", result ? "success" : "failure",
                            stopwatch.ElapsedMilliseconds);
    }
}

private static void PeriodicSignal() {
    while (true) {
        _handle.Set();
        Thread.Sleep(800);   // Simulate work
    }
}
Justicz answered 20/3, 2013 at 1:30 Comment(0)
H
8

You can't "pulse" an OS event like this.

Among other issues, there's the fact that any OS thread performing a blocking wait on an OS handle can be temporarily interrupted by a kernel-mode APC; when the APC finishes, the thread resumes waiting. If the pulse happened during that interruption, the thread doesn't see it. This is just one example of how "pulses" can be missed (described in detail in Concurrent Programming on Windows, page 231).

BTW, this does mean that the PulseEvent Win32 API is completely broken.

In a .NET environment with managed threads, there's even more possibility of missing a pulse. Garbage collection, etc.

In your case, I would consider switching to an AutoResetEvent which is repeatedly Set by the working process and (automatically) reset by the watchdog process each time its Wait completes. And you'd probably want to "tame" the watchdog by only having it check every minute or so.

Hued answered 20/3, 2013 at 1:38 Comment(4)
At the risk of necro, I'm not sure if this still (the same/an) issue in 2024 (Windows 10, .net framework 4.8.1). On the original question there is a comment that the behavior of Sleep(1) changed. The OP's code now works fine, and the link to the PulseEvent no longer works. That being said, events can be missed if immediately set/reset, indicating there is some sort of race condition occurring.Aney
I fixed the link to the PulseEvent blog post. Years ago MS broke all their blog post links, which was really annoying. I don't know if the Windows kernel changed its behavior; I haven't heard anything claiming that it has.Hued
Is there a general way to fix the links that you could share?Aney
@smaudet: I usually remember which article it links to, and then Google for that. Sometimes I've used the Wayback Machine, but the old ones aren't always in there.Hued

© 2022 - 2024 — McMap. All rights reserved.