Releasing multiple locks without causing priority inversion

Short version: How do I release multiple locks from a single thread, without being preempted halfway through?

I have a program which is designed to run on an N-core machine. It consists of one main thread and N worker threads. Each thread (including the main thread) has a semaphore it can block on. Normally, each worker thread is blocked on decrementing its semaphore, and the main thread is running. Every now and then, though, the main thread should wake up the worker threads to do their thing for a certain amount of time, then block on its own semaphore waiting for them all to go back to sleep. Like so:

def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_semaphore = semaphore(0)

    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_semaphore[i].increment() # wake up worker i
        for i in workers_to_wake:
            main_semaphore.decrement() # wait for all workers

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        ...do some work...
        main_semaphore.increment() # report done with step

All well and good. The problem is, one of the woken workers may end up preempting the main thread halfway through waking the workers: This can happen, for instance, when the Windows scheduler decides to boost that worker's priority. This doesn't lead to deadlock, but it is inefficient, because the remainder of the threads stay asleep until the preempting worker finishes its work. It's basically priority inversion, with the main thread waiting on one of the workers, and some of the worker threads waiting on the main thread.

I can probably figure out OS- and scheduler-specific hacks for this, such as disabling priority boosting under Windows, and fiddling about with thread priorities and processor affinities, but I'd like something cross-platform-ish and robust and clean. So: How can I wake up a bunch of threads atomically?

Sacksen answered 16/6, 2016 at 18:38 Comment(18)
The premise of this question is false. It's not inefficient. This thread is ready to run, so it would run unless the system cannot accommodate any more ready to run threads. If the system cannot accommodate any more ready to run threads, there is no rush to make more threads ready to run.Content
@DavidSchwartz only on systems with shared run queues. Systems with per-core run queues can end up running lower-priority threads instead of scheduling the main thread or currently unblocked worker threads. Even if thread affinities are set up to mitigate this, the main thread could still share affinity with one of the worker threads.Sacksen
Well, sure, if you set things up badly they'll run badly. The solution is to avoid setting things up badly, not to try to mitigate the harm of a bad setup. All you have to do is not paint yourself into a corner. The defaults handle these kinds of things nearly perfectly.Content
@DavidSchwartz I don't really have a ton of control over other processes running on the user's system, or on the user's OS's scheduler.Sacksen
Then you should just assume it is sensible, since almost all schedulers are. If the user shoots himself in the foot, his foot is going to hurt.Content
@DavidSchwartz This question is based on the observed behavior of a fairly common OS/scheduler, running with the defaults.Sacksen
I think you're misunderstanding the observed behavior then or you've changed some scheduler or affinity settings from their defaults. Windows will not let a core sit idle when there are ready-to-run threads.Content
@DavidSchwartz Not idle, but each core maintains its own run queue, and only work-steals when it's empty, not speculatively to maximize the priority of the running thread. That can lead to background processes being run instead of worker threads. Over time, pooled threads tend to "settle in" to a good core configuration, but since there's N+1 threads for N cores here, that can't reliably happen.Sacksen
I'm kind of puzzled why you didn't ask a question about your actual issue.Content
This is my actual issue. If I could atomically release all the worker threads, there'd be no issue with core contention, because they'd behave like any other worker thread pool. It's the worker-by-worker release process and the N+1 threads that leads to the problem.Sacksen
No, that's not your actual issue. That's your proposed solution to your actual issue. See this link for an explanation of the difference and why that leads to bad questions and unhelpful answers. Please, please read that link. Notice how you said "If I could do Y, I could solve my problem". And "It's X and Y that leads to the problem". All the things you talk about are not the problem itself!Content
I'm aware of the XY problem. This is not an instance of it. My actual issue is the need to release multiple threads without unintended preemption. I described the vagaries of multi-core scheduling because you were convinced that this issue couldn't actually occur in practice, but I'd like a solution which leverages the natural guarantees provided by synchronization primitives, not one which addresses particular schedulers' weaknesses.Sacksen
Out of interest, have you managed to quantify the inefficiency? I'd expect it to be very hard to hit the window where you are looping through your semaphores... And even if you did, the Windows scheduler drops the priority on each time slice, so any boost will vanish almost instantly and so your main thread would get a new slice within tens of milliseconds in this case. If your delay is longer than that, there's something else causing the problem.Carefree
@PeterBrittain It varies. I'd say on 80% of invocations, the threads get scheduled just fine; in the remaining ones, a random number of threads end up waiting. The per-invocation runtimes we're talking about here are on the order of 3 milliseconds, so there generally isn't any time for the quantum to complete.Sacksen
Ok, so I'm guessing you have a seriously loaded system if that 3ms matters... Any reason you're not using a thread pool where each worker just reads from a central work queue? As it stands right now you are regularly blocking your main thread until your slowest operation has completed, which should be having just as much impact on your performance as being preempted.Carefree
@PeterBrittain the objective is time-slicing in a realtime(-ish) graphical application, with each worker processing almost precisely 3ms worth of work during a particular frame. The main thread is supposed to be idle during this.Sacksen
OK - got it! So, you effectively have a desired frame rate and need the main thread to run at that rate, scheduling all your workers at precisely timed intervals so that they have completed their work within the refresh interval. Your workers are not all identical, so you can't just use a single work queue... Right so far? If so, is your main thread actually responsible for assigning work, or is it just a timing mechanism - i.e. just telling each thread when it is safe to run?Carefree
@PeterBrittain that's correct. The main thread does do a bit of coordination in figuring out which threads to step on each frame, but other than that it's just timing.Sacksen
C
3

TL; DR

If you really have to get as much as you can out of your workers, just use an event semaphore, a control block and a barrier instead of your semaphores. Note, however, that this is a more fragile solution, so you need to balance any potential gains against this downside.

Context

First I need to summarize the broader context in our discussion...

You have a Windows graphical application. It has a desired frame rate and so you need the main thread to run at that rate, scheduling all your workers at precisely timed intervals so that they have completed their work within the refresh interval. This means you have very tight constraints on the start and execution times for each thread. In addition, your worker threads are not all identical, so you can't just use a single work queue.

The problem

Like any modern operating system, Windows has a variety of synchronization primitives. However, none of these directly provides a mechanism for notifying multiple primitives at once. Looking through other operating systems, I see a similar pattern; they all provide ways of waiting on multiple primitives, but none provide an atomic way of triggering them.

So what can we do instead? The problems you need to solve are:

  1. Precisely timing the start of all required workers.
  2. Prodding the workers that actually need to run in the next frame.

Options

The most obvious solution for issue 1 is just to use a single event semaphore, but you could also use a read/write lock (by acquiring the write lock after the workers have finished and getting the workers to use a read lock). All other options are no longer atomic and so will need further synchronization to force the threads to do what you want - like lossleader's suggestion for locks inside your semaphores.

But we want an optimal solution that reduces context switches as much as possible, given the tight time constraints on your application, so let's see if either of these can be used to solve problem 2... How can you pick which worker threads should run from the main thread if we just have an event semaphore or read/write lock?

Well... A read/write lock is a great way for one thread to write some critical data to a control block and for many others to read from it. Why not just have a simple array of boolean flags (one for each worker thread) that your main thread updates each frame? Sadly you still need to stop execution of the workers until the timer pops. In short we're back at the semaphore and lock solution again.

However, owing to the nature of your application, you can take one more step. You can rely on the fact that you know your workers are not running outside of your time slicing and use an event semaphore as a crude form of lock instead.

A final optimization (if your environment supports them), is to use a barrier instead of the main semaphore. You know that all n threads need to be idle before you can continue, so just insist on it.

A solution

Applying the above, your pseudo-code would then look something like this:

def main_thread(n):
    main_event = event()
    for i = 1 to n:
        worker_scheduled[i] = False
        spawn_thread(worker_thread, i)
    main_barrier = barrier(n+1)

    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_scheduled[i] = True
        main_event.set()
        main_barrier.enter() # wait for all workers
        main_event.reset()

def worker_thread(i):
    while True:
        main_event.wait()
        if worker_scheduled[i]:
            worker_scheduled[i] = False
            ...do some work...
        main_barrier.enter() # report finished for this frame.
        main_event.reset() # to catch the case that a worker is scheduled before the main thread

Since there is no explicit policing of the worker_scheduled array, this is a much more fragile solution.

I would therefore personally only use it if I had to squeeze every last ounce of processing out of my CPU, but it sounds like that is exactly what you are looking for.
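
For concreteness, a minimal Python rendering of the same sketch using threading.Event and threading.Barrier might look like the following. N_WORKERS, pick_workers_to_wake() and the do-work placeholder are illustrative, and the race window noted above (a fast worker looping around before the event has been reset) carries over unchanged:

import threading

N_WORKERS = 4
main_event = threading.Event()
main_barrier = threading.Barrier(N_WORKERS + 1)   # all workers plus the main thread
worker_scheduled = [False] * N_WORKERS            # the control block

def worker_thread(i):
    while True:
        main_event.wait()                          # wait for the frame to start
        if worker_scheduled[i]:
            worker_scheduled[i] = False
            # ...do this worker's slice of work...
        main_barrier.wait()                        # report finished for this frame
        main_event.clear()                         # mirrors the reset() in the pseudo-code

def main_thread():
    for i in range(N_WORKERS):
        threading.Thread(target=worker_thread, args=(i,), daemon=True).start()
    while True:
        # ...do some work...
        for i in pick_workers_to_wake():           # hypothetical selection function
            worker_scheduled[i] = True
        main_event.set()                           # release every worker at once
        main_barrier.wait()                        # wait for all workers
        main_event.clear()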

Carefree answered 23/6, 2016 at 0:5 Comment(1)
Thinking about it, the semaphores in this solution are just trying to be a poor man's barrier... Let's change the solution!Carefree
T
1

This is not possible when you use multiple synchronization objects (semaphores), because the wake-up algorithm is then inherently O(n). There are a few ways to work around it, though.

release all at once

I'm not sure whether Python has the necessary method (is your question Python-specific?), but generally, semaphores have operations whose argument specifies the number of decrements/increments. Thus, you just put all your threads on the same semaphore and wake them all together. A similar approach is to use a condition variable and notify all waiters.
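
A rough Python sketch of the condition-variable variant; the generation counter here is an illustrative detail added to guard against lost and spurious wakeups (and threading.Semaphore.release also accepts a count argument from Python 3.9 onwards):

import threading

cond = threading.Condition()
generation = 0                          # bumped once per wakeup round

def release_all():
    global generation
    with cond:
        generation += 1                 # start a new round
        cond.notify_all()               # wake every waiting worker at once

def wait_for_round(last_seen):
    with cond:
        while generation == last_seen:  # loop handles spurious wakeups
            cond.wait()
        return generation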

event loops

If you still want to be able to control each thread individually but like the approach of one-to-many notification, try libraries for asynchronous I/O like libuv (and its Python counterpart). Here, you can create one single event that wakes all the threads at once and also create an individual event for each thread, then just wait on both (or more) event objects in an event loop in each thread. Another library is pevents, which implements WaitForMultipleObjects on top of pthreads' condition variables.

delegate waking up

Another approach is to replace your O(n) algorithm with a tree-like algorithm (O(log n)) where each thread wakes up only a fixed number of other threads but delegates to them the job of waking the rest. In the extreme case, the main thread wakes up only one other thread, which then wakes everyone else or starts a chain reaction. This can be useful if you want to reduce latency for the main thread at the expense of the wake-up latencies of the other threads.
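
A rough sketch of the tree-shaped delegation in Python; the fan-out of two children per worker and the fixed numbering are illustrative choices, and the asker's own answer below shows the degenerate chain variant:

import threading

N = 8
worker_semaphores = [threading.Semaphore(0) for _ in range(N)]

def wake_children(i):
    # Worker i delegates the wakeup to workers 2*i+1 and 2*i+2.
    for child in (2 * i + 1, 2 * i + 2):
        if child < N:
            worker_semaphores[child].release()

def worker(i):
    while True:
        worker_semaphores[i].acquire()    # wait to be woken
        wake_children(i)                  # delegate further wakeups first
        # ...do some work...

def release_all_from_main():
    worker_semaphores[0].release()        # one release fans out in O(log n) steps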

Tarsometatarsus answered 20/6, 2016 at 20:13 Comment(4)
it may also be possible to implement such a synchronization object using low-level primitives, but I assume that is out of scope for this questionTarsometatarsus
The trouble with a shared semaphore is that there are particular workers I'd like to wake up, rather than "any N of them". The event approach is ideal, but pthreads doesn't have events, so I'm constrained there. Delegating waking up seems like an interesting idea.... I'll have to think about it more.Sacksen
@Sneftel, What do you mean by pthreads not having events? Pthreads is just a threading implementation. What you need is an async event loop mechanism. On Windows, WaitForMultipleObjects works for you; on Linux and other unixes there are a lot of such mechanisms, starting with the traditional select and leading to epoll and kpoll. What I was pointing to is a library like libuv, which abstracts the OS away, providing the most efficient event loop available on the platform.Tarsometatarsus
Ah, I thought you were talking about resettable event objects (such as those returned by CreateEvent() under windows). I'll take a look at libuv.Sacksen
C
1

Reader/Writer Lock

The solution I would normally use on POSIX systems for a one-to-many relationship is a reader/writer lock. It is a surprise to me that they aren't completely universal, but most languages either implement a version or at least have a package available to implement them on whatever primitives exist, for example, Python's prwlock:

from prwlock import RWLock

def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)
    main_lock = RWLock()

    while True:
        main_lock.acquire_write()
        ...do some work...   
        workers_to_wake = foo()
        # The above acquire could be moved as low as here,
        # depending on how independent the above processing is..            
        for i in workers_to_wake:
            worker_semaphore[i].increment() # wake up worker n

        main_lock.release()


def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        main_lock.acquire_read()
        ...do some work...
        main_lock.release() # report done with step

Barriers

Barriers seem like Python's closest intended built-in mechanism to hold up all the threads until they are all alerted, but:

  1. They are a pretty unusual solution, so they would make your code/experience harder to translate to other languages.

  2. I wouldn't like to use them for this case, where the number of threads to wake keeps changing. Given that your n sounds small, I would be tempted to use a constant Barrier(n) and notify all threads to check whether they are running this cycle. But:

  3. I would be concerned that using a barrier would backfire, since any one thread being held up by something external will hold them all up, and even a scheduler with resource-dependency boosting might miss this relationship. Needing all n to reach the barrier could only make this worse.

Coneflower answered 21/6, 2016 at 22:58 Comment(2)
Clever! Having the worker sleep on its own semaphore and then on the main lock seems like it might be inefficient (since each woken worker could end up being switched in up to twice), but this does seem to solve the preemption problem.Sacksen
@Sneftel, in the worst case it will lead to N context switches to a worker and back.Tarsometatarsus
S
0

Peter Brittain's solution, plus Anton's suggestion of a "tree-like wakeup", led me to another solution: chained wakeups. Basically, rather than the main thread doing all the wakeups, it only wakes up one thread; each thread is then responsible for waking up the next one. The elegant bit here is that there's only ever one newly woken thread waiting to be scheduled, so threads rarely end up switching cores. In fact, this works fine with strict processor affinities, even if one of the worker threads shares affinity with the main thread.

The other thing I did was to use an atomic counter that worker threads decrement before sleeping; that way, only the last one wakes the main thread, so there's also no chance of the main thread being woken several times just to do more semaphore waiting.

workers_to_wake = []
main_semaphore = semaphore(0)
num_woken_workers = atomic_integer()

def main_thread(n):
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)

    while True:
        ...do some work...

        workers_to_wake = foo()
        num_woken_workers.atomic_set(len(workers_to_wake)) # set completion countdown
        one_to_wake = workers_to_wake.pop()
        worker_semaphore[one_to_wake].increment() # wake the first worker
        main_semaphore.decrement() # wait for all workers

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        if workers_to_wake.len() > 0: # more pending wakeups
            one_to_wake = workers_to_wake.pop()
            worker_semaphore[one_to_wake].increment() # wake the next worker

        ...do some work...

        if num_woken_workers.atomic_decrement() == 0: # see whether we're the last one
            main_semaphore.increment() # report all done with step
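
For illustration, a runnable Python approximation of the above. Python has no built-in atomic integer, so a lock-protected counter and list stand in for atomic_integer() and the shared workers_to_wake list (whose pops need to be thread-safe); pick_workers_to_wake() is a hypothetical selection function, and each step is assumed to wake at least one worker:

import threading

N_WORKERS = 4
worker_semaphores = [threading.Semaphore(0) for _ in range(N_WORKERS)]
main_semaphore = threading.Semaphore(0)
state_lock = threading.Lock()             # guards workers_to_wake and remaining
workers_to_wake = []
remaining = 0

def wake_next():
    # Pop and wake the next pending worker, if any.
    with state_lock:
        next_one = workers_to_wake.pop() if workers_to_wake else None
    if next_one is not None:
        worker_semaphores[next_one].release()

def worker_thread(i):
    global remaining
    while True:
        worker_semaphores[i].acquire()    # wait to be woken
        wake_next()                       # chain the wakeup onward
        # ...do some work...
        with state_lock:
            remaining -= 1
            last = (remaining == 0)
        if last:
            main_semaphore.release()      # last worker reports the step done

def main_thread():
    global remaining
    for i in range(N_WORKERS):
        threading.Thread(target=worker_thread, args=(i,), daemon=True).start()
    while True:
        # ...do some work...
        to_wake = pick_workers_to_wake()  # hypothetical selection function
        with state_lock:
            workers_to_wake[:] = to_wake
            remaining = len(to_wake)
        wake_next()                       # release only the first worker
        main_semaphore.acquire()          # wait for the chain to finish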
Sacksen answered 27/6, 2016 at 19:17 Comment(0)
