How to achieve lock-free, but blocking behavior?

6

16

I'm implementing a lock-free single producer single consumer queue for an intensive network application. I have a bunch of worker threads receiving work in their own separate queues, which they then dequeue and process.

Removing the locks from these queues has greatly improved performance under high load, but the queues no longer block when they are empty, which in turn causes CPU usage to skyrocket.

How can I efficiently cause a thread to block until it can successfully dequeue something or is killed/interrupted?

Counterblow answered 22/5, 2011 at 18:34 Comment(3)
Hi, can you share with me the rps (requests per second) you achieved using this approach? I did a similar type of work (implementing a simple HTTP server), so I am interested to know. I don't know how to contact you other than commenting here. Sorry if I bothered you.Kennel
@Kennel Performance was alright. RPS isn't a good unit for measuring performance due to different hardware setups, etc. I redesigned the application to allow worker threads to operate in complete isolation, and the performance gain was ~10x. Sharing less data was truly the key.Counterblow
Can you explain why you chose an approach of one queue per worker? Sounds pretty suboptimal to me. Execution time of jobs on the queues is hard to foresee.Grandam
16

If you're on Linux, look into using a Futex. It provides the performance of a non-locking implementation by using atomic operations rather than kernel calls (as a mutex would), but should it need to put the thread to sleep because some condition doesn't hold (e.g. lock contention), it then makes the appropriate kernel calls to sleep and later wake on a future event. It's basically a very fast semaphore.

Lattermost answered 22/5, 2011 at 18:44 Comment(4)
Helpful clarification! I upvoted both Futex-related answers for now. Thanks.Counterblow
+1 for futex. Its performance is not as good as lock-free, but it is good enough, and it is the perfect choice when mutex locking is too much. The pthread mutex API uses futex under the hood.Fallible
Mutexes on Linux are implemented with a short cmpxchg spin for the low contention case, and falling back to a futex call. I don't really understand why you're calling it non-locking when it implements locking (fast userspace mutex - the origin of the name).Freeway
I'm calling it non-locking because of the low-contention case which is what "typically" occurs unless you're under high load ...Lattermost
9

On Linux, futex can be used to block a thread. But be aware that Futexes Are Tricky!

UPDATE: condition variables are much safer to use than futexes, and are more portable. However, a condition variable is used in combination with a mutex, so strictly speaking the result will no longer be lock-free. But if your primary goal is performance (rather than a guarantee of global progress), and the locked portion (i.e. the condition to check after a thread wakes up) is small, you may well get satisfactory results without having to delve into the subtleties of integrating futexes into the algorithm.

Derma answered 22/5, 2011 at 18:45 Comment(2)
This looks very interesting. I will explore this shortly and return with my verdict. Thank you.Counterblow
The futex call is an implementation of locking. The mutex API just spins on cmpxchg a bit and then falls back to futex (fast userspace mutex).Freeway
7

If you're on Windows, you won't be able to use futexes, but Windows Vista has a similar mechanism called Keyed Events. Unfortunately, this isn't part of the published API (it's an NTDLL native API), but you can use it as long as you accept the caveat that it might change in future versions of Windows (and you don't need to run on pre-Vista kernels). Be sure to read the article I linked above. Here's an untested sketch of how it might work:

/* Interlocked SList queue using keyed event signaling */

typedef struct queue {
    SLIST_HEADER slist;
    // Note: Multiple queues can (and should) share a keyed event handle
    HANDLE keyed_event;
    // Initial value: 0
    // Prior to blocking, the queue_pop function increments this to 1, then
    // rechecks the queue. If it finds an item, it attempts to cmpxchg back to
    // 0; if this fails, it's racing with a push and has to block
    LONG block_flag;
} queue;

void init_queue(queue *qPtr) {
    NtCreateKeyedEvent(&qPtr->keyed_event, -1, NULL, 0);
    InitializeSListHead(&qPtr->slist);
    qPtr->block_flag = 0;
}

void queue_push(queue *qPtr, SLIST_ENTRY *entry) {
    InterlockedPushEntrySList(&qPtr->slist, entry);

    // Transition block flag 1 -> 0. If this succeeds (block flag was 1), we
    // have committed to a keyed-event handshake
    LONG oldv = InterlockedCompareExchange(&qPtr->block_flag, 0, 1);
    if (oldv) {
        NtReleaseKeyedEvent(qPtr->keyed_event, (PVOID)qPtr, FALSE, NULL);
    }
}

SLIST_ENTRY *queue_pop(queue *qPtr) {
    SLIST_ENTRY *entry = InterlockedPopEntrySList(&qPtr->slist);
    if (entry)
        return entry; // fast path

    // Transition block flag 0 -> 1. We must recheck the queue after this point
    // in case we race with queue_push; however since ReleaseKeyedEvent
    // blocks until it is matched up with a wait, we must perform the wait if
    // queue_push sees us
    LONG oldv = InterlockedCompareExchange(&qPtr->block_flag, 1, 0);

    assert(oldv == 0);

    entry = InterlockedPopEntrySList(&qPtr->slist);
    if (entry) {
        // Try to abort
        oldv = InterlockedCompareExchange(&qPtr->block_flag, 0, 1);
        if (oldv == 1)
            return entry; // nobody saw us, we can just exit with the value
    }

    // Either we don't have an entry, or we are forced to wait because
    // queue_push saw our block flag. So do the wait
    NtWaitForKeyedEvent(qPtr->keyed_event, (PVOID)qPtr, FALSE, NULL);
    // block_flag has been reset by queue_push

    if (!entry)
        entry = InterlockedPopEntrySList(&qPtr->slist);
    assert(entry);

    return entry;
}

You could also build a similar protocol using Slim Reader/Writer (SRW) Locks and Condition Variables, with a lockless fast path. These are wrappers over keyed events, so they may incur more overhead than using keyed events directly.

Unicameral answered 23/5, 2011 at 1:40 Comment(3)
Yeah, but the futex answers were already taken and I thought it might be useful for someone else searching on the subject later :)Unicameral
For completeness, this is interesting--I do occasionally develop for Windows. :-)Counterblow
+1, Also keyed events work fine on pre-Vista, too (they were initially implemented as a single-global-handle-for-all backoff for an out-of-handle deadlock situation with critical sections under 2k). The only difference between 2k/XP and Vista/7/8 is an implementation detail (linked list vs. hash) which makes Vista and later KEs much more efficient if you watch many memory locations with a single handle (no practical difference for 99% of all applications).Porscheporsena
1

Have you tried condition variables? When the queue becomes empty, just start waiting for a new job. The thread putting jobs in the queue should signal the waiting thread. This way you only use locks when the queue is empty.

https://computing.llnl.gov/tutorials/pthreads/#ConditionVariables

Daft answered 22/5, 2011 at 18:49 Comment(7)
Any calls made using pthread functions will make a call into the kernel, which will be significantly slower than atomic operations. The beauty of the futex is that you're using a combination of atomics and a wait-queue at the kernel level, but if there is no lock contention, then you never touch the kernel-level wait queue. On the other hand, using condition variables automatically places the thread in a wait-queue, so with condition variables, you're going through the kernel no matter what, which is going to be a lot slower if 90% of the time an atomic operation would be fine.Lattermost
@Jason: pthread mutexes on Linux do not enter the kernel in the non-contended case; neither does a cvar signal with no waiters. Of course, a cvar wait must enter the kernel, but so must any blocking operationUnicameral
@bdonlan: Thanks for the info ... but if that's the case, then what makes a futex faster than a normal pthread mutex? According to the paper Fuss, futexes and furwocks: Fast Userlevel Locking in Linux by Franke, Russell, and Kirkwood, the reasoning behind the futex was to avoid a kernel call if there was no lock contention, i.e., when that paper was written, at least the impression I got from reading it was that mutexes and semaphores always required a kernel call. If they never accessed the kernel then what was the point of the futex?Lattermost
@Jason, The futex is the underlying primitive used to create NPTL pthread mutexes. It's a bit complex and difficult to use futex() safely on its own, so the pthread library wraps up some of the more common futex protocols, and calls them mutexes and condition variables. It's also more portable this way - futex() is a Linux-only syscall.Unicameral
@bdolan: Appreciate the clarification. Does that mean though that futexes and pthread mutexes are the same speed in Linux, and there's no benefit to using futexes since mutexes are futexes? Or is there a still a performance penalty for mutexes, and if so, from what?Lattermost
A mutex is implemented with a short spin to handle the low-contention case, and then a system call via futex. It doesn't make sense to use futex directly if you don't know what you're doing.Freeway
@Lattermost There's really no significant advantage of using futex on Linux if all you need is a pthread_mutex. pthread_mutex is written in assembly language and even avoids saving CPU register contents on the stack when calling futex because it knows that futex does not modify them.Grandam
1

You can cause a thread to sleep by using the sigwait() function. You can wake the thread with pthread_kill. This is much faster than condition variables.

Amatol answered 15/6, 2011 at 20:15 Comment(1)
1) Please provide a reference for your claim that a signal-based mechanism will be "much faster" than a condition variable. In the case that the thread has to wake up, in both cases the scheduling and caching aspects play a huge role. 2) Please provide an outline of how to handle race conditions and the case that the thread does not go to sleep but rather picks up an entry from the non-empty queue. In practice, this fast path is the key factor to scalability/performance under load. And it's getting a bit tricky if you have a mixed concept.Grandam
0

You could add sleeps while it's waiting. Just pick the biggest wait you're willing to have, then do something like this (pseudocode because I don't remember pthread syntax):

WAIT_TIME_MS = 100; // Set this to whatever you're happy with
while (loop_condition) {
    thing = get_from_queue();
    if (thing == null) {
        sleep_ms(WAIT_TIME_MS);
    } else {
        handle(thing);
    }
}

Even something short like a 100 ms sleep should significantly lower the CPU usage. I'm not sure at what point the context switching will make it worse than busy waiting though.

Braid answered 22/5, 2011 at 18:40 Comment(4)
I considered this but the sleep time would have to be ridiculously low as the application is expected to complete requests within a millisecond or two.Counterblow
This doesn't solve the problem as elegantly as other answers.Dragster
@Dragster "Not the best" is usually not a good reason for a downvote, but do what you want. I realize that this solution isn't the best one for the OP (as they comment above), but Stack Overflow answers also come up in search results and this is a good solution in some cases (bursty traffic where you need no locks while handling results, but don't want to waste CPU time when nothing is happening).Braid
It might be nice to mention in the answer the places where this solution is a good one.Dragster

© 2022 - 2024 — McMap. All rights reserved.