Using Boost.Lockfree queue is slower than using mutexes

Until now I was using std::queue in my project. I measured the average time that a specific operation on this queue requires.

The times were measured on 2 machines: My local Ubuntu VM and a remote server. Using std::queue, the average was almost the same on both machines: ~750 microseconds.

Then I "upgraded" the std::queue to boost::lockfree::spsc_queue, so I could get rid of the mutexes protecting the queue. On my local VM I could see a huge performance gain, the average is now on 200 microseconds. On the remote machine however, the average went up to 800 microseconds, which is slower than it was before.

First I thought this might be because the remote machine might not support the lock-free implementation:

From the Boost.Lockfree page:

Not all hardware supports the same set of atomic instructions. If it is not available in hardware, it can be emulated in software using guards. However this has the obvious drawback of losing the lock-free property.

To find out if these instructions are supported, boost::lockfree::queue has a method called bool is_lock_free(void) const;. However, boost::lockfree::spsc_queue does not have a function like this, which, for me, implies that it does not rely on the hardware and that it is always lock-free - on any machine.
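
For reference, here is a small sketch (not part of the original question) of how lock-freedom can be probed on a given machine: boost::lockfree::queue exposes is_lock_free(), and since spsc_queue's read/write indices are essentially atomic size_t counters, checking a std::atomic<std::size_t> gives at least a rough hint:

// Sketch, not from the question: probing lock-freedom at runtime.
#include <atomic>
#include <cstddef>
#include <iostream>
#include <boost/lockfree/queue.hpp>

int main()
{
    boost::lockfree::queue<int> mpmc(128);
    std::cout << "boost::lockfree::queue lock-free: "
              << mpmc.is_lock_free() << '\n';

    // spsc_queue has no is_lock_free(), but its read/write indices are
    // atomic size_t counters, so this is a reasonable proxy check.
    std::atomic<std::size_t> index{0};
    std::cout << "atomic<size_t> lock-free: " << index.is_lock_free() << '\n';
    return 0;
}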

What could be the reason for the performance loss?


Example code (Producer/Consumer)

// c++11 compiler and boost library required

#include <iostream>
#include <cstdlib>
#include <chrono>
#include <future> // std::async lives here (there is no <async> header)
#include <atomic> // stop flag signalled by the consumer
#include <thread>
/* Using blocking queue:
 * #include <mutex>
 * #include <queue>
 */
#include <boost/lockfree/spsc_queue.hpp>


boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;

// Set by the consumer once it has found its value, so the producer loop
// (and with it the program) can terminate.
std::atomic<bool> done{false};

/* Using blocking queue:
 * std::queue<int> queue;
 * std::mutex mutex;
 */

int main()
{
    // The queue and the stop flag are globals, so they are not captured.
    // spsc_queue is non-copyable, so a by-value capture would not compile anyway.
    auto producer = std::async(std::launch::async, []()
    {
        // Producing data in a random interval
        while(!done)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */

            // Push random int (0-9999)
            queue.push(std::rand() % 10000);

            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */

            // Sleep for random duration (0-999 microseconds)
            std::this_thread::sleep_for(std::chrono::microseconds(rand() % 1000));
        }
    });

    auto consumer = std::async(std::launch::async, []()
    {
        // Example operation on the queue.
        // Checks if 1234 was generated by the producer, returns if found.

        while(true)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */

            int value;
            while(queue.pop(value))
            {
                if(value == 1234)
                    return;
            }

            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */

            // Sleep for 100 microseconds
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    });

    consumer.get();
    done = true; // stop the producer so its future can finish and be destroyed
    std::cout << "1234 was generated!" << std::endl;
    return 0;
}
Vermin answered 21/4, 2017 at 10:57 Comment(8)
Please consider adding a minimal reproducible example that allows others to reproduce your performance measurements. That would allow for a more practical answer. – Narrows
Given the high interest in this question, it is really unfortunate that the core aspect - the performance discrepancy on the two different systems - cannot be answered. I think there is much more potential for a specific, practical answer if the question were improved. – Narrows
@Narrows I will try to add a concrete example soon. – Vermin
@Narrows I added some example code. – Vermin
The example likely suffers from bad performance because rand should not be used in a multithreaded environment. Use C++11 <random> or rand_r instead, see also. (A short sketch follows this comment thread.) How do you time the code, and how does it correspond to your performance observations? – Narrows
@Narrows Oh, I did not know about that. The real code does not use rand(). Actually, the producer receives updates from a websocket server (that's what I was trying to model with rand()). I will rewrite the example this evening. Do you think it would be better for a minimal example to include the "real" code, i.e. websocket connections etc., or should I try to model it like I did before? – Vermin
It would certainly be better in a modeled way without websockets. That way it is much easier to reproduce. However, the example must exhibit the performance anomaly you observe. Please make sure to include a description of the performance observation with the example, as well as the involved systems. – Narrows
Your assumption that the absence of is_lock_free implies anything is incorrect. In the implementation we can see that they all use atomic operations, which directly depend on CPU support. – Vladi
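
Regarding the rand() remark above, here is a minimal sketch (not from the original post; the helper name random_payload is made up) of a thread-safe replacement using C++11 <random>, with one generator per thread:

// Sketch: per-thread random engine, avoiding the shared state inside rand().
#include <random>

int random_payload()
{
    // thread_local gives every thread its own engine and distribution,
    // so there is no data race and no hidden lock.
    thread_local std::mt19937 engine{std::random_device{}()};
    thread_local std::uniform_int_distribution<int> dist{0, 9999};
    return dist(engine);
}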

Lock free algorithms generally perform more poorly than lock-based algorithms. That's a key reason they're not used nearly as frequently.

The problem with lock free algorithms is that they maximize contention by allowing contending threads to continue to contend. Locks avoid contention by de-scheduling contending threads. Lock free algorithms, to a first approximation, should only be used when it's not possible to de-schedule contending threads. That only rarely applies to application-level code.

Let me give you a very extreme hypothetical. Imagine four threads are running on a typical, modern dual-core CPU. Threads A1 and A2 are manipulating collection A. Threads B1 and B2 are manipulating collection B.

First, let's imagine the collection uses locks. That will mean that if threads A1 and A2 (or B1 and B2) try to run at the same time, one of them will get blocked by the lock. So, very quickly, one A thread and one B thread will be running. These threads will run very quickly and will not contend. Any time threads try to contend, the conflicting thread will get de-scheduled. Yay.

Now, imagine the collection uses no locks. Now, threads A1 and A2 can run at the same time. This will cause constant contention. Cache lines for the collection will ping-pong between the two cores. Inter-core buses may get saturated. Performance will be awful.

Again, this is highly exaggerated. But you get the idea. You want to avoid contention, not suffer through as much of it as possible.

However, now run this thought experiment again where A1 and A2 are the only threads on the entire system. Now, the lock free collection is probably better (though you may find that it's better just to have one thread in that case!).

Almost every programmer goes through a phase where they think that locks are bad and avoiding locks makes code go faster. Eventually, they realize that it's contention that makes things slow and locks, used correctly, minimize contention.
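
To make the thought experiment concrete, here is a rough benchmark sketch (not from the original answer; thread and item counts are arbitrary assumptions, and it uses boost::lockfree::queue rather than spsc_queue so that several producers and consumers can actually contend). Running it with different thread counts on machines with different numbers of free cores is one way to observe the effect described above:

// Benchmark sketch (arbitrary parameters): the same produce/consume workload
// with a mutex-protected std::queue and with boost::lockfree::queue.
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <boost/lockfree/queue.hpp>

const int kProducers = 2;               // also the number of consumer threads
const int kItemsPerProducer = 500000;
const int kTotal = kProducers * kItemsPerProducer;

double run_locked()
{
    std::queue<int> q;
    std::mutex m;
    std::atomic<int> consumed(0);
    std::vector<std::thread> threads;
    auto t0 = std::chrono::steady_clock::now();

    for (int p = 0; p < kProducers; ++p)
        threads.emplace_back([&] {
            for (int i = 0; i < kItemsPerProducer; ++i) {
                std::lock_guard<std::mutex> lock(m);
                q.push(i);
            }
        });
    for (int c = 0; c < kProducers; ++c)
        threads.emplace_back([&] {
            while (consumed.load() < kTotal) {
                std::lock_guard<std::mutex> lock(m);
                if (!q.empty()) { q.pop(); ++consumed; }
            }
        });
    for (auto& t : threads) t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

double run_lockfree()
{
    boost::lockfree::queue<int> q(1024);
    std::atomic<int> consumed(0);
    std::vector<std::thread> threads;
    auto t0 = std::chrono::steady_clock::now();

    for (int p = 0; p < kProducers; ++p)
        threads.emplace_back([&] {
            for (int i = 0; i < kItemsPerProducer; ++i)
                while (!q.push(i)) { /* retry if the queue is momentarily full */ }
        });
    for (int c = 0; c < kProducers; ++c)
        threads.emplace_back([&] {
            int value;
            while (consumed.load() < kTotal)
                if (q.pop(value)) ++consumed;
        });
    for (auto& t : threads) t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    std::cout << "mutex + std::queue:     " << run_locked()   << " s\n";
    std::cout << "boost::lockfree::queue: " << run_lockfree() << " s\n";
    return 0;
}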

Jesselyn answered 21/4, 2017 at 11:3 Comment(10)
Hmm, I think I am currently in the same phase where I tend to avoid locks :p. Is there any other resource/book I could read where such topics are covered? – Labiche
@AbhinavGauniyal I get asked that a lot, and I've yet to find one. I've spoken to many people who have gone through the same painful process that I did many years ago. – Jesselyn
You explain how blocking lock implementations with oversubscribed cores can reduce contention for separate locks. I don't think the question gives enough information to know whether this applies here. – Narrows
@Narrows It's just an example. The same thing can happen even if you're not oversubscribed. (See the parenthetical at the end of the second-to-last paragraph.) Nonblocking collections are an optimization that helps in a very narrow set of cases and hurts in the most typical ones. I'm just trying to help someone understand why that is and what locks do to make performance better. (And keep the answer to a reasonable length.) – Jesselyn
The question is about a specific performance observation with lock-free queues vs mutexes. While minding contention is certainly very valid in general, I don't think there is a strong indication of an XY problem in this question. I doubt the first sentence - replacing one generalization with another - and I don't think your argument supports it. Your argument is basically that it's contention that matters, not the specific mechanism to avoid data races, and that blocking lock implementations can alleviate contention in some scenarios. – Narrows
Does your answer give any insight into the observation that the lock-free queue gave a 275% speedup on one system and a 6% slowdown on another? (I do think that this is impossible to answer without wild speculation due to a lack of information in the question.) – Narrows
@Narrows My example shows how the performance of a lock-free algorithm can critically depend on the number of cores. But I suspect something was horribly wrong with the system, implementation, or test code that produced the strangely low performance. – Jesselyn
The important part of the answer is "Any time threads try to contend, the conflicting thread will get de-scheduled" (see also https://mcmap.net/q/302909/-how-pthread_mutex_lock-is-implemented for how pthread_mutex_lock() is implemented on Linux). Effectively, the operating system can run something more useful on the core than a thread waiting to acquire the lock. If the lock were a spinlock instead, this argument would not apply. – Daube
IMHO lock-free data structures shine as an ideal buffering solution when you want to enable thread prioritization without risking priority-inversion problems. Picture a high-priority UDP capture loop. You don't want the producer to block and potentially lose packets if the consumer is preempted while holding the sync lock. This technique is very valuable in .NET, where you can create an unmanaged data-capture thread and bridge the data (using a Boost lock-free container) to a managed thread that will be subject to constant blocking during garbage collection activity. C++/CLI is your friend! – Hettie
(continued…) In these types of scenarios, I think that the possibility of avoiding priority inversions outweighs the cost of slower performance. – Hettie

I cannot say that the Boost lock-free queue is slower in all possible cases. In my experience, push(const T& item) makes a copy. If you are constructing temporary objects and pushing them onto the queue, you take a performance hit. I think the library just needs an overloaded push(T&& item) to handle movable objects more efficiently. Until such a function is added, you might have to push pointers instead of the objects themselves - either plain pointers or the smart ones offered since C++11. This is a rather limited aspect of the queue, and I only use the lock-free queue very rarely.
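
As an illustration of that workaround (the Payload type, the queue, and the helper functions below are made up for the example), one can push raw pointers so that only the pointer is copied into the queue, and hand ownership back to a unique_ptr on the consumer side:

// Sketch: only a pointer is copied into the lock-free queue; the
// expensive-to-copy object itself is never copied.
#include <memory>
#include <string>
#include <boost/lockfree/spsc_queue.hpp>

struct Payload {
    std::string data; // stands in for an expensive-to-copy member
};

boost::lockfree::spsc_queue<Payload*, boost::lockfree::capacity<1024>> ptr_queue;

void produce(std::string text)
{
    std::unique_ptr<Payload> item(new Payload);
    item->data = std::move(text);
    if (ptr_queue.push(item.get()))
        item.release(); // the consumer now owns the object
    // otherwise the queue was full and the unique_ptr frees the object
}

void consume()
{
    Payload* raw = nullptr;
    while (ptr_queue.pop(raw)) {
        std::unique_ptr<Payload> item(raw); // take ownership back
        // ... work with item->data ...
    }
}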

Dennie answered 11/12, 2018 at 17:44 Comment(0)
