fastest possible way to pass data from one thread to another
I'm using boost spsc_queue to move data from one thread to another. It's one of the critical places in my software, so I want it to be as fast as possible. I wrote this test program:

#include <boost/lockfree/spsc_queue.hpp>
#include <stdint.h>
#include <unistd.h>  // usleep

#include <chrono>
#include <cstdio>
#include <thread>

const int N_TESTS = 1000;

int results[N_TESTS];

boost::lockfree::spsc_queue<int64_t, boost::lockfree::capacity<1024>> testQueue;

using std::chrono::nanoseconds;
using std::chrono::duration_cast;

int totalQueueNano(0);
int totalQueueCount(0);

void Consumer() {
    int i = 0;
    int64_t scheduledAt;
    while (i < N_TESTS) {
        while (testQueue.pop(scheduledAt)) {
            int64_t dequeuedAt = (duration_cast<nanoseconds>(
                    std::chrono::high_resolution_clock::now().time_since_epoch())).count();
            auto diff = dequeuedAt - scheduledAt;
            totalQueueNano += diff;
            ++totalQueueCount;
            results[i] = static_cast<int>(diff);
            ++i;
        }
    }
    for (int i = 0; i < N_TESTS; i++) {
        printf("%d ", results[i]);
    }
    printf("\nspsc_queue latency average nano = %d\n", totalQueueNano / totalQueueCount);
}

int main() {
    std::thread t(Consumer);
    usleep(1000000);
    for (int i = 0; i < N_TESTS; i++) {
        usleep(1000);
        int64_t scheduledAt = (duration_cast<nanoseconds>(
                std::chrono::high_resolution_clock::now().time_since_epoch())).count();
        testQueue.push(scheduledAt);
    }
    usleep(1000000);
    t.join();
    return 0;
}

Compile flags:

g++ -std=c++0x -O3 -Wall -c -fmessage-length=0 -march=native -mtune=native -pthread -MMD -MP -MF"src/TestProject.d" -MT"src/TestProject.d" -o "src/TestProject.o" "../src/TestProject.cpp"

g++ -pthread -o "TestProject"  ./src/TestProject.o   -lpthread

On my machine (RHEL 7.1, gcc 4.8.3, Xeon E5-2690 v3) I get 290-300 nanoseconds.

  • How good is my test application? Am I measuring the "spsc_queue" latency correctly?
  • What is the current industry-best time to pass data from one thread to another?
  • Is boost spsc_queue a good choice for moving data from one thread to another?
  • Can you recommend something faster than spsc_queue?
  • Can you write code that does the same work significantly faster?

upd: a queue mechanism is required. If the first thread produces data every 1000 nanoseconds, but the second thread spends 10 000 nanoseconds processing a single item, I need to "queue" several items for a short period of time. But my "queue" is never "too big"; a fixed-size short ring buffer should be enough.

upd2: So in short, the question is: what is the fastest single-producer single-consumer queue (most likely based on a fixed-size ring buffer)? I'm using boost spsc_queue and achieve ~300 ns latency; can you suggest something faster?

upd3: in the Java world there is the Disruptor, which achieves 50 ns latency: https://code.google.com/p/disruptor/wiki/PerformanceResults Do we have something in C++ with the same 50 ns latency?

Destinee answered 8/4, 2015 at 6:42 Comment(12)
Often the fastest way to pass data is to use a single thread for each chunk of data. That is to say, use only the parallelism present in the data.Filigree
In your benchmark the startup of the consumer thread could be included in the measured latency. Better to wait until the thread has started. Also, an average is susceptible to spikes. Store each measured latency and output them after the test to manually check for any patterns.Cotquean
I've updated the example: added usleep to ensure the Consumer thread is ready, and I now print all values to the console. All my results are still valid and the same.Destinee
I guess you are aware that you are not actually moving (as in: copying bits) anything from one thread to another? It's just the synchronization/notification latency for the consumer you're interested in? So is your question about minimizing cross-thread notification latency?Carangid
Also, good info in the answers there: #13021195Carangid
@MartinBa testQueue.push(i) copies the int into the spsc_queue; testQueue.pop(i) reads the value. It seems it returns a reference to internal storage.Destinee
@javapowered - Nah, of course both push and pop copy some bits to get the value into and out of the queue object, but what I meant to say was that you're not copying from any thread to any other thread. All the data just resides in the free store, accessible to both threads (in principle) as long as proper synchronization is used.Carangid
@MartinBa give me an example where you "copy from one thread to another"Destinee
@javapowered - Can't give you an example, because there isn't such a thing as copying from one thread to another when you look at the process memory. By definition (mostly) threads share the same flat free store, and you don't need to copy anything; you just need to get the synchronization right.Carangid
@MartinBa I'm not a hardware expert, but different threads may be executed on different cores. Each core has its own cache, which needs to be "refreshed" sometimes, i.e. copied from main memory. Anyway, I no longer understand the point :)Destinee
I would tweak your test: instead of having a separate array for timestamps, simply push the timestamp into the queue; then when you pop on the other side, you know the delay for that entry directly... This will be the minimal amount of time it takes for the consumer to pick up something the producer has produced...Vicereine
@Vicereine Agreed, good advice. I updated the code and my question; however, I still have ~300 nanoseconds latency.Destinee
C
6

Since you have ints, what you (ideally) measure above is the overall latency from a call to push() to the time pop() returns true.

This doesn't make sense: the consumer thread is busily polling the queue, that is, it loops and busily checks whether pop has fetched a value.

  • This is wasteful, and
  • if you want to minimize latency, polling is certainly not the way to go

If (IFF) you want to minimize latency (for a single item), my guess would be to use a signaling synchronization mechanism; spsc_queue, as far as I can tell, does not provide for this. (You'd need a container or a custom solution where you employ some kind of condition variable / Event, ...)

If (IFF), however, you want to maximize throughput (items per unit time), then measuring the latency for a "wakeup" of a (single) item makes even less sense. In that case you want to make the best use of the parallelism you have, as mentioned in a comment:

Often the fastest way to pass data is to use a single thread for each chunk of data. That is to say, use only the parallelism present in the data.


Addressing your bullet points:

  • How good is the test app: I do not think it makes much sense.

    • Having scheduledAt in an atomic is required, as you write it from one thread and read it from another. Otherwise you have UB.
    • Obviously any measurement difference w.r.t. this is purely a measurement error and doesn't say anything about the inherent latency. (You could try putting an aggregate struct { int val; int64_t time; }; into the queue, thereby avoiding the atomic fence.)
  • Current industry best time: no clue. Not sure anyone cares about this. (Maybe inside some kernel stuff?)

  • Choice of spsc_queue: I don't think it is a good choice, because it requires polling.

  • Faster than spsc_queue? See above: use non-polling notification.

  • Write code that does the same work significantly faster? No. Or rather, I won't. =>

To quote Coumarone's answer:

  1. you define the problem and select an appropriate synchronization mechanism

The problem with your question is that there is no problem definition.

As far as I am concerned, in the context of a user-land process on a regular OS, cross-thread notification latency seems utterly irrelevant. What is your use case?

Carangid answered 8/4, 2015 at 11:18 Comment(5)
"busily polling the queue" is by design. I have a 12-core machine and I can spend several cores on "busy polling" in hotspots. I need to be able to store several items: if the first thread produces data every 1000 nanoseconds, but the second thread spends 10 000 nanoseconds processing a single item, I need to "queue" several items for a short period of time. That's why I'm using spsc_queue. But 99.999% of the time spsc_queue is empty and the consumer just busily polls it.Destinee
You might think you have cores to spare, but, as I said, my suspicion is that not polling should have lower latency. (I may be wrong; only you can measure this for your case.)Carangid
problem definition - I need the fastest single-producer single-consumer queueDestinee
@javapowered - that's not a problem definition. A problem definition is: "I want to do X, therefore I think I need a fast notification+queue."Carangid
I want a fast spsc queue, therefore I think I need a fast spsc queueDestinee
C
2

First of all, writing such a test program is completely useless: you don't do any work with the data, so the results are skewed. Second, your test uses usleep() between pushes; at this rate you could use any kind of synchronization primitive. It also seems that your Consumer() never exits...

The way you implement such a thing is the following:

  1. you define the problem and select an appropriate synchronization mechanism
  2. you implement the software
  3. you profile the software to identify potential hotspots
  4. you optimize based on the results from the previous step and repeat.

You need some prior experience for the first step, or you can try implementing different approaches and see what works best.

Coumarone answered 8/4, 2015 at 7:13 Comment(5)
how would you measure the latency of spsc_queue?Destinee
@javapowered In this case the latency depends on how many threads are using the queue simultaneously. It is not something you should care about unless your profiler shows it to be a problem.Coumarone
it's clear from my question that there are only 2 threads: one that produces data and one that consumes it. The case with more than 2 threads is much more complicatedDestinee
@javapowered as I've already said, you don't get any meaningful result from your test application. Whatever you provided in the question clearly doesn't work. Regarding the latency, it is complicated to measure (it depends on many factors). The way you do it is also problematic, because there is no guarantee that the threads actually run in parallel (as in the provided code) or that no context switch happens.Coumarone
if you think my application is not correct, then modify it to make it correct :)Destinee
A
0

It depends on the semantics of the application and how many threads are involved. So far you're looking at raw latency. With more threads, scaling might also start to be an interesting metric.

For the two-threaded case, atomic updates to a single location, preferably in a cache line that's not being touched by any other operations, could be faster, if what you're doing with the retrieved data allows it.

Alwin answered 8/4, 2015 at 12:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.