Futex throughput on Linux

I have an async API which wraps an IO library. The library uses C-style callbacks and the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build the API, something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, more precisely the futex used to implement promise/future. Since the futex is, AFAIK, user space and the fastest mechanism I know of to sync two threads, I switched to raw futexes, which improved the situation somewhat, but not drastically: the throughput floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article, Futex Scaling for Multi-core Systems, which quite matches the effect I observe with futexes.

My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.

Bottom line: a simple, user-space synchronization primitive, shared between two threads only; only one thread sets the completion, and only one thread waits for it.
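
For reference, the raw-futex path I switched to looks roughly like this: a "binary event" with one signaller and one waiter. This is only a sketch; the class and method names are illustrative, not the actual code, and error handling is omitted (Linux only).

```cpp
#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class FutexEvent {
public:
    // Completion side: set the flag, then wake the single waiter (if any).
    void signal() {
        state_.store(1, std::memory_order_release);
        syscall(SYS_futex, addr(), FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
    }

    // Waiting side: sleep in the kernel until the flag becomes non-zero.
    void wait() {
        while (state_.load(std::memory_order_acquire) == 0) {
            // Returns immediately (EAGAIN) if the value is already != 0.
            syscall(SYS_futex, addr(), FUTEX_WAIT_PRIVATE, 0, nullptr, nullptr, 0);
        }
        state_.store(0, std::memory_order_relaxed);  // reset for the next IO
    }

private:
    uint32_t* addr() { return reinterpret_cast<uint32_t*>(&state_); }
    std::atomic<uint32_t> state_{0};
};
```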

EDIT001: What if... Previously I said no spinning in busy wait. But a futex already spins in busy wait, right? Its implementation just covers the more general case, which requires a global hash table to hold the futexes, queues for all waiters, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics, no global data structures, and busy-wait on it the way the futex already does?
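
A rough sketch of the hybrid I have in mind: spin briefly in user space, then fall back to FUTEX_WAIT only if the completion has not arrived. The spin count is an arbitrary placeholder, _mm_pause is x86-specific, and the flag is an atomic to avoid a data race in C++ (an acquire load compiles to a plain load on x86 anyway).

```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>       // _mm_pause (x86); substitute a yield elsewhere
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Spin for a bounded number of iterations, then sleep in the kernel.
// kSpins is an arbitrary placeholder; tune it against your IO latency.
constexpr int kSpins = 4096;

void wait_hybrid(std::atomic<uint32_t>& flag) {
    for (int i = 0; i < kSpins; ++i) {
        if (flag.load(std::memory_order_acquire) != 0) return;  // fast path
        _mm_pause();  // hint to the CPU that this is a spin-wait loop
    }
    // Slow path: no completion within the spin budget, go to sleep.
    while (flag.load(std::memory_order_acquire) == 0) {
        syscall(SYS_futex, reinterpret_cast<uint32_t*>(&flag),
                FUTEX_WAIT_PRIVATE, 0, nullptr, nullptr, 0);
    }
}
```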

Zenobiazeolite answered 3/4, 2018 at 12:18 Comment(6)
I think that the futex is about it. However, you could try to make your threads more independent, or let threads process more before passing work to other threads. Without knowing your program it's hard to guess, and since it's a high-load situation it's probably also hard even knowing the software.Linguistic
Right, the futex (it sounds like) is about it; however, if you check the link, you may find that the futex has a problem scaling on multi-NUMA-node systems because of futex hash table contention. So the futex naturally limits my application. The mysterious qspinlock sounds more suitable, but I didn't find anything useful about it.Zenobiazeolite
It seems like you're using libuv. An event loop should never block. Not on a mutex, not on a futex, not on IO. You can use uv_async_t to share data between loops (i.e. threads). Therefore, you're better off with standard C-style callbacks.Groark
Close call :) It's SPDK: there is a tight loop processing the completion queue, and on outstanding-IO completion a callback is called which (let's pretend I use promise/future) calls set_value on the promise; the caller side holding the future gets notified when get is called, no blocking, and everyone is happy, right?Zenobiazeolite
Did you solve the problem? If so could you please post the answer?Ebert
@SomeName in this particular case we are looking to switch from preemptive multithreading to cooperative. Once you use something like boost::fibers the synchronization is much cheaper. Coroutines will do too. However, in some cases (like ours) it will require massive code refactoring.Zenobiazeolite
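
As a rough illustration of the boost::fibers point above (not code from this thread): when both sides are fibers, future::wait() suspends a fiber cooperatively instead of parking the thread on a futex. Link with -lboost_fiber -lboost_context.

```cpp
#include <boost/fiber/all.hpp>

int main() {
    boost::fibers::promise<void> done;
    boost::fibers::future<void> completion = done.get_future();

    // Stand-in for the IO completion path; in a real port this would be
    // driven by the completion-queue polling loop, not called directly.
    boost::fibers::fiber completer([&done] { done.set_value(); });

    completion.wait();  // suspends only this fiber; no kernel futex wait
    completer.join();
}
```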

In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.

When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.

Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.

Then a similar thing happens again when the response comes. The continuous CPU wake-ups and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance, as the CPUs no longer have to wake up and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core; it will do more than 1 million 'pings'/sec in this case.
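
To reproduce that pinned setup, each thread can pin itself to the same core id; something like this (Linux/glibc, pthread_setaffinity_np; the helper name is mine):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for pthread_setaffinity_np on glibc
#endif
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core. Call from both the waker and the
// wakee with the same core id so they share a core.
bool pin_self_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```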

So isn't there a way of saying 'put this thread to sleep and then wake that one', so that the OS could run the wakee on the core the now-sleeping waker just vacated? Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly that, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar; however, at the time of writing this has yet to be merged into the kernel.

Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback and pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others (with TCP you get about 100k pings per second, about half the speed of a futex implementation). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.

(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker as well, then every time you post the futex you need to pin the wakee to the waker's current core, and the system call to do this has too much overhead.)
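
For completeness, that DIY approach looked roughly like this (a reconstruction, not the exact code): the waker migrates the wakee onto its own current core just before posting the futex, and the per-wake affinity syscall is where the overhead comes from.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for pthread_setaffinity_np / sched_getcpu on glibc
#endif
#include <pthread.h>
#include <sched.h>

// Called by the waker just before FUTEX_WAKE: move the wakee onto the core
// the waker is currently running on, so the wake-up stays local.
void pin_wakee_to_my_core(pthread_t wakee) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(sched_getcpu(), &set);
    pthread_setaffinity_np(wakee, sizeof(set), &set);  // this syscall is the cost
}
```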

Bianca answered 25/8, 2022 at 10:12 Comment(4)
AFAIR this is not the case; there were a lot of threads running, more than the number of cores, so it is hard to believe the cores had to switch from a power-down state to running. To support my case I can add that on a single-NUMA-node machine the futex worked (AFAIR) about twice as fast as on a multi-NUMA-node server.Zenobiazeolite
How many threads/cores/active futexes (ones with threads sleeping on them) are we talking about here? I still would have thought this was more a part of the thread-CPU migration problem, unless the numbers were very high. If FUTEX_SWAP were available, the waker and wakee threads and the futex metadata would all stay on the same core, and the NUMA problems would go away. I am sure the cost of a futex swapping cores is small compared to a whole thread swapping cores.Bianca
Note the date the question was asked :) of course I don't remember all these details.Zenobiazeolite
Yes, I did notice, and thank you for coming back and commenting! Hopefully what I've posted here is useful to someone, whether or not it's the explanation for the original problem.Bianca
