How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?

I have a fairly massively parallel application, running on a hyper-threaded, dual quad-core Xeon machine with tons of RAM and a fast SSD RAID. It is developed using boost::asio.

This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(MADV_WILLNEED), so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
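For illustration, the mmap()/madvise() pattern described above looks roughly like the sketch below. The function name, chunk handling and error handling are illustrative assumptions, not the application's actual code.

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <stdexcept>

    // Sketch only: map the current chunk of the output file for writing and
    // hint the kernel to start paging in the next chunk, so later writes are
    // unlikely to stall on page faults.
    void* map_chunk_and_prefetch_next(int fd, off_t offset, size_t chunk_size)
    {
        void* current = ::mmap(0, chunk_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, offset);
        if (current == MAP_FAILED)
            throw std::runtime_error("mmap failed");

        // MADV_WILLNEED is only a hint; a failure here is not fatal.
        void* next = ::mmap(0, chunk_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, offset + (off_t)chunk_size);
        if (next != MAP_FAILED)
            ::madvise(next, chunk_size, MADV_WILLNEED);

        return current;
    }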

This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).

Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of a percent. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.

I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting it, and throughput into the system is much less than desired, intended, and theoretically possible.

My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
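Roughly, the 256-way sharding described above has this shape (a sketch with placeholder key/value types, written with C++11 std::mutex for brevity; the actual types and lock primitives in the application differ):

    #include <array>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    // Sketch of a 256-way sharded table: each shard has its own mutex, so two
    // threads touching different shards never contend on the same lock.
    class ShardedTable {
    public:
        void put(const std::string& key, int value) {
            Shard& s = shard_for(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            s.map[key] = value;
        }

        bool get(const std::string& key, int& value) {
            Shard& s = shard_for(key);
            std::lock_guard<std::mutex> lock(s.mutex);
            std::unordered_map<std::string, int>::iterator it = s.map.find(key);
            if (it == s.map.end()) return false;
            value = it->second;
            return true;
        }

    private:
        struct Shard {
            std::mutex mutex;
            std::unordered_map<std::string, int> map;
        };

        Shard& shard_for(const std::string& key) {
            // Low byte of the hash selects one of the 256 shards.
            return shards_[std::hash<std::string>()(key) & 0xFF];
        }

        std::array<Shard, 256> shards_;
    };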

All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
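For context, the threading model in question is the standard one where every worker thread calls run() on the same io_service. A sketch (boost 1.40-era API; the accept/read handler chains are elided, and the thread count is illustrative):

    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/thread.hpp>

    // All worker threads pump a single, global io_service. asio guards its
    // internal reactor state with mutexes, which is where the futex traffic
    // seen under strace comes from.
    void run_service(boost::asio::io_service* io_service)
    {
        io_service->run();
    }

    int main()
    {
        boost::asio::io_service io_service;

        // Keeps run() from returning while handlers are still being set up.
        boost::asio::io_service::work work(io_service);

        // ... async_accept / async_read chains would be registered here ...

        // One thread per logical core (16 on this machine).
        boost::thread_group threads;
        for (int i = 0; i < 16; ++i)
            threads.create_thread(boost::bind(&run_service, &io_service));

        threads.join_all();
        return 0;
    }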

Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?

EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU. However, the question remains: how can I prove this? And how can I fix it?

Flamen answered 13/8, 2011 at 18:39 Comment(3)
Have you compared performance using newer versions of asio? Boost 1.40 is a tad old and there were some nice improvements integrated fairly recently.Lobscouse
I'm somewhat constrained to using Ubuntu 10.04 LTS, which comes with boost 1.40. I can perhaps test this on a more modern system, but it still needs to deploy on stock 10.04. I think boost::asio is header-only, so perhaps this can be made to work...Flamen
Sam: I did try with the latest released Boost, which is 1.47.0. It still has the same problem -- performance refuses to exceed that of a single CPU core (although all the cores are actually doing some work, just mostly blocked).Flamen

The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread at a time, never entering the kernel from more than one thread at once. I can kind-of understand why, because thread safety and the lifetime of objects are extremely precarious when multiple threads can each get notifications for the same file descriptor. When I coded this up myself (using pthreads), it works and scales beyond a single core. I'm not using boost::asio at that point -- it's a shame that an otherwise well-designed and portable library should have this limitation.
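No code was posted with this answer; the usual shape of the hand-rolled "many threads in epoll_wait() at once" design is sketched below, using EPOLLONESHOT so a ready socket is handed to exactly one thread and re-armed afterwards. The details (buffer handling, thread count, detached threads) are illustrative assumptions, not the answer's actual implementation.

    #include <sys/epoll.h>
    #include <pthread.h>
    #include <unistd.h>
    #include <stdint.h>

    // Sketch: every reactor thread blocks in epoll_wait() on the same epoll
    // fd. Sockets are registered with EPOLLIN | EPOLLONESHOT, so each ready
    // socket is delivered to exactly one thread, which re-arms it when done.
    static void* reactor_thread(void* arg)
    {
        int epfd = (int)(intptr_t)arg;
        struct epoll_event events[64];
        char buf[4096];

        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; ++i) {
                int fd = events[i].data.fd;

                // Placeholder for "read, decode protocol, write to mmap'd file".
                while (read(fd, buf, sizeof(buf)) > 0) { /* ... */ }

                // Re-arm: EPOLLONESHOT disarmed the fd when it was delivered.
                struct epoll_event ev = { 0 };
                ev.events = EPOLLIN | EPOLLONESHOT;
                ev.data.fd = fd;
                epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
            }
        }
        return 0;
    }

    // Spawn one reactor thread per core, all sharing the same epoll fd.
    static void start_reactors(int epfd, int nthreads)
    {
        for (int i = 0; i < nthreads; ++i) {
            pthread_t t;
            pthread_create(&t, 0, reactor_thread, (void*)(intptr_t)epfd);
            pthread_detach(t);
        }
    }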

Flamen answered 25/8, 2011 at 6:13 Comment(4)
could you expand on that a bit, e.g. where this can be seen in source? I'm just curious. :)Marashio
@Jon could it be that the design of ASIO mandates that io_service be one per thread? And just provides locking as a convenience so your program doesn't fail?Downright
Quite to the contrary. The ASIO API says that work will be shared among all threads that enter the same io_service -- that's the whole point of ASIO. The problem really is the lifetime of objects when per-object notifications may be re-ordered.Flamen
@Marashio Just run grep 'unlock' on the entire boost/asio include directory. You'll see numerous places where locks are used.Nonconcurrence

I believe that if you use multiple io_service objects (say, one for each CPU core), each run by a single thread, you will not have this problem. See the HTTP Server 2 example on the Boost.Asio page.
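For reference, a condensed sketch of that io_service-per-core layout, modeled loosely on the HTTP Server 2 example's io_service_pool (the exact code here is illustrative, not the example verbatim):

    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/shared_ptr.hpp>
    #include <boost/thread.hpp>
    #include <cstddef>
    #include <vector>

    // One io_service per core, each pumped by exactly one thread, with new
    // connections handed out round-robin.
    class io_service_pool {
    public:
        explicit io_service_pool(std::size_t pool_size) : next_(0) {
            for (std::size_t i = 0; i < pool_size; ++i) {
                io_service_ptr io(new boost::asio::io_service);
                io_services_.push_back(io);
                work_.push_back(work_ptr(new boost::asio::io_service::work(*io)));
            }
        }

        // Each io_service gets its own thread, so the reactors never share
        // internal locks across cores.
        void run() {
            boost::thread_group threads;
            for (std::size_t i = 0; i < io_services_.size(); ++i)
                threads.create_thread(
                    boost::bind(&io_service_pool::run_one_service, io_services_[i]));
            threads.join_all();
        }

        // Round-robin assignment: a socket stays on the io_service it was
        // constructed with.
        boost::asio::io_service& get_io_service() {
            boost::asio::io_service& io = *io_services_[next_];
            next_ = (next_ + 1) % io_services_.size();
            return io;
        }

    private:
        typedef boost::shared_ptr<boost::asio::io_service> io_service_ptr;
        typedef boost::shared_ptr<boost::asio::io_service::work> work_ptr;

        static void run_one_service(io_service_ptr io) { io->run(); }

        std::vector<io_service_ptr> io_services_;
        std::vector<work_ptr> work_;
        std::size_t next_;
    };

The trade-off, as the comment below points out, is that each I/O object is pinned to the io_service it was created on.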

I have done various benchmarks against the Server 2 and Server 3 examples and have found that the implementation I mentioned works best.

Porism answered 13/10, 2011 at 20:4 Comment(1)
The problem is that each I/O object (timer, socket, etc) has to be dispatched on the io_service that owns it, AFAICT. Statically allocating sockets to threads is not something I want to do.Flamen

In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking inside io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.
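A minimal sketch of how the macro is applied (the timer program is illustrative; any single-threaded asio program works the same way). The macro has to be defined before every asio include in the program, which is easiest to guarantee by passing -DBOOST_ASIO_DISABLE_THREADS on the compiler command line.

    // Only safe when a single thread ever touches asio: the macro strips the
    // internal mutexes whose lock/unlock showed up in the profile.
    #define BOOST_ASIO_DISABLE_THREADS
    #include <boost/asio.hpp>
    #include <boost/date_time/posix_time/posix_time.hpp>
    #include <iostream>

    void on_timeout(const boost::system::error_code& /*ec*/)
    {
        std::cout << "tick\n";
    }

    int main()
    {
        boost::asio::io_service io_service;
        boost::asio::deadline_timer timer(io_service,
                                          boost::posix_time::seconds(1));
        timer.async_wait(&on_timeout);

        // Single-threaded event loop; run()/poll() no longer take locks.
        io_service.run();
        return 0;
    }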

Nonconcurrence answered 31/10, 2015 at 3:11 Comment(5)
When you disable thread support, you can only run it with a single thread, which will obviously not achieve the goal of not being constrained to a single core.Flamen
Why not use one io_service per thread and still stay lock-free? Is that an invalid use case?Nonconcurrence
Because that defeats the purpose of scaling all work across all threads. Note that sockets and the like are bound to a particular io_service in the constructor.Flamen
The 'work' needs to be done from multiple threads, but does it really need to be done in the thread that uses io_service to perform network communication? If 'work' refers to computation, then you could separate the io_service thread from the 'work' threads. Of course then, these threads will need to communicate. I'm sure by default io_service uses locks as I've seen it using valgrind --tool=callgrind. You'll have to account for that when you design your system.Nonconcurrence
You may be trying to help, but I think you're not actually answering the question.Flamen
