C++ Socket Server - Unable to saturate CPU

I've developed a mini HTTP server in C++, using boost::asio, and now I'm load testing it with multiple clients, but I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance, and I'm getting about 50% usage of one CPU, 20% of another, and the remaining two are idle (according to htop).

Details:

  • The server fires up one thread per core
  • Requests are received, parsed, processed, and responses are written out
  • The requests are for data, which is read out of memory (read-only for this test)
  • I'm 'loading' the server using two machines, each running a Java application with 25 threads sending requests
  • I'm seeing about 230 requests/sec throughput (this is application requests, which are composed of many HTTP requests)

So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.
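For illustration, here is a minimal sketch of the kind of setup described above, assuming one shared io_service driven by a run() thread per core (an assumption about the structure, not the actual server code):

```cpp
// Minimal sketch, assuming a single shared io_service with one run()
// thread per core; the acceptor and async handler wiring is elided.
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main() {
    boost::asio::io_service io;
    // Keep run() from returning while there is no pending async work yet.
    boost::asio::io_service::work work(io);

    // ... set up the acceptor and async read/write handlers here ...

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        threads.emplace_back([&io] { io.run(); });
    for (auto& t : threads) t.join();
}
```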

Ideas I've had:

  • The requests are very small and often fulfilled in a few ms; I could modify the client to compose and send bigger requests (perhaps using batching)
  • I could modify the HTTP server to use the select design pattern; is this appropriate here?
  • I could do some profiling to try to understand what the bottleneck(s) are
Windowlight answered 5/8, 2009 at 17:56 Comment(6)
Fair to assume you have a 1Gbps port on the server? What are your request and response sizes (on the wire)?Knockwurst
What is the bandwidth utilization on the server network port (the one I assume to be 1Gbps)?Knockwurst
The test is running on EC2, which I believe uses gigabit networking. Bmon reports a TX rate of about 3 MiB (megabits, I believe) and an RX rate of about 2.5 MiB. Many request/response sizes are small (as little as 100 bytes), but some responses are up to 1 MB, and requests probably up to 0.25 MB.Windowlight
What's the load on your clients? If you only have 1 thread per core and are not utilizing I/O multiplexing (select/poll or similar), you won't get much concurrency, and the threads will likely spend a lot of time doing I/O.Traumatism
Each client machine is running a process with 25 threadsWindowlight
25 threads for a single process on a 4 core CPU? that's excessive, especially under Linux.Ferrocene

boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp which means that only one thread can call into the kernel's epoll syscall at a time. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).

Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
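For illustration, here is a minimal hand-rolled sketch of the multi-threaded edge-triggered pattern mentioned above (not boost::asio or nginetd code): EPOLLONESHOT ensures each ready socket is delivered to exactly one of the threads blocked in epoll_wait.

```cpp
#include <sys/epoll.h>

// Several worker threads can block in epoll_wait on the same epoll fd.
// With EPOLLET | EPOLLONESHOT, a ready socket is handed to exactly one
// thread, which must re-arm it when done. Error handling elided.
void worker_loop(int epfd) {
    for (;;) {
        epoll_event ev;
        if (epoll_wait(epfd, &ev, 1, -1) != 1)
            continue;
        int fd = ev.data.fd;

        // ... read until EAGAIN, process the request, write the response ...

        // EPOLLONESHOT disarmed the fd on delivery; re-arm it so the next
        // request on this socket wakes another (single) thread.
        epoll_event rearm{};
        rearm.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
        rearm.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &rearm);
    }
}
```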

BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.

Kaltman answered 6/8, 2009 at 11:30 Comment(12)
Thanks for the info cmeerw, that's interesting stuff.Windowlight
(+1) cmeerw, I have an unanswered post relating to the performance of boost::asio in general on Windows and Linux. If you have read large sections of asio, please come and answer my post :PIfy
I was really worried about this global lock. It is not as big an issue as it would seem. The bottleneck can only occur in high-throughput scenarios. However, when asio is running in epoll mode (Linux) it preemptively tries to write or read when the async_* call is issued. In a high-input scenario the socket will usually be ready for reading, letting async_read skip epoll entirely. You can't ask for better network performance than that.Acarpous
I don't think it's the case. Yes, it looks like epoll reactor has a scoped lock for the entire duration of the run() function, but it's temporarily released ("lock.unlock();") before calling into epoll_wait and locked again after epoll_wait returns("lock.lock();"). Not sure why it's done this way instead of two scoped locks, though.Are
@Alex Black bump, so that the previous comment reaches the OP. What were your results with this question? Did replacing boost::asio help?Are
@Checkers: Sorry, I didn't go far enough with this to come to any conclusion.Windowlight
@AlexB So this answer is misleading? I am downvoting.Ferrocene
I have just had another look at the current state and am still seeing that there is only one thread inside epoll_wait. Actually, I am seeing a significant slowdown when increasing the number of threads (i.e. it's slower with 4 threads on a quad-core machine than with a single thread on the same machine). I'll provide more details as soon as I have them properly written up.Kaltman
see cmeerw.org/blog/746.html#746 for test-code and some preliminary test resultsKaltman
You can remove the lock with the BOOST_ASIO_DISABLE_THREADS macro. If there's only one thread using the io_service, it should be safe to do so.Handicap
I am confused and would really like to know the impact of the global lock. Can it actually degenerate a multithreaded application to a single threaded one?Rudbeckia
Any change to this in 2021?Summersault

As you are using EC2, all bets are off.

Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.

I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.

Eccrine answered 6/8, 2009 at 12:59 Comment(4)
This system is going to be deployed on EC2, so testing the performance of the system on real hardware wouldn't be helpful, I don't think.Windowlight
Mark's point is valid: For profiling use a real machine, or at least a more controlled environment. Deploy to EC2 all you like, but understand that you are running in a VM image and that means that your "idle" CPU might just be because some other tenant on the box got all the CPU for a while. And that makes profiling difficult.Shaky
Since there are some several hundred thousand (last I heard) EC2 instances running at any given point in time, I think plenty of people grok what it is useful for. You should ask yourself what they know that you do not.Robbegrillet
+1 for pointing out perf testing in VMs is impossible. Especially for networking scenarios - you have to test with physical boxes, physical switch and be able to monitor QoS. Once you are done - you can push to EC2. And then when you have issues with CPU/RAM usage - you can be sure it is EC2/Rackspace at fault.Thither

From your comments on network utilization, you do not seem to have much network traffic.

3 + 2.5 MiB/sec, taken as bytes, is about 46 Mbps (5.5 × 8), i.e. in the 50 Mbps ball-park, compared to your 1 Gbps port.

I'd say you are having one of the following two problems:

  1. Insufficient work-load (a low request rate from your clients)
  2. Blocking in the server (something interfering with response generation)

Looking at cmeerw's notes and your CPU utilization figures (idling at 50% + 20% + 0% + 0%), it seems most likely to be a limitation in your server implementation. I second cmeerw's answer (+1).

Knockwurst answered 6/8, 2009 at 11:46 Comment(1)
He is running tests on Amazon's EC2 Cloud Computing Cluster. Hard to rule out the possibility of bad performance on EC2.Ferrocene

230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation: get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of unneeded locking may get things up to speed.

This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?

Sake answered 6/8, 2009 at 12:51 Comment(5)
Keep in mind the 230 requests/sec are 'application requests', which are composed of many actual HTTP requests.Windowlight
There isn't much locking to get rid of (none in my code), but as cmeerw points out, boost::asio does some internal locking. The HTTP server does purely CPU-bound work, so not using the additional cores would be an expensive wasteWindowlight
If the goal is just to saturate the CPU, do the work in one thread and have the other three calculate PI or something. Having multiple user-level threads won't make it easier or faster for the OS and IO hardware to read and write network packets. Threads and cores are for computational work, if you aren't doing any, they can't possibly gain you anything, and risk contention with whatever else the system is doing.Sake
Except, demonstrably, it's not. Optimal solution is probably one thread doing I/O and 2 or 3 the parsing and so on. But that's very likely premature optimisation until you can get your IO properly asynchronously scheduled so you either saturate one CPU core or your network.Sake
I see what you're saying. Well, I'll fire up the server with 1 thread as a quick test and see what comes of that.Windowlight

ASIO is fine for small to medium tasks, but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows, but if you are experienced you will always do better than ASIO. Either way, there is a lot of overhead with all of those methods, just more with ASIO.

For what it is worth, using raw socket calls, my custom HTTP server can serve 800K dynamic requests per second on a 4-core i7. It is serving from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second; its performance is quite variable and mostly bound in my app. The post by @cmeerw mostly explains why.

One way to improve performance is by implementing a UDP proxy. By intercepting HTTP requests and then routing them over UDP to your backend UDP-HTTP server, you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of an HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact. You just need a couple more servers to implement it. This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. And FWIW, you should always have a frontend-backend setup for any performant site, because the frontends can cache data without slowing down the more important dynamic-request backend.
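As a rough illustration of the relay idea (a hypothetical sketch with assumed names and buffer sizes, not the author's implementation), the front end might forward each request over UDP like this:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

// Forward one already-received HTTP request over UDP to a backend and relay
// the reply back to the TCP client. A real proxy would also handle datagram
// loss, reordering, and responses larger than one datagram, which TCP
// normally hides.
int relay_once(int client_fd, const char* backend_ip, uint16_t backend_port) {
    char buf[64 * 1024];
    ssize_t n = recv(client_fd, buf, sizeof(buf), 0);   // read the HTTP request
    if (n <= 0) return -1;

    int udp = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in backend{};
    backend.sin_family = AF_INET;
    backend.sin_port = htons(backend_port);
    inet_pton(AF_INET, backend_ip, &backend.sin_addr);

    sendto(udp, buf, n, 0, reinterpret_cast<sockaddr*>(&backend), sizeof(backend));
    ssize_t m = recv(udp, buf, sizeof(buf), 0);         // wait for the backend reply
    close(udp);
    if (m <= 0) return -1;
    return send(client_fd, buf, m, 0) == m ? 0 : -1;    // relay it to the client
}
```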

The future seems to be writing your own driver that implements its own network stack, so you can get as close to the requests as possible and implement your own protocol there. That probably isn't what most programmers want to hear, as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this; however, you will need more servers. Though if you are doing this many requests per second, you will usually need multiple network cards and multiple frontends to handle the bandwidth, so having a couple of lightweight UDP proxies in there isn't that big a deal.

Hope some of this can be useful to you.

Lem answered 24/2, 2016 at 1:35 Comment(1)
Care to show an example or a working project? Without one, this is as helpful as irrelevant talk. Not trying to demean you, but some concrete code is needed here.Digitalin

How many instances of io_service do you have? Boost.Asio has an example that creates an io_service per CPU and uses them in round-robin fashion.

You can still create four threads and assign one per CPU, but each thread can poll on its own io_service.
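A condensed sketch of that pattern, modelled on the io_service_pool from the Boost.Asio "HTTP Server 2" example (simplified here, not the example verbatim):

```cpp
#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

// One io_service (and one work guard) per core; connections are handed out
// round-robin, so each socket lives on exactly one event loop.
class io_service_pool {
public:
    explicit io_service_pool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            auto ios = std::make_shared<boost::asio::io_service>();
            work_.push_back(std::make_shared<boost::asio::io_service::work>(*ios));
            services_.push_back(ios);
        }
    }

    // Run each io_service in its own thread and block until they finish.
    void run() {
        std::vector<std::thread> threads;
        for (auto& ios : services_)
            threads.emplace_back([ios] { ios->run(); });
        for (auto& t : threads) t.join();
    }

    // Pick the io_service for the next accepted connection, round-robin.
    boost::asio::io_service& get_io_service() {
        boost::asio::io_service& ios = *services_[next_];
        next_ = (next_ + 1) % services_.size();
        return ios;
    }

private:
    std::vector<std::shared_ptr<boost::asio::io_service>> services_;
    std::vector<std::shared_ptr<boost::asio::io_service::work>> work_;
    std::size_t next_ = 0;
};
```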

Wayne answered 19/6, 2016 at 12:14 Comment(0)
