We've been using asio in production for years now and recently we have reached a critical point when our servers become loaded just enough to notice a mysterious issue.
In our architecture, each separate entity that runs independently uses a personal strand
object. Some of the entities can perform a long work (reading from file, performing MySQL request, etc). Obviously, the work is performed within handlers wrapped with strand. All sounds nice and pretty and should work flawlessly, until we have begin to notice an impossible things like timers expiring seconds after they should, even though threads are 'waiting for work' and work being halt for no apparent reason. It looked like long work performed inside a strand had impact on other unrelated strands, not all of them, but most.
Countless hours were spent to pinpoint the issue. The track has led to the way strand
object is created: strand_service::construct
(here).
For some reason developers decided to have a limited number of strand
implementations. Meaning that some totally unrelated objects will share a single implementation and hence will be bottlenecked because of this.
In the standalone (non-boost) asio library similar approach is being used. But instead of shared implementations, each implementation is now independent but may share a mutex
object with other implementations (here).
What is it all about? I have never heard of limits on number of mutexes in the system. Or any overhead related to their creation/destruction. Though the last problem could be easily solved by recycling mutexes instead of destroying them.
I have a simplest test case to show how dramatic is a performance degradation:
#include <boost/asio.hpp>
#include <atomic>
#include <functional>
#include <iostream>
#include <thread>
std::atomic<bool> running{true};
std::atomic<int> counter{0};
struct Work
{
Work(boost::asio::io_service & io_service)
: _strand(io_service)
{ }
static void start_the_work(boost::asio::io_service & io_service)
{
std::shared_ptr<Work> _this(new Work(io_service));
_this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
}
static void do_the_work(std::shared_ptr<Work> _this)
{
counter.fetch_add(1, std::memory_order_relaxed);
if (running.load(std::memory_order_relaxed)) {
start_the_work(_this->_strand.get_io_service());
}
}
boost::asio::strand _strand;
};
struct BlockingWork
{
BlockingWork(boost::asio::io_service & io_service)
: _strand(io_service)
{ }
static void start_the_work(boost::asio::io_service & io_service)
{
std::shared_ptr<BlockingWork> _this(new BlockingWork(io_service));
_this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
}
static void do_the_work(std::shared_ptr<BlockingWork> _this)
{
sleep(5);
}
boost::asio::strand _strand;
};
int main(int argc, char ** argv)
{
boost::asio::io_service io_service;
std::unique_ptr<boost::asio::io_service::work> work{new boost::asio::io_service::work(io_service)};
for (std::size_t i = 0; i < 8; ++i) {
Work::start_the_work(io_service);
}
std::vector<std::thread> workers;
for (std::size_t i = 0; i < 8; ++i) {
workers.push_back(std::thread([&io_service] {
io_service.run();
}));
}
if (argc > 1) {
std::cout << "Spawning a blocking work" << std::endl;
workers.push_back(std::thread([&io_service] {
io_service.run();
}));
BlockingWork::start_the_work(io_service);
}
sleep(5);
running = false;
work.reset();
for (auto && worker : workers) {
worker.join();
}
std::cout << "Work performed:" << counter.load() << std::endl;
return 0;
}
Build it using this command:
g++ -o asio_strand_test_case -pthread -I/usr/include -std=c++11 asio_strand_test_case.cpp -lboost_system
Test run in a usual way:
time ./asio_strand_test_case
Work performed:6905372
real 0m5.027s
user 0m24.688s
sys 0m12.796s
Test run with a long blocking work:
time ./asio_strand_test_case 1
Spawning a blocking work
Work performed:770
real 0m5.031s
user 0m0.044s
sys 0m0.004s
Difference is dramatic. What happens is each new non-blocking work creates a new strand
object up until it shares the same implementation with strand
of the blocking work. When this happens it's a dead-end, until long work finishes.
Edit:
Reduced parallel work down to the number of working threads (from 1000
to 8
) and updated test run output. Did this because when both numbers are close the issue is more visible.