boost::asio reasoning behind num_implementations for io_service::strand
We've been using asio in production for years now, and recently we reached a critical point where our servers became loaded just enough for us to notice a mysterious issue.

In our architecture, each separate entity that runs independently uses a personal strand object. Some of the entities can perform long work (reading from a file, performing a MySQL request, etc.). Naturally, the work is performed within handlers wrapped with the strand. It all sounds nice and should work flawlessly, until we began to notice impossible things: timers expiring seconds after they should, even though threads were 'waiting for work', and work being halted for no apparent reason. It looked like long work performed inside one strand had an impact on other, unrelated strands: not all of them, but most.

Countless hours were spent pinpointing the issue. The trail led to the way a strand object is created: strand_service::construct (here).

For some reason the developers decided to have a limited number of strand implementations, meaning that some totally unrelated objects will share a single implementation and hence be bottlenecked by each other.

The standalone (non-Boost) asio library uses a similar approach, but instead of shared implementations, each implementation is independent and may merely share a mutex object with other implementations (here).

What is this all about? I have never heard of limits on the number of mutexes in a system, or of any overhead related to their creation/destruction. Even if the latter existed, it could easily be solved by recycling mutexes instead of destroying them.

Here is a minimal test case showing how dramatic the performance degradation is:

#include <boost/asio.hpp>
#include <unistd.h>  // sleep()
#include <atomic>
#include <functional>
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

std::atomic<bool> running{true};
std::atomic<int> counter{0};

struct Work
{
    Work(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<Work> _this(new Work(io_service));

        _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<Work> _this)
    {
        counter.fetch_add(1, std::memory_order_relaxed);

        if (running.load(std::memory_order_relaxed)) {
            start_the_work(_this->_strand.get_io_service());
        }
    }

    boost::asio::strand _strand;
};

struct BlockingWork
{
    BlockingWork(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<BlockingWork> _this(new BlockingWork(io_service));

         _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<BlockingWork> _this)
    {
        sleep(5);
    }

    boost::asio::strand _strand;
};


int main(int argc, char ** argv)
{
    boost::asio::io_service io_service;
    std::unique_ptr<boost::asio::io_service::work> work{new boost::asio::io_service::work(io_service)};

    for (std::size_t i = 0; i < 8; ++i) {
        Work::start_the_work(io_service);
    }

    std::vector<std::thread> workers;

    for (std::size_t i = 0; i < 8; ++i) {
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
    }

    if (argc > 1) {
        std::cout << "Spawning a blocking work" << std::endl;
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
        BlockingWork::start_the_work(io_service);
    }

    sleep(5);
    running = false;
    work.reset();

    for (auto && worker : workers) {
        worker.join();
    }

    std::cout << "Work performed:" << counter.load() << std::endl;
    return 0;
}

Build it using this command:

g++ -o asio_strand_test_case -pthread -I/usr/include -std=c++11 asio_strand_test_case.cpp -lboost_system

Test run in a usual way:

time ./asio_strand_test_case 
Work performed:6905372

real    0m5.027s
user    0m24.688s
sys     0m12.796s

Test run with a long blocking work:

time ./asio_strand_test_case 1
Spawning a blocking work
Work performed:770

real    0m5.031s
user    0m0.044s
sys     0m0.004s

The difference is dramatic. Each new non-blocking work creates a new strand object, until one of them ends up sharing its implementation with the strand of the blocking work. When that happens, it is a dead end until the long work finishes.

Edit: Reduced the amount of parallel work down to the number of worker threads (from 1000 to 8) and updated the test run output. I did this because the issue is more visible when the two numbers are close.

Eagan answered 27/10, 2016 at 17:55 Comment(0)

Well, an interesting issue, and +1 for giving us a small example reproducing the exact problem.

The problem you are having, as I understand it, is that the Boost implementation by default instantiates only a limited number of strand_impl objects: 193, as I see in my version of Boost (1.59).

What this means is that a large number of requests will be in contention, as they will be waiting for the lock to be released by another handler (one using the same instance of strand_impl).

My guess as to why it is done this way is to avoid overloading the OS by creating lots and lots of mutexes, which would be bad. The current implementation allows the locks to be reused (and in a configurable way, as we will see below).

In my setup:

MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -O2 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:489696

real    0m5.016s
user    0m1.620s
sys 0m4.069s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:188480

real    0m5.031s
user    0m0.611s
sys 0m1.495s

Now, there is a way to change this number of cached implementations: set the macro BOOST_ASIO_STRAND_IMPLEMENTATIONS.

Below is the result I got after setting it to a value of 1024:

MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -DBOOST_ASIO_STRAND_IMPLEMENTATIONS=1024 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:450928

real    0m5.017s
user    0m2.708s
sys 0m3.902s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:458603

real    0m5.027s
user    0m2.611s
sys 0m3.902s

Almost the same for both cases! You may want to tune the value of the macro to your needs to keep the deviation small.

Enugu answered 27/10, 2016 at 19:30 Comment(8)
"My guess for doing such a thing would be to disallow overloading the OS by creating lots and lots and lots of mutexes. That would be bad." Why? What overhead is there aside from a small constant (per-mutex) amount of memory? — Beulahbeuthel
@yurikilochek They're mutexes. By definition they're useless unless used to synchronize on. That makes for large collections of synchronization primitives being simultaneously waited upon. ::WaitForMultipleObjectsEx might not mind, but that's a context switch; it's not just a few bytes of memory. On Linux, there is no such call AFAIK. — Turbulence
@Enugu No matter the number of implementations, the issue will persist, because it's in the design. Increasing the number may win some time, but only to some extent. In a real-time application this will never work. Try my example with the number of work objects equal to the number of threads, i.e. 8 instead of 1000. In that case 1024 implementations barely help (Work performed:8331). — Eagan
@Eagan I don't agree that it is an outright design issue. As said earlier, you will have to tune the configuration macros to your requirements. Can you try adding the -DBOOST_ASIO_ENABLE_SEQUENTIAL_STRAND_ALLOCATION -DBOOST_ASIO_STRAND_IMPLEMENTATIONS=50000 flags to your build? — Enugu
@Eagan One issue I see in the implementation is the raw use of new. I believe you could speed it up further with a memory allocator, and by changing the code if you use standalone ASIO. — Enugu
@Enugu Wouldn't it be easier to have a pool of strands? It would work much better than a vector of 50k strands. If allocating such a huge number of mutexes is no problem, why bother with these hacks? It is simpler to implement a dynamic manager. — Eagan
@Eagan ASIO is basically implementing a pool of strand service implementations already. 50k is a lot less than the number of operations you are doing. There may be a better solution that I am unaware of. I think it would be worthwhile logging a bug/enhancement ticket, or maybe someone already has. — Enugu
@Eagan Ah, did you mean a pool of the strand instances you are creating inside Work? You can try that, but I don't think it would be any improvement. Please let me know otherwise. — Enugu

Edit: As of recent Boosts, standalone ASIO and Boost.ASIO are now in sync. This answer is preserved for historical interest.

Standalone ASIO and Boost.ASIO have become quite detached in recent years, as standalone ASIO slowly morphs into the reference Networking TS implementation for standardisation. All the "action" is happening in standalone ASIO, including major bug fixes; only very minor bug fixes are made to Boost.ASIO. There are several years of difference between them by now.

I'd therefore suggest that anyone finding any problems at all with Boost.ASIO should switch over to standalone ASIO. The conversion is usually not hard; look into the many macro configs for switching between C++ 11 and Boost in config.hpp. Historically, Boost.ASIO was auto-generated by script from standalone ASIO; it may be that Chris has kept those scripts working, in which case you could regenerate a brand shiny new Boost.ASIO with all the latest changes. I'd suspect such a build is not well tested, however.

Dacron answered 28/10, 2016 at 12:58 Comment(3)
That's interesting, @Niall Douglas. Looking at the release notes, the last version of standalone asio to make it into Boost was back in April 2015. That version was asio 1.10.6, whilst the latest asio development release shows 1.10.5 as the last main release, so you're right: they've diverged whilst Chris concentrates on the Networking Library Proposal, now N4612. — Imperfection
Unfortunately, the strand_impl allocation strategy was not changed in the standalone version. There is some work in the right direction in strand_executor_service. I have tried to port it to the vanilla strand_service, but with no luck. The current design depends so much on the guarantee that a strand_impl is not destructed, even after its strand is, that it's almost impossible to fix without a redesign. In any case, I've written to the mailing list. — Eagan
Boost.Asio and standalone Asio are now in sync; this answer is outdated. — Lamellirostral

Note that if you don't like Asio's implementation, you can always write your own strand which creates a separate implementation for each strand instance. This might be better for your particular platform than the default algorithm.

Lamellirostral answered 29/7, 2018 at 21:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.