boost find in shared-memory method get stuck in c++ multi-process project
Asked Answered
I

1

6

I am using boost's ipc library for saving complex object, include images, in shared memory, used by several processes. Let's call this object MyImage. The shared memory is a circular buffer saving several MyImage objects at a time.

In my code there are two (or more) processes writing to a segment in a shared memory, and another one reading from it. This flow works as expected, but after the reader process is done, or crashed, when it tries to open the same object in shared memory again it get stuck on find method, while the writer processes are still running fine.

I tried to understand which race condition could cause this, but couldn't find any explanation in my code, or in boost's documentation.

Here is a simple code exampled the problem in my project:

The Writer process:

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/ipc/message_queue.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/circular_buffer.hpp>

using namespace std;
namespace bip = boost::interprocess;

static const char *const PLACE_SHM_NAME = "PlaceInShm";
static const char *const OBJECT_SHM_NAME = "ObjectInShm";
static const char *const PUSH_POP_LOCK = "push_pop_image_lock";
static const int IMAGES_IN_BUFFER = 20;
static const int OBJECT_SIZE_IN_SHM = 91243520;

class MyImage;

typedef bip::managed_shared_memory::segment_manager SegmentManagerType;
typedef bip::allocator<void, SegmentManagerType> MyImageVoidAllocator;
typedef bip::deleter<MyImage, SegmentManagerType> MyImageDeleter;
typedef bip::shared_ptr<MyImage, MyImageVoidAllocator, MyImageDeleter> MyImageSharedPtr;

typedef bip::allocator<MyImageSharedPtr, bip::managed_shared_memory::segment_manager> MyImageShmemAllocator;
typedef boost::circular_buffer<MyImageSharedPtr, MyImageShmemAllocator> MyImageContainer;

MyImageSharedPtr GetMyImage() {
    // some implementation
    return nullptr;
}

int main(int argc, char *argv[]) {

    MyImageContainer *my_image_data_container;
    try {
        bip::named_mutex open_lock{bip::open_or_create, OPEN_SHM_LOCK};
        bip::managed_shared_memory image_segment = bip::managed_shared_memory(bip::open_or_create, PLACE_SHM_NAME, OBJECT_SIZE_IN_SHM);
        my_image_data_container = image_segment.find_or_construct<MyImageContainer>(OBJECT_SHM_NAME)(IMAGES_IN_BUFFER, image_segment.get_segment_manager());
    } catch (boost::interprocess::interprocess_exception &e) {
        exit(1);
    }
    boost::interprocess::named_mutex my_image_mutex_ptr(boost::interprocess::open_or_create, PUSH_POP_LOCK);

    while (true) {
        MyImageSharedPtr img = GetMyImage();
        my_image_mutex_ptr.lock();
        my_image_data_container->push_back(img);
        my_image_mutex_ptr.unlock();
        usleep(1000);
    }
}    

The Reader process:

int main(int argc, char *argv[]) {

    MyImageContainer *my_image_data_container;
    try {
        bip::named_mutex open_lock{bip::open_only, OPEN_SHM_LOCK};
        bip::scoped_lock<bip::named_mutex> lock(open_lock, bip::try_to_lock);
        bip::managed_shared_memory image_segment = bip::managed_shared_memory(bip::open_only, PLACE_SHM_NAME);
        my_image_data_container = image_segment.find<MyImageContainer>(OBJECT_SHM_NAME).first;
    } catch (boost::interprocess::interprocess_exception &e) {
        exit(1);
    }
    boost::interprocess::named_mutex my_image_mutex_ptr(boost::interprocess::open_or_create, PUSH_POP_LOCK);

    while (true) {
        if (my_image_data_container->size() == 0) {
            continue;
        }
        MyImage *img;
        my_image_mutex_ptr.lock();
        img = &(*my_image_data_container->at(0));
        my_image_data_container->pop_front();
        my_image_mutex_ptr.unlock();
        // do stuff with img
        usleep(1000);
    }
}

The flow to reproduce the bug:

  1. Run two processes of the Writer code.
  2. Run one process of the Reader code.
  3. kill the Reader process.
  4. run the Reader process again.

At the second run, the process is stuck in the line image_segment.find<MyImageContainer>(OBJECT_SHM_NAME).first; while the Writer processes are fine.

Important to mention that each Writer process have a unique id, and write to the buffer in the shared memory only int(IMAGES_IN_BUFFER / NUMBER_OF_WRITERS) images starting from the index as his id. For example, I have two Writers with id 0 and id 1, IMAGES_IN_BUFFER=20, then Writer 0 will write to indexes 0-9 and Writer 1 to 10-19.

Some of my debugging process:

  • I tried to open the shared memory in a separate thread, using the future object, and set a timeout of few seconds. But the whole process is still stuck.
  • When I kill the process after it is stuck, and re-run it, it never succeed again, unless I delete the object from shared memory and re-run all of the processes, include the Writers.
  • Usually when running with one Writer I couldn't reproduce the bug, but I can't say for sure.
  • It is not consistent, meaning I can't tell when it will get stuck and when not.
  • Maybe the object in the shared memory is corrupted somehow, while the Reader processes is crashing, and then to while reopen it, it fails. In that case I expect that boost will raise an exception not hang.
  • When the process exit gracefully, with exit code 0, it can happen as well.

Waiting to hear some opinions about what can be the cause of the process getting stuck. Thanks in advance!

Infeudation answered 14/6, 2021 at 8:17 Comment(2)
in the documentation there is note, that might be relevant for you boost.org/doc/libs/1_76_0/doc/html/interprocess/… " The atomic function only blocks named, unique and anonymous creation, search and destruction (concurrent calls to construct<>, find<>, find_or_construct<>, destroy<>...) from other processes." Is it possible that your find is blocked by the writers operation?Diadromous
@AlessandroTeruzzi, it does not sound like this is the case, as the question describes that it does work before the reader crash/exitDaph
M
6

I see many issues with your code. Beyond that, there is also a known limitation. So let's start with that

Robustness of Interprocess Mutexes

First off, it's a wellknown issue with the library that there are no robust interprocess mutexes (portably):

So, the best you can probably do is indeed to do a timed wait and have a "forced clear" option that you can manually engage when you know it is safe to do so.

Code Issues and Review

That said, there are some issues with the code, and some things you may improve to get things less brittle.

  1. You mentioned crashing. This goes without saying: crashes are going to break invariants, avoid them.

  2. This may include having a proper interrupt signal handler to make sure that you properly shut down.

  3. You never lock the open lock in the writer path. This is and obvious problem that should be fixed.

  4. In the reader path you issue a try_lock but never seem to verify that it succeeded.

  5. In the code as shown, you're using my_image_data_container after the shared memory segment is destructed. This, by definition, will always be Undefined Behaviour

  6. You are not using a RAII-enabled lock guard for the push/pop mutex (my_image_mutex_ptr). This means that it is not exception safe and will - again - cause locks to get stuck on exceptions.

  7. In general, you seem to be confusing lockable primitives with locks. I'd suggest renaming the objects (open_lock -> open_mutex, my_image_mutex_ptr (?!) -> modify_mutex) to avoid such confusion.

  8. I would probably suggest using the same mutex for both opening and modification (after all, creating the segment is not really allowed during modification, is it?). Alternatively, consider using an unnamed interprocess mutex inside the shared segment to remove unnecessary SHM namespace pollution. One less lock that could potentially be stuck, even after removing the shared memory segment itself!).

  9. The ->empty() check is a data race:

    • it doesn't lock the modify-mutex so the container might be written to concurrently
    • it doesn't hold that same lock, so right after returning from empty() the return value is no longer reliable, as something may have been modified in the mean time

    In C++, a data race is also invoking Undefined Behaviour

  10. There's big problems here:

    img = &(*container->at(0));
    

    This dereferences the shared pointer, retaining only the raw pointer. However, the next line pop_front()s from the container, so the shared pointer is removed, potentially (likely, given the code shown) destroying the image.

    Just don't lose the refcount, and use the shared pointer.

  11. Many names can be improved for readability. You'd typically think "computers don't care", but humans do. And all the micro-confusions compound, which probably explains 50% of the bugs uncovered in this post.

  12. Some loose ends (exceptions are best caught by const&; you should check the bool on find<>(), container->at(0) can be spelled as container->front() etc.)

Counter Demo

Here's a version of the code reviewed for the above and written with a a bit more modern C++ style. The writer and reader are in a single main now (which you can switch using an arbitrary command line argument).

Live On Coliru

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/ipc/message_queue.hpp>
#include <boost/interprocess/sync/named_mutex.hpp>
#include <boost/interprocess/smart_ptr/shared_ptr.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/circular_buffer.hpp>
#include <iostream>
#include <mutex>
#include <thread>

namespace bip = boost::interprocess;
using namespace std::chrono_literals;
constexpr char const* OPEN_SHM_LOCK = "OPEN_SHM_LOCK";

static const char* const SHM_NAME         = "PlaceInShm";
static const char* const OBJ_NAME         = "ObjectInShm";
static const char* const PUSH_POP_LOCK    = "push_pop_image_lock";
static const int         BUF_CAPACITY = 20;
static const int         SHM_SIZE         = 91243520;

using Segment = bip::managed_shared_memory;
using Mgr     = Segment::segment_manager;

class MyImage{};

template <typename T> using Alloc = bip::allocator<T, Mgr>;
using MyDeleter                   = bip::deleter<MyImage, Mgr>;
using SharedImage = bip::shared_ptr<MyImage, Alloc<MyImage>, MyDeleter>;

using Container = boost::circular_buffer<SharedImage, Alloc<SharedImage>>;

SharedImage GetMyImage() {
    // some implementation
    return {};
}

int main(int argc, char**) try {
    bool const isWriter = (argc == 1);
    std::cout << (isWriter? "Writer":"Reader") << std::endl;

    // extract variable part to reduce code duplication
    auto find_container = [isWriter](Segment& smt) {
        if (isWriter)
            return smt.find_or_construct<Container>(OBJ_NAME)(
                BUF_CAPACITY, smt.get_segment_manager());

        auto [container, ok] = smt.find<Container>(OBJ_NAME);
        assert(ok); // TODO proper error handling?

        return container;
    };

    bip::named_mutex open_mutex{bip::open_or_create, OPEN_SHM_LOCK};
    if (std::unique_lock open_lk{open_mutex, std::try_to_lock}) {
        Segment smt(bip::open_or_create, SHM_NAME, SHM_SIZE);
        auto container = find_container(smt);

        open_lk.unlock();

        bip::named_mutex modify_mutex(bip::open_or_create, PUSH_POP_LOCK);

        while (isWriter) {
            SharedImage img = GetMyImage();

            {
                std::unique_lock lk(modify_mutex);
                container->push_back(img);
            }
            std::cout << "Pushed" << std::endl;
            std::this_thread::sleep_for(1s);
        }

        while (not isWriter) {
            SharedImage img;

            if (std::unique_lock lk(modify_mutex); !container->empty()) {
                img = std::move(container->front());
                container->pop_front();
            } else {
                continue;
            }

            // if (img)
            {
                // do stuff with img
                std::cout << "Popped" << std::endl;
            }

            std::this_thread::sleep_for(1s);
        }
    } else {
        std::cout << "Failed to acquire open lock" << std::endl;
    }
} catch (bip::interprocess_exception const& e) {
    std::cerr << "Error: " << e.what() << std::endl;
    exit(1);
}

Which on my system works well enough. I left as an exercise for the reader: replacing the modify-lock and adding signal handlers for shutdown.

Recorded Demo

Here's a simple recorded demo that demonstrates it working as expected on my system, using various numbers of readers/writers:

enter image description here

Mahlstick answered 14/6, 2021 at 12:14 Comment(3)
Added a screen recording of a demo run for reference.Mahlstick
thank you! I appreciate your effort. I agree with some of your points, and used in my original code, but not here for simplicity. I will review the others and get back to you. Regarding the combination of the writer and reader in your code, is it a problem using open_or_create segment in both of them? I want that only the writer will be responsible of creating (and deleting) the object in shared memory..Infeudation
Yeah if you want to diagnose wrong invocation of a reader without any writers ever started, you need to distaste that tooMahlstick

© 2022 - 2024 — McMap. All rights reserved.