How to make a multiple-read/single-write lock from more basic synchronization primitives?
We have found several spots in our code where concurrent reads of data protected by a mutex are rather common, while writes are rare. Our measurements seem to say that using a simple mutex seriously hinders the performance of the code reading that data. So what we would need is a multiple-read/single-write mutex. I know that this can be built atop simpler primitives, but before I try this myself, I'd rather ask for existing knowledge:

What is an approved way to build a multiple-read/single-write lock out of simpler synchronization primitives?

I do have an idea how to make it, but I'd rather have answers unbiased by what I (probably wrongly) came up with. (Note: What I expect is an explanation of how to do it, probably in pseudo code, not a full-fledged implementation. I can certainly write the code myself.)

Caveats:

  • This needs to have reasonable performance. (What I have in mind would require two lock/unlock operations per access. Now that might not be good enough, but needing many of them instead seems unreasonable.)

  • Commonly, reads are more numerous, but writes are more important and performance-sensitive than reads. Readers must not starve writers.

  • We are stuck on a rather old embedded platform (a proprietary variant of VxWorks 5.5), with a rather old compiler (GCC 4.1.2), and Boost 1.52 – minus most of Boost's parts, which rely on POSIX, because POSIX isn't fully implemented on that platform. The locking primitives available are basically several kinds of semaphores (binary, counting, etc.), on top of which we have already created mutexes, condition variables, and monitors.

  • This is IA32, single-core.

Marienthal answered 9/1, 2015 at 12:33 Comment(15)
"This needs to have reasonable performance" - so does a regular mutex have "reasonable performance"? Why not?Offcenter
@sehe: When, under the hood of such an implementation, a reader needs to lock and release three mutexes in order to access the data, then that is unreasonable performance.Marienthal
"Proprietary version of VxWorks" - what platform - IA32, IA64 or another VxWorks-supported platform (PPC/ARM/etc)?Ganef
@frasnian: This is IA32.Marienthal
@Marienthal as part of your dev tools do you have WindView? It's a very cool graphical way to see exactly what's going on in your system, might help find out where the performance is going in the first place. Think ftrace (or dtrace), but a whole heap better.Orchard
@Marienthal Just a thought - you shouldn't have to create your own mutexes, and condition variables; they're already implemented in VxWorks. There's a native mutex (see SemMCreate()), and POSIX condition variables (see semPxLib). In case you've not got it, see the programmers guide at ing.iac.es/~docs/external/vxworks.old/Programmers-Guide-5.5.pdfOrchard
@bazza: No, we have a proprietary IDE (based on an older Eclipse version). But we're not using this for anything but deployment anymore, having implemented our own (SCons-based) build system in vanilla eclipse instead.Marienthal
@Orchard In fact, semMCreate() (the sem prefix stands for semaphore) is exactly what we wrapped our thin mutex class around. Condition variables, however, we based on binary semaphores. Given that most of POSIX support is lacking, I see little value in using semPxLib, which, as you(?) suggested elsewhere, is very likely just a thin layer atop of binary semaphores – which we might just as well wrap ourselves.Marienthal
@sbi, ah, a C++ wrapper, I understand. Yes, it was me. Pity you haven't got WindView, it makes diagnosis a lot easier. The IDE based on old Eclipse sounds 'unofficial'; I hope your vendor is actually selling you runtime licenses. The WindRiver development tools are very expensive, (probably why your vendor has supplied something else), but were well worth it on my projects (manpower saved was quite high when debugging). It might be worth giving the counting semaphore idea a go - easy to replace your existing mutex. If it's quicker, result! If not, not much time wasted trying it out.Orchard
@Marienthal It might pay for you to update your question with your platform information (cpu, os if any...). Though you're asking for generalized pseudo code I suspect that some of the guys here might be able to give you an existing solution, for your platform, that would suffice... My initial thoughts went to RCU style locking but it really depends on what system you are running...Agnate
@Marienthal Though not pseudo code, or even a proper explanation, looking at sqlite.org/lockingv3.html could offer an alternative insight to a custom design.Agnate
@nonsensickle: I've added a statement that we're talking single-core IA32 here.Marienthal
@bazza: " I hope your vendor is actually selling you runtime licenses." I am not sure how to understand that, but I certainly believe that the VxWorks they are selling is fully licensed. (This isn't a bad vendor. We're not stuck with an old VxWorks version because we couldn't be bothered to look for something better, but because there are other good reasons to to stick with this company's products. It's just that they have a – good! – proprietary platform atop of VxWorks that they haven't ported to a newer VxWorks version.)Marienthal
@sbi, in my experience not every vendor supplying VxWorks was actually licensed to do so by WindRiver! The OS itself has no inbuilt licensing controls, but the dev tools did and the license usage was auditable by WindRiver. Reputable vendors should be OK though. VxWorks hasn't moved very far, it's now only got to version 7.Orchard
This is a solution in C++11+ based on the Standard Library: linkedin.com/pulse/… . I guess you'd have to rewrite it using pthreads. But I question how much good it would do on a single core. With a single core, you can't really have two simultaneous readers. You could only have multiple simultaneously-runnable reader threads, but one core can only run one thread. So, at best, I would think you'd see a small reduction in longest read latency.Coo
22

It seems like you only have mutexes and condition variables as synchronization primitives. Therefore, I write a reader-writer lock here which starves readers. It uses one mutex, two condition variables, and three integers.

active_readers - the number of readers currently holding the read lock
waiting_writers - the writers waiting in the cv writerQ plus the writer currently writing
active_writers - the writer currently writing; can only be 1 or 0

It starves readers in the following way: if several writers want to write, readers never get a chance to read until all writers have finished writing, because arriving readers have to check the waiting_writers variable. At the same time, the active_writers variable guarantees that only one writer can write at a time.

#include <condition_variable>
#include <mutex>

class RWLock {
public:
    RWLock()
    : shared()
    , readerQ(), writerQ()
    , active_readers(0), waiting_writers(0), active_writers(0)
    {}

    void ReadLock() {
        std::unique_lock<std::mutex> lk(shared);
        // Writer priority: readers block while any writer waits or writes.
        while( waiting_writers != 0 )
            readerQ.wait(lk);
        ++active_readers;
        lk.unlock();
    }

    void ReadUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --active_readers;
        lk.unlock();
        // A departing reader may unblock a waiting writer.
        writerQ.notify_one();
    }

    void WriteLock() {
        std::unique_lock<std::mutex> lk(shared);
        ++waiting_writers;
        // Wait until neither readers nor another writer hold the lock.
        while( active_readers != 0 || active_writers != 0 )
            writerQ.wait(lk);
        ++active_writers;
        lk.unlock();
    }

    void WriteUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --waiting_writers;
        --active_writers;
        // Hand over to the next writer if one waits; otherwise release all readers.
        if(waiting_writers > 0)
            writerQ.notify_one();
        else
            readerQ.notify_all();
        lk.unlock();
    }

private:
    std::mutex              shared;
    std::condition_variable readerQ;
    std::condition_variable writerQ;
    int                     active_readers;
    int                     waiting_writers;
    int                     active_writers;
};
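
Since every ReadLock() must be paired with a ReadUnlock(), a small RAII wrapper in the style of std::lock_guard keeps callers honest. A minimal sketch (ReadGuard and WriteGuard are illustrative names, not part of the original answer):

class ReadGuard {
    RWLock& rw_;
public:
    explicit ReadGuard(RWLock& rw) : rw_(rw) { rw_.ReadLock(); }
    ~ReadGuard() { rw_.ReadUnlock(); }
};

class WriteGuard {
    RWLock& rw_;
public:
    explicit WriteGuard(RWLock& rw) : rw_(rw) { rw_.WriteLock(); }
    ~WriteGuard() { rw_.WriteUnlock(); }
};

// Usage:
//   RWLock lock;
//   { ReadGuard g(lock);  /* any number of concurrent readers here */ }
//   { WriteGuard g(lock); /* exactly one writer here */ }
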
Compassion answered 24/1, 2015 at 1:59 Comment(6)
Actually, all we have is semaphores, the rest is built atop of them. (Also, we don't have C++11/14's synchronization stuff, but that I file under "pseudo code".) Anyway, this looks pretty much like what I had in mind, only you show that it only needs a mutex for managing the lock's data, and none for actually accessing the data. Thanks for this one, it's definitely a bonus candidate!Marienthal
@sbi. One improvement I could think of is to give more privilege to late writers: when a writer comes, put it at the front of the mutex's queue, i.e. use a deque as the underlying data structure in the mutex. Then, on every unlock, a writer gets out first even if it came late.Compassion
After looking over all the answers together with a colleague, I decided to give the bonus to this answer because the algorithm here seems reasonably sound, was posted earlier than Howard's similar algorithm, and has the advantage over Howard's that it fulfills the requirement of favoring writers.Marienthal
However, note that we had considerable problems making us give the bonus to an answer that sports code which cannot even decide whether it uses pre-increment (because that's better) or post-increment (because that's traditional), or whether to use braces for single-line condition bodies, explicitly unlocks locks shortly before they go out of scope, and has other style issues, too. I am tempted to clean up this mess just to prevent others thinking bad of me because I knighted code which is that bad. For a C++ programmer, this truly is awfully bad code.Marienthal
@Marienthal Thank you for accepting my answer. However, even though I just stepped into industry, I don't think my code is "awfully bad". Anyway, feel free to change the code and I can see where to improve.Compassion
I have unified the bracing style, changed everything to pre-increment, now all members are initialized explicitly, and two variables got renamed in order to make their meaning more obvious. (Note: I don't quibble over where/when braces or commas are put, so feel free to change this. But I do care about consistency.)Marienthal
23

At first glance I thought I recognized the answer posted by QuestionC (below) as the same algorithm that Alexander Terekhov introduced. But after studying it I believe that it is flawed. It is possible for two writers to simultaneously wait on m_exclusive_cond. When one of those writers wakes and obtains the exclusive lock, it will set exclusive_waiting_blocked = false on unlock, thus setting the mutex into an inconsistent state. After that, the mutex is likely hosed.

N2406, which first proposed std::shared_mutex, contains a partial implementation, repeated below with updated syntax.

#include <climits>              // for CHAR_BIT
#include <condition_variable>
#include <mutex>

using std::mutex;
using std::condition_variable;
using std::unique_lock;
using std::lock_guard;
using std::try_to_lock;

class shared_mutex
{
    mutex    mut_;
    condition_variable gate1_;
    condition_variable gate2_;
    unsigned state_;

    static const unsigned write_entered_ = 1U << (sizeof(unsigned)*CHAR_BIT - 1);
    static const unsigned n_readers_ = ~write_entered_;

public:

    shared_mutex() : state_(0) {}

// Exclusive ownership

    void lock();
    bool try_lock();
    void unlock();

// Shared ownership

    void lock_shared();
    bool try_lock_shared();
    void unlock_shared();
};

// Exclusive ownership

void
shared_mutex::lock()
{
    unique_lock<mutex> lk(mut_);
    while (state_ & write_entered_)
        gate1_.wait(lk);
    state_ |= write_entered_;
    while (state_ & n_readers_)
        gate2_.wait(lk);
}

bool
shared_mutex::try_lock()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    if (lk.owns_lock() && state_ == 0)
    {
        state_ = write_entered_;
        return true;
    }
    return false;
}

void
shared_mutex::unlock()
{
    {
    lock_guard<mutex> _(mut_);
    state_ = 0;
    }
    gate1_.notify_all();
}

// Shared ownership

void
shared_mutex::lock_shared()
{
    unique_lock<mutex> lk(mut_);
    while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_)
        gate1_.wait(lk);
    unsigned num_readers = (state_ & n_readers_) + 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
}

bool
shared_mutex::try_lock_shared()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    unsigned num_readers = state_ & n_readers_;
    if (lk.owns_lock() && !(state_ & write_entered_) && num_readers != n_readers_)
    {
        ++num_readers;
        state_ &= ~n_readers_;
        state_ |= num_readers;
        return true;
    }
    return false;
}

void
shared_mutex::unlock_shared()
{
    lock_guard<mutex> _(mut_);
    unsigned num_readers = (state_ & n_readers_) - 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
    if (state_ & write_entered_)
    {
        if (num_readers == 0)
            gate2_.notify_one();
    }
    else
    {
        if (num_readers == n_readers_ - 1)
            gate1_.notify_one();
    }
}

The algorithm is derived from an old newsgroup posting of Alexander Terekhov. It starves neither readers nor writers.

There are two "gates", gate1_ and gate2_. Readers and writers have to pass gate1_, and can get blocked in trying to do so. Once a reader gets past gate1_, it has read-locked the mutex. Readers can get past gate1_ as long as there are not a maximum number of readers with ownership, and as long as a writer has not gotten past gate1_.

Only one writer at a time can get past gate1_. And a writer can get past gate1_ even if readers have ownership. But once past gate1_, a writer still does not have ownership. It must first get past gate2_. A writer can not get past gate2_ until all readers with ownership have relinquished it. Recall that new readers can't get past gate1_ while a writer is waiting at gate2_. And neither can a new writer get past gate1_ while a writer is waiting at gate2_.

The characteristic that both readers and writers are blocked at gate1_, with (nearly) identical requirements imposed to get past it, is what makes this algorithm fair to both readers and writers, starving neither.

The mutex "state" is intentionally kept in a single word so as to suggest that the partial use of atomics (as an optimization) for certain state changes is a possibility (i.e. for an uncontended "fast path"). However that optimization is not demonstrated here. One example would be if a writer thread could atomically change state_ from 0 to write_entered then he obtains the lock without having to block or even lock/unlock mut_. And unlock() could be implemented with an atomic store. Etc. These optimizations are not shown herein because they are much harder to implement correctly than this simple description makes it sound.

Belton answered 25/1, 2015 at 19:53 Comment(9)
Does the flaw you cite really qualify as a flaw given that the requirement here is only to allow a single writer?Dissatisfactory
I admittedly haven't put the time in for a complete analysis. At the very least I can see a possibility where this second waiting writer is hidden from readers, and thus readers can starve it. Or by "single writer", do you mean only one writer thread? If that is a requirement, I missed that one. The algorithm in my answer can have many reader threads, many writer threads, lets many readers get the lock, or a single writer get the lock, and starves none.Belton
I originally thought it meant there was only a single writer thread, but after rereading the question, I'm a lot less certain of that.Dissatisfactory
@Jerry: I have just found a spot in the code where there are many writers/producers, but only one reader/consumer... :(Marienthal
@Marienthal I'm not sure that this is as efficient as using a single counting semaphore anyway. Condition variables are implemented in VxWorks using a couple more semaphores. This would mean that the readers are touching more than one semaphore. The writer might be more efficient though.Orchard
@sbi, Presumably where there's many writers they are not supposed to be able to write concurrently? In which case this solution of Howard's would still work I think. Oh, in case you're worried about unique_lock and lock_guard, the same functionality can be got from semTake() and semGive().Orchard
In case it isn't clear to readers, the case of many writers and only one reader is equivalent to exclusive access. One might as well use a mutex. Assuming exactly equivalent efficiency of the mutex and shared_mutex implementations there could never be a performance benefit to using shared_mutex. And if the shared_mutex happens to be less efficient (because it is more complicated), then mutex wins by default. If the shared_mutex happens to be more efficient than mutex, then that is simply a bug in the implementation.Belton
This got upvoted much, and I am almost ready to implicitly trust an algorithm Howard posts that has been looked at for years, but I do not see that this is an improvement over qqibrow's answer – which is older – and do believe the upvotes mostly come from other people also implicitly trusting Howard. Therefore, I will not give the bonus to this answer.Marienthal
@Marienthal The advantage of this one is that it does not starve writer threads. If you have a steady stream of readers with the other algorithm, no writer will ever acquire the lock. This algorithm seems immune to that.Earwax
5

Concurrent reads of data protected by a mutex are rather common, while writes are rare

That sounds like an ideal scenario for User-space RCU:

URCU is similar to its Linux-kernel counterpart, providing a replacement for reader-writer locking, among other uses. This similarity continues with readers not synchronizing directly with RCU updaters, thus making RCU read-side code paths exceedingly fast, while furthermore permitting RCU readers to make useful forward progress even when running concurrently with RCU updaters—and vice versa.
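
URCU itself is a C library, but the read-mostly idea the quote describes can be sketched in a few lines of C++11. This is an illustration only, not URCU code, and it deliberately ignores the hard part – safely reclaiming the old copy – which is exactly what RCU's grace periods solve:

#include <atomic>

struct Config { int threshold; int timeout; };

std::atomic<Config*> g_cfg(new Config{10, 100});

int read_threshold()                     // reader: a single atomic load, no lock
{
    return g_cfg.load(std::memory_order_acquire)->threshold;
}

void update(int threshold, int timeout)  // writer: copy, modify, publish
{
    Config* fresh = new Config{threshold, timeout};
    Config* old = g_cfg.exchange(fresh, std::memory_order_acq_rel);
    (void)old;  // real RCU defers freeing `old` until all pre-existing readers are done
}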

Bergerac answered 23/1, 2015 at 11:1 Comment(12)
And that can be used on a platform from the 90s which doesn't even support POSIX?Marienthal
@Marienthal That is right. You can check it out and see if it builds and works.Bergerac
I don't think I will introduce a hitherto totally unknown library into this codebase if a well-written rw-lock would do. (We're doing fault-sensitive stuff.)Marienthal
@Marienthal The idea behind RCU is pretty simple and straightforward. Readers use an atomic instruction to read a pointer. Writers atomically update the pointer to a new value. The only difficulty is disposing of an old value, and there are several ways of doing it; this is what those white papers explore at length.Bergerac
We haven't found atomic operations in that OS's API.Marienthal
@sbi: atomic operations aren't usually in an OS API - the IA32 has the machine opcodes, and if you don't want to compile/link some assembler look for C++ compiler intrinsics/builtins or inline assembly. Google quickly turns up the CPU instructions - e.g. hereIe
@Marienthal I can understand you being cautious but please watch this video before making your mind up. He explains the concept really well and should put you at ease about using it.Agnate
I have talked this through with a co-worker of mine: 1. This would require adding a new and to us unknown library to the code, which we are reluctant to do. 2. We believe that there are spots in our code where continuing to work with old values after new ones are available for some value of "too long" might be a very fatal thing to do. 3. This doesn't come across as the answer of someone who has used this approach for years, knows its ins and outs, and enthusiastically explains how to use it if we wonder about a specific case. Rather than that, it's just a link to a lib and a quote...Marienthal
...which isn't even considered a good answering style on SO. 4. This seems to require dynamic allocation, which is an additional cost we may not be able to afford in all cases we want to use this for. Therefore this answer will not get the bonus.Marienthal
@Marienthal Nope, the mechanism does not require dynamic allocation.Bergerac
@Maxim: However, this was our impression. While it might be wrong, the answer certainly didn't make this clear.Marienthal
@Marienthal I can almost feel you pain :)Bergerac
4

There are some good tricks you can do to help.

First, good performance. VxWorks is notable for its very good context switch times. Whatever locking solution you use, it will likely involve semaphores. I wouldn't be afraid of using semaphores (plural) for this; they're pretty well optimised in VxWorks, and the fast context switch times help minimise the performance degradation from assessing many semaphore states, etc.

Also, I would forget about using POSIX semaphores, which are simply going to be layered on top of VxWorks's own semaphores. VxWorks provides native counting, binary, and mutex semaphores; using the one that suits makes it all a bit faster. The binary ones can be quite useful sometimes: posted to many times, they never exceed a value of 1.

Second, writes being more important than reads. When I've had this kind of requirement in VxWorks and have been using a semaphore(s) to control access, I've used task priority to indicate which task is more important and should get first access to the resource. This works quite well; literally everything in VxWorks is a task (well, thread) like any other, including all the device drivers, etc.

VxWorks also resolves priority inversions (the kind of thing that Linus Torvalds hates). So if you implement your locking with a semaphore(s), you can rely on the OS scheduler to chivvy up lower priority readers if they're blocking a higher priority writer. It can lead to much simpler code, and you're getting the most of the OS too.

So a potential solution is to have a single VxWorks counting semaphore protecting the resource, initialised to a value equal to the number of readers. Each time a reader wants to read, it takes the semaphore (reducing the count by 1). Each time a read is done, it posts the semaphore (increasing the count by 1). Each time the writer wants to write, it takes the semaphore n times (n = number of readers), and posts it n times when done. Finally, make the writer task higher priority than any of the readers, and rely on the OS's fast context switch times and priority-inversion handling.
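
A minimal sketch of that scheme against the classic VxWorks 5.x semLib API (N_READERS and the wrapper names are assumptions; as the comments below point out, it only works with a single writer, or with a separate mutex serialising the writers, because two writers draining the semaphore concurrently will deadlock):

#include <semLib.h>

#define N_READERS 4   /* assumed maximum number of concurrent readers */

SEM_ID rwSem;         /* counting semaphore; count == free reader slots */

void rwInit(void)     { rwSem = semCCreate(SEM_Q_PRIORITY, N_READERS); }

void readLock(void)   { semTake(rwSem, WAIT_FOREVER); }  /* count - 1 */
void readUnlock(void) { semGive(rwSem); }                /* count + 1 */

void writeLock(void)  /* the (higher-priority) writer takes every slot */
{
    int i;
    for (i = 0; i < N_READERS; ++i)
        semTake(rwSem, WAIT_FOREVER);
}

void writeUnlock(void)
{
    int i;
    for (i = 0; i < N_READERS; ++i)
        semGive(rwSem);
}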

Remember that you're programming on a hard-realtime OS, not Linux. Taking / posting a native VxWorks semaphore doesn't involve the same amount of runtime as a similar act on Linux, though even Linux is pretty good these days (I'm using PREEMPT_RT nowadays). The VxWorks scheduler and all the device drivers can be relied upon to behave. You can even make your writer task the highest priority in the whole system if you wish, higher even than all the device drivers!

To help things along, also consider what each of your threads is doing. VxWorks allows you to indicate whether a task is using the FPU. If you're using the native VxWorks taskSpawn() routine instead of pthread_create(), you get the opportunity to specify this. It means that if your thread/task isn't doing any floating-point maths, and you've said so in your call to taskSpawn(), the context switch times will be even faster, because the scheduler won't bother to preserve/restore the FPU state.

This stands a reasonable chance of being the best solution on the platform you're developing on. It's playing to the OS's strengths (fast semaphores, fast context switch times) without introducing a load of extra code to recreate an alternate (and possibly more elegant) solution commonly found on other platforms.

Third, being stuck with an old GCC and old Boost. Basically I can't help there, other than low-value suggestions about phoning up WindRiver and discussing buying an upgrade. Personally speaking, when I've been programming for VxWorks I've used VxWorks's native API rather than POSIX. OK, so the code hasn't been very portable, but it has ended up being fast; POSIX is merely a layer on top of the native API anyway, and that will always slow things down.

That said, POSIX counting and mutex semaphores are very similar to VxWorks's native counting and mutex semaphores. That probably means that the POSIX layering isn't very thick.

General Notes About Programming for VxWorks

Debugging I always sought to use the development tools (Tornado) available for Solaris. This is by far the best multi-threaded debugging environment I've ever come across. Why? It allows you to start up multiple debug sessions, one for each thread/task in the system. You end up with a debug window per thread, and you are individually and independently debugging each one. Step over a blocking operation, that debug window gets blocked. Move mouse focus to another debugging window, step over the operation that will release the block and watch the first window complete its step.

You end up with a lot of debug windows, but it's by far the best way to debug multi-threaded stuff. It made it veeeeery easy to write really quite complex stuff and see problems. You can easily explore the different dynamic interactions in your application because you had simple and all powerful control over what each thread is doing at any time.

Ironically the Windows version of Tornado didn't let you do this; one miserable single debug window per system, just like any other boring old IDE such as Visual Studio, etc. I've never seen even modern IDEs come anywhere close to being as good as Tornado on Solaris for multi-threaded debugging.

Hard Drives If your readers and writers are using files on disk, consider that VxWorks 5.5 is pretty old. Things like NCQ aren't going to be supported. In this case my proposed solution (outlined above) might be better done with a single mutex semaphore to stop multiple readers tripping over each other in their struggle to read different parts of the disk. It depends on what exactly your readers are doing, but if they're reading contiguous data from a file this would avoid thrashing the read/write head to and fro across the disk surface (very slow).

In my case I was using this trick to shape traffic across a network interface; each task was sending a different sort of data, and the task priority reflected the priority of the data on the network. It was very elegant, no message was ever fragmented, but the important messages got the lions share of the available bandwidth.

Orchard answered 27/1, 2015 at 7:7 Comment(7)
Thanks, there's some very good advice in there (like considering counting semaphores). However, we cannot switch platform, and this is VxWorks 5.5, which is, I believe, from 1996. It only has very little POSIX stuff implemented, no threads (no memory protection between tasks), and it's married to the platform we're bound to. (That's where "proprietary version" comes in.) Calling Windriver won't help and the hardware manufacturer won't upgrade. Nevertheless, the counting semaphore idea alone makes this a very helpful answer. Any critique for that?Marienthal
@sbi, no worries; Yes, VxWorks 5.5. is from around about 1996. In essence everything is a thread; every task can see every other task's global symbols. Makes it very fast at context switching, because there's less to do with the MMU when doing the switch. Purely for the sake of interest VxWorks 6 did introduce memory protection. BTW what CPU platform is this? PowerPC?Orchard
@sbi, note that my proposed solution doesn't need a platform change. There are counting semaphores in VxWorks 5.5. Also please disregard my question about using PowerPC. I note from another of your posts that you're using IA32.Orchard
These boxes have flash drives. When we have to write lots of data, we use a separate task for this, because write times, though good on average, can show very bad spikes. (I was told this is inherent to flash drives.) Data mostly comes in and goes out via CAN, Modbus, Profibus, or Profinet, all handled in their own tasks, and we haven't found principal problems with our solutions there. I have found that the best way to debug heavily parallel applications is through good tracing. That we have.Marienthal
After poring over this with a colleague, he pointed out that, since a counting semaphore cannot be taken n times in one atomic operation, your proposed algorithm using them would not just, say, become ineffective, but deadlock, as soon as more than one writer enters the scene. That is pretty bad. So, if there is only ever one writer, and that can safely be expressed in the code, so that the code wouldn't compile when another is introduced, a simple counting semaphore is a beautiful solution. Otherwise, however, I'd rather not create this hell for maintenance programmers.Marienthal
@sbi, no worries, that's fine. When I wrote it I presumed there was only one writer. If there are two or more then yes, it will deadlock. If they're all writing to the same resource then you will presumably still need some form of mutual exclusion between them, unless the resource being written to does that internally. Good luck!Orchard
@Marienthal Of course if you do need mutual exclusion between the writers then you could have a mutex for them. Only when a writer has acquired the mutex would it then start trying to acquire the counting semaphore. Solves the deadlock problem. Obviously if there is no need for mutual exclusion between the writers then this would simply slow all the writers down needlessly. Hope it all goes well!Orchard
2

As always the best solution will depend on details. A read-write spin lock may be what you're looking for, but other approaches such as read-copy-update as suggested above might be a solution - though on an old embedded platform the extra memory used might be an issue. With rare writes I often arrange the work using a tasking system such that the writes can only occur when there are no reads from that data structure, but this is algorithm dependent.
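
For illustration, a minimal read-write spinlock can be sketched on the GCC 4.1 __sync builtins mentioned in the comments below. This is a sketch, not a recommendation; as this thread concludes, spinning on a single-core real-time box just burns the CPU the lock holder needs:

static volatile int rw_state = 0;  // -1: writer holds it, 0: free, >0: reader count

void read_lock()
{
    for (;;) {
        int s = rw_state;
        if (s >= 0 && __sync_bool_compare_and_swap(&rw_state, s, s + 1))
            return;  // joined as one more reader
    }
}

void read_unlock()
{
    __sync_fetch_and_sub(&rw_state, 1);
}

void write_lock()
{
    while (!__sync_bool_compare_and_swap(&rw_state, 0, -1))
        ;  // spin until the lock is completely free
}

void write_unlock()
{
    __sync_bool_compare_and_swap(&rw_state, -1, 0);
}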

Cressler answered 23/1, 2015 at 11:19 Comment(15)
I have briefly looked at this. It seems to depend on atomics, which I don't think we have available. Also, IIUC, "spinning" means to wait busily. In a realtime environment, this you must not do, because it consumes CPU, starving tasks of lower priority. So, thanks for the suggestion, but I think this isn't what we need.Marienthal
Spin locks can use less CPU when locks are only briefly held, since there is a cost to context switching (the CPU state needs to be saved/restored). It's possible your mutexes already do a spinlock followed by a backoff standard mutex, so you may already be covered by using those. However if performance is your goal, then test and measurement should be performed to check. If you're unsure about atomic support on your system, I'd look this up.Cressler
Thanks for the explanation! We're unsure because looking up didn't reveal anything.Marienthal
GCC 4.1.2 should support the builtins __sync_val_compare_and_swap and __sync_add_and_fetch, so you could test to find out if a basic read-write lock with semaphores isn't sufficient.Cressler
Is atomicity something the hardware provides or does it have to be provided by the OS?Marienthal
@Marienthal atomics have existed since the 80486. I'm more than certain you'll be able to use atomics.Sienkiewicz
@Morphing: Ah, so it it's a hardware thing. I didn't know that!Marienthal
@Marienthal It might be worth getting the Pre C++11 version of the C++ Concurrency in Action book if you go this route. There's a lot to catch up on in regards to lock free programming.Sienkiewicz
I advise against using spinlocks in VxWorks. Not sure it even implements them. Given the age of the platform I wouldn't be entirely surprised if the hardware @Marienthal is using is single core.Orchard
@bazza: It certainly is single-core. :-/ So you think my suspicion regarding spinlocks was right (if maybe only accidentally)?Marienthal
@Morphing: Thanks for the suggestion! Would the up-to-date version of the book (that one's already lying around here somewhere) not explain these things?Marienthal
@Marienthal it would but the examples would use new C++ features not available to you.Sienkiewicz
@Marienthal if your hardware is single core then you certainly shouldn't be using a spinlock, as you need the current thread to context switch out to allow the lock holder to complete.Cressler
@Marienthal Spinlocks on a single core really do mean that nothing else is running until the spinning task is preempted for some other reason. Gets you nowhere. The whole point of VxWorks's fast scheduler is that you can use 'proper' things like semaphores with almost the same performance as something like a spinlock, yet be able to do multi-threaded things too.Orchard
Well, I was suspicious of spinlocks from the beginning. We have projects where we have a hard time getting all our communications done in the few msecs given to us per cycle, and where we use less obvious algorithmic approaches because we found that they would drop the CPU usage from 85% to 75%, so CPU usage is a deal breaker for us. So, especially in the light of bazza's undisputed comment that spinlocks are worse on single core hardware, this did not get the bonus. Thanks for answering anyway!Marienthal
1

One algorithm for this based on semaphores and mutexes is described in Concurrent Control with Readers and Writers; P.J. Courtois, F. Heymans, and D.L. Parnas; MBLE Research Laboratory; Brussels, Belgium.
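
For readers without access to the paper, its first (reader-priority) algorithm is short enough to sketch, here written with C++20 std::binary_semaphore purely for readability; on the OP's platform these would be VxWorks binary semaphores. Note that this variant lets a steady stream of readers starve writers; the paper's second, writer-priority algorithm fixes that at the cost of three more semaphores:

#include <semaphore>

std::binary_semaphore rc_mutex{1};  // guards read_count
std::binary_semaphore wrt{1};       // held by the writer, or by the readers as a group
int read_count = 0;

void read_lock()
{
    rc_mutex.acquire();
    if (++read_count == 1)  // first reader locks writers out
        wrt.acquire();
    rc_mutex.release();
}

void read_unlock()
{
    rc_mutex.acquire();
    if (--read_count == 0)  // last reader lets writers back in
        wrt.release();
    rc_mutex.release();
}

void write_lock()   { wrt.acquire(); }
void write_unlock() { wrt.release(); }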

Viewable answered 23/1, 2015 at 18:44 Comment(3)
That's a very interesting paper, and I have learned that CS papers from before the 80s are basically impossible to prove wrong. However, the reader-starving variant of this paper uses quite a number of resources, and all the answers I gathered here so far are far from being that expensive. (Mind you, this could still mean those elders were right and everybody else here has it wrong. I am just not seeing where they are.)Marienthal
Well, after a week it seems there _are_ simpler approaches. The ones posted by qqibrow, Howard, and QuestionC are very similar, and use fewer resources. So I am not picking this answer.Marienthal
This might be a good solution (haven't read it), but it is a link-only answer.Cero
0

This is a simplified answer based on my Boost headers (I would call Boost an approved way). It only requires condition variables and mutexes. I rewrote it using Windows primitives because I find them descriptive and very simple, but view this as pseudocode.

This is a very simple solution, which does not support things like mutex upgrading, or try_lock() operations. I can add those if you want. I also took out some frills like disabling interrupts that aren't strictly necessary.

Also, it's worth checking out boost\thread\pthread\shared_mutex.hpp, which this answer is based on; it's human-readable.

#include <windows.h>

class SharedMutex {
  CRITICAL_SECTION m_state_mutex;
  CONDITION_VARIABLE m_shared_cond;
  CONDITION_VARIABLE m_exclusive_cond;

  size_t shared_count;
  bool exclusive;

  // This causes write blocks to prevent further read blocks
  bool exclusive_waiting_blocked;

public:
  SharedMutex()
  : shared_count(0), exclusive(false), exclusive_waiting_blocked(false)
  {
    InitializeConditionVariable (&m_shared_cond);
    InitializeConditionVariable (&m_exclusive_cond);
    InitializeCriticalSection (&m_state_mutex);
  }

  ~SharedMutex()
  {
    // Win32 CONDITION_VARIABLEs need no teardown; only the critical
    // section has to be deleted.
    DeleteCriticalSection (&m_state_mutex);
  }

  // Write lock
  void lock(void)
  {
    EnterCriticalSection (&m_state_mutex);
    while (shared_count > 0 || exclusive)
    {
      exclusive_waiting_blocked = true;
      SleepConditionVariableCS (&m_exclusive_cond, &m_state_mutex, INFINITE);
    }
    // This thread now 'owns' the mutex
    exclusive = true;

    LeaveCriticalSection (&m_state_mutex);
  }

  void unlock(void)
  {
    EnterCriticalSection (&m_state_mutex);
    exclusive = false;
    exclusive_waiting_blocked = false;
    LeaveCriticalSection (&m_state_mutex);
    WakeConditionVariable (&m_exclusive_cond);
    WakeAllConditionVariable (&m_shared_cond);
  }

  // Read lock
  void lock_shared(void)
  {
    EnterCriticalSection (&m_state_mutex);
    while (exclusive || exclusive_waiting_blocked)
    {
      SleepConditionVariableCS (&m_shared_cond, &m_state_mutex, INFINITE);
    }
    ++shared_count;
    LeaveCriticalSection (&m_state_mutex);
  }

  void unlock_shared(void)
  {
    EnterCriticalSection (&m_state_mutex);
    --shared_count;

    if (shared_count == 0)
    {
      exclusive_waiting_blocked = false;
      LeaveCriticalSection (&m_state_mutex);
      WakeConditionVariable (&m_exclusive_cond);
      WakeAllConditionVariable (&m_shared_cond);
    }
    else
    {
      LeaveCriticalSection (&m_state_mutex);
    }
  }
};

Behavior

Okay, there is some confusion about the behavior of this algorithm, so here is how it works.

During a Write Lock - Both readers and writers are blocked.

At the end of a Write Lock - Reader threads and one writer thread will race to see which one starts.

During a Read Lock - Writers are blocked. Readers are also blocked if and only if a Writer is blocked.

At the release of the final Read Lock - Reader threads and one writer thread will race to see which one starts.

This could cause readers to starve writers if the processor frequently context switches over to an m_shared_cond thread before an m_exclusive_cond thread during notification, but I suspect that issue is theoretical and not practical since it's Boost's algorithm.

Sp answered 23/1, 2015 at 19:42 Comment(18)
It seems this would block writers until all readers are done, right?Marienthal
Yes. But readers can not starve writers because the exclusive_waiting_blocked flag prevents further readers from starting up.Sp
It does EnterCriticalSection on every operation. Windows' CRITICAL_SECTION is an in-process mutex with an optional initial busy-wait. Your solution is essentially a reader-writer lock based on a mutex.Bergerac
From here it looks as if unlock_shared(), when clearing the exclusive_waiting_blocked flag, releases both readers and writers, no?Marienthal
Oh, wait. It signals writers first. I hadn't seen that initially. Well, now it seems a serious bonus candidate! (That means: critique is very welcome!)Marienthal
This looks like it is probably the same algorithm as described here: open-std.org/jtc1/sc22/wg21/docs/papers/2007/… Which is what the boost implementation was derived from. And this was derived from an old newsgroup posting by Alexander Terekhov. Its main feature is that it gives neither readers nor writers priority. When nothing has the mutex and a reader and a writer simultaneously hit the mutex, the OS decides which one gets in first. When a single writer gets in, it waits for all existing readers to exit, and no new readers can get in. Nobody starves.Belton
Yes, this is boost's algorithm.Sp
@Howard: And as long as a single reader has it, new coming readers bypass a waiting writer, adding additional delay to its waiting time. At least, that's what I read from the algorithm.Marienthal
In open-std.org/jtc1/sc22/wg21/docs/papers/2007/… when a new reader and new writer simultaneously lock the mutex when one or more readers already have the lock, there is a 50/50 chance of which thread gets in: The OS decides. If the new reader gets in, the writer thread also gets in but then waits for all readers (including the new one) to exit. If the writer gets in, the new reader is blocked until the writer exits. In either event no new readers can get in while the writer is waiting. I have not confirmed that this answer is the Terekhov two-gate algorithm.Belton
Studied it. I do not believe this is the Terekhov two-gate algorithm. Adding an answer...Belton
@Howard: If that was a reply to me: The way I read this implementation (please correct me if I'm wrong here), readers will always proceed if another reader already has the lock, thereby passing by any waiting writers. This is my problem with this solution. (I think I know how to fix it, but my experience WRT concurrency is limited, and I came here for an approved solution, not for the result of my own tinkering.Marienthal
@HowardHinnant The flaw you cited for my answer isn't really there. The invariant is exclusive || exclusive_waiting_blocked. In a scenario with two writers, the one responsible for exclusive_waiting_blocked = false will safely set exclusive = true, and on unlock the woken up writer will reset exclusive_waiting_blocked = true if it lost the race to another waking up thread. There is no state violation with multiple writers, the meaning of exclusive_waiting_blocked is just nuanced.Sp
@Marienthal If a writer is blocking, then exclusive || exclusive_waiting_blocked is true, causing reads to block. I added a description of how this algorithm behaves, I hope that answers any questions.Sp
@QuestionC: On unlock the writer thread sets both exclusive and exclusive_waiting_blocked to false and sends signals. However there is no guarantee that I can see that the next woken thread isn't a reader thread that wakes up, sees both exclusive and exclusive_waiting_blocked false, and assumes ownership, even though another thread may have previously set exclusive_waiting_blocked to true and be waiting on m_exclusive_cond.Belton
Now the mutex is in the state of being owned by a reader thread, and having a writer thread waiting on m_exclusive_cond, and yet exclusive_wait_blocked == false. At this point my confidence fails.Belton
@HowardHinnant Stop obsessing over exclusive_wait_blocked, control is based on exclusive || exclusive_waiting_blocked. When m_exclusive_cond is notified, there are only two possible control paths (leave the loop or stay in it) and each path sets one of those variables. I don't know how to explain this besides resorting to boxes and arrows. And really, are you more confident in your own analysis than Boost?Sp
@QuestionC: I'm pretty sure the other boost folk listen very carefully every time Howard critiques their shared mutex, and they have good reason to do so.Marienthal
While I also believe that this algorithm works, and @Howard is wrong here, the co-worker and I who have been looking at this also believe that the fact that exclusive_wait_blocked is a boolean, rather than an unsigned integer, is a disadvantage – if for no other reason then because it makes the code harder to reason about. This is why this answer will not get the bonus.Marienthal
0

Now that Microsoft has opened up the .NET source code, you can look at their ReaderWriterLockSlim implementation.

I'm not sure the more basic primitives they use are available to you; some of them are also part of the .NET library, and their code is also available.

Microsoft has spent quite a lot of time on improving the performance of their locking mechanisms, so this can be a good starting point.

Trapeziform answered 30/1, 2015 at 8:18 Comment(1)
I have very briefly looked at this. This seems to use a dozen or two member variables, and therefore is everything but lightweight. I did not look much further.Marienthal
