Can a correct fail-safe process-shared barrier be implemented on Linux?

In a past question, I asked about implementing pthread barriers without destruction races:

How can barriers be destroyable as soon as pthread_barrier_wait returns?

and received from Michael Burr a perfect solution for process-local barriers, but one which fails for process-shared barriers. We later worked through some ideas, but never reached a satisfactory conclusion, and didn't even begin to get into the resource-failure cases.

Is it possible on Linux to make a barrier that meets these conditions:

  • Process-shared (can be created in any shared memory).
  • Safe to unmap or destroy the barrier from any thread immediately after the barrier wait function returns.
  • Cannot fail due to resource allocation failure.

Michael's attempt at solving the process-shared case (see the linked question) has the unfortunate property that some kind of system resource must be allocated at wait time, meaning the wait can fail. And it's unclear what a caller could reasonably do when a barrier wait fails, since the whole point of the barrier is that it's unsafe to proceed until the remaining N-1 threads have reached it...

A kernel-space solution might be the only way, but even that's difficult due to the possibility of a signal interrupting the wait with no reliable way to resume it...

Schaerbeek answered 4/8, 2011 at 3:8 Comment(9)
I'm not sure I really understand your question. From what I can tell, the NPTL implementation is fail-safe: the barrier itself is allocated externally, and its pthread_barrier_destroy function only synchronizes with the last waiter. What's the issue, exactly?Sandlin
@Hasturkun: See glibc bug #13065: sourceware.org/bugzilla/show_bug.cgi?id=13065Schaerbeek
Although the attachment there is missing, it appears to describe a case where the memory is unmapped before the barrier is destroyed. If I understand correctly, it is never safe to unmap the barrier before destroying it (unless you can guarantee that none of your threads are using the barrier, assuming you are not the last user). The NPTL implementation never accesses anything without the lock, which pthread_barrier_destroy takes, ensuring that nobody accesses the barrier internals once the destroy succeeds (which it won't do if any thread is waiting).Sandlin
It is safe to unmap a barrier before it's destroyed as long as other destroyable mappings (e.g. in other processes) remain. This would be a standard usage for process-shared synchronization objects where one process only needs access to it for a fixed time.Schaerbeek
It would be safe if you can guarantee that your process isn't using it, in which case there really shouldn't be an issue. I don't see any reason to expect any sort of defined behavior if you free a barrier's (or any other primitive's) underlying memory before the relevant functions have returned. You should synchronize all users before unmapping the memory, either explicitly or by correctly using pthread_barrier_destroy. In the latter case, you can later re-initialize the barrier for reuse.Sandlin
Btw, I think the NPTL implementation would qualify for unmap if you changed the 2nd condition to be "all barrier wait functions return". It currently only assures safety if a single thread (e.g. PTHREAD_BARRIER_SERIAL_THREAD) destroys and then unlinks.Sandlin
You're reading the spec the way a lazy implementor would choose to read it rather than at face value (and as someone wanting to write applications would read it)...Schaerbeek
I think this could be solved by having the waiting threads requeue on a process-local barrier (or equivalent) before returning, with the last to leave waking the others up. This would guarantee that your process is no longer accessing the barrier object when the wait returns, thus making it safe to unmap.Sandlin
The problem is where to store the data necessary for them to negotiate a process-local barrier to wait on. Inside the barrier does not work because there could be arbitrarily many processes using it. Global data in the process does not work without >O(1) search structures being used.Schaerbeek

After a long discussion with bdonlan on SO chat, I think I have a solution. Basically, we break the problem down into the two self-synchronized deallocation issues: the destroy operation and unmapping.

Handling destruction is easy: Simply make the pthread_barrier_destroy function wait for all waiters to stop inspecting the barrier. This can be done by having a usage count in the barrier, atomically incremented/decremented on entry/exit to the wait function, and having the destroy function spin waiting for the count to reach zero. (It's also possible to use a futex here, rather than just spinning, if you stick a waiter flag in the high bit of the usage count or similar.)
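A minimal sketch of that counting scheme in C (the type and function names are illustrative, not an actual implementation):

#include <sched.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint users;  /* threads currently inside the wait function */
    /* ... the rest of the barrier state ... */
} pshared_barrier_t;

/* The wait function brackets every access to the barrier's memory: */
static void wait_enter(pshared_barrier_t *b) {
    atomic_fetch_add(&b->users, 1);
}
static void wait_exit(pshared_barrier_t *b) {
    atomic_fetch_sub(&b->users, 1);
}

/* Destroy spins (or futex-waits) until no thread is still
 * inspecting the barrier, then tears it down. */
void barrier_destroy(pshared_barrier_t *b) {
    while (atomic_load(&b->users) != 0)
        sched_yield();
    /* now safe to release the barrier's resources */
}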

Handling unmapping is also easy, but non-local: ensure that munmap or mmap with the MAP_FIXED flag cannot occur while barrier waiters are in the process of exiting, by adding locking to the syscall wrappers. This requires a specialized sort of reader-writer lock. The last waiter to reach the barrier should grab a read lock on the munmap rw-lock, which will be released when the final waiter exits (when decrementing the user count results in a count of 0). munmap and mmap can be made reentrant (as some programs might expect, even though POSIX doesn't require it) by making the writer lock recursive. Actually, a sort of lock where readers and writers are entirely symmetric, and each type of lock excludes the opposite type of lock but not the same type, should work best.
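For the symmetric lock, something along these lines could work; this sketch uses a mutex and condition variable for brevity (all names are made up, and a real version would need to address starvation):

#include <pthread.h>

/* Two "sides" exclude each other but not themselves: side 0 for
 * exiting barrier waiters, side 1 for munmap/mmap(MAP_FIXED). */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t cv;
    int held[2];  /* active holders on each side */
} sidelock_t;

void sidelock_lock(sidelock_t *l, int side) {
    pthread_mutex_lock(&l->m);
    while (l->held[1 - side] > 0)  /* wait out the opposite side only */
        pthread_cond_wait(&l->cv, &l->m);
    l->held[side]++;
    pthread_mutex_unlock(&l->m);
}

void sidelock_unlock(sidelock_t *l, int side) {
    pthread_mutex_lock(&l->m);
    if (--l->held[side] == 0)  /* last one out admits the other side */
        pthread_cond_broadcast(&l->cv);
    pthread_mutex_unlock(&l->m);
}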

Schaerbeek answered 27/9, 2011 at 4:56 Comment(2)
A link to the chat for posterity: chat.stackoverflow.com/rooms/3811/…Stagnate
I've implemented the design we discussed and it seems to be working. It passed my basic tests for unmap/destroy race conditions, and the NPTL barrier tests, but I haven't subjected it to heavy stress tests yet. Performance suffers a bit since all waiters actually have to wait twice (a second time once they're all at the barrier so they can all take the munmap lock before exiting) but I'm still using the old design (courtesy of Michael Burr) for non-pshared barriers.Schaerbeek

This is not possible with the Linux futex API, and I think this can be proven as well.

We have here essentially a scenario in which N processes must be reliably awoken by one final process, and further no process may touch any shared memory after the final awakening (as it may be destroyed or reused asynchronously). While we can awaken all processes easily enough, the fundamental race condition is between the wakeup and the wait; if we issue the wakeup before the wait, the straggler never wakes up.

The usual solution to something like this is to have the straggler check a status variable atomically with the wait; this allows it to avoid sleeping at all if the wakeup has already occurred. However, we cannot do this here - as soon as the wakeup becomes possible, it is unsafe to touch shared memory!

One other approach is to actually check whether all processes have gone to sleep yet. However, this is not possible with the Linux futex API; the only indication of the number of waiters is the return value from FUTEX_WAKE: if it returns fewer than the number of waiters you expected, you know some weren't asleep yet. However, even if we find out we haven't woken enough waiters, it's too late to do anything about it - one of the processes that did wake up may have destroyed the barrier already!
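To make the race concrete, here is a deliberately naive futex barrier in C (a sketch only; the futex() wrapper and the count/seq words are made up for illustration):

#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* count starts at N; seq is the wakeup sequence word. */
void naive_barrier_wait(atomic_int *count, atomic_int *seq) {
    int s = atomic_load(seq);
    if (atomic_fetch_sub(count, 1) == 1) {
        atomic_fetch_add(seq, 1);         /* waiters may now return and  */
        futex(seq, FUTEX_WAKE, INT_MAX);  /* free the memory, yet this   */
        return;                           /* wake still references it    */
    }
    while (atomic_load(seq) == s)         /* same hazard: this re-check  */
        futex(seq, FUTEX_WAIT, s);        /* touches possibly-dead memory */
}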

So, unfortunately, this kind of immediately-destroyable primitive cannot be constructed with the Linux futex API.

Note that in the specific case of one waiter, one waker, it may be possible to work around the problem; if FUTEX_WAKE returns zero, we know nobody has actually been awoken yet, so you have a chance to recover. Making this into an efficient algorithm, however, is quite tricky.
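For example (reusing the futex() wrapper from the sketch above; word is a hypothetical futex word):

int woken = futex(&word, FUTEX_WAKE, 1);
if (woken == 0) {
    /* The lone waiter never consumed the wakeup, so it has not
     * returned from the barrier; the memory is still safe to touch
     * and we can retry or switch to a fallback path. */
}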

It's tricky to add a robust extension to the futex model that would fix this. The basic problem is that we need to know when N threads have successfully entered their wait, and atomically awaken them all. However, any of those threads may leave the wait to run a signal handler at any time - indeed, the waker thread may leave the wait for signal handlers as well.

One possible way that may work, however, is an extension to the keyed event model in the NT API. With keyed events, threads are released from the lock in pairs; if you have a 'release' without a 'wait', the 'release' call blocks for the 'wait'.

This in itself isn't enough due to the issues with signal handlers; however, if we allow the 'release' call to specify a number of threads to be awoken atomically, it works. You simply have each thread in the barrier decrement a count, then 'wait' on a keyed event on that address. The last thread 'releases' the other N-1 threads. The kernel doesn't allow the wake to be processed until all N-1 threads have entered this keyed-event wait state; if any thread leaves the wait due to a signal (including the releasing thread), this prevents any wakeups at all until all threads are back.
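In pseudocode, the hypothetical primitive might be used like this (keyed_wait and keyed_release do not exist on Linux; they stand in for the proposed extension):

void barrier_wait(barrier_t *b) {
    if (atomic_fetch_sub(&b->count, 1) > 1) {
        keyed_wait(&b->count);            /* parked in the kernel */
    } else {
        /* The kernel refuses to complete the release until all N-1
         * waiters are parked; a signal to any of them (or to us)
         * rolls the whole step back until everyone returns. */
        keyed_release(&b->count, b->n - 1);
    }
}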

Stagnate answered 24/9, 2011 at 3:32 Comment(8)
Here's an idea: the FUTEX_REQUEUE operation can be used (with identical source and dest address and 0 wakes) to query how many waiters a futex has (it returns the sum of the number woken and the number requeued). It can also (presumably) be used to requeue a waiter onto a futex not even visible in the waiter's address space. Perhaps these operations are enough to implement something like keyed events? The only big problem, I think, is what happens when the waiter is interrupted by a signal. Even if you blocked all signals (not quite legal...) there's still SIGSTOP...Schaerbeek
Querying the number of waiters is inherently racy - the value could change the instant the syscall returns, so it's useless as anything other than a performance optimization. And even if you start moving waiters one-at-a-time to a stack-local futex variable on the waker's side, the waiters could be awoken by a signal, and would then return to their original wait. No, I think this really needs a new kernel-side operation.Stagnate
Are you sure they'd restart the wait on their original queue if interrupted by a signal? Maybe so; I'd have to check. I rather expected the futex wait syscall to fail with EAGAIN or return 0 (wake) if interrupted after being requeued, similar to how read or write would return a partial transfer if interrupted after some data is read/written. Of course to be sure I should really test it...Schaerbeek
The other idea I had was for the pthread_barrier_init operation for a process-shared barrier to create a new process that would be the permanent barrier manager, and communicate over the futex. The final wakeup could be controlled by querying some filesystem object under the control of the manager process - something that could be queried without having to allocate resources, such as with access(). Of course, since there seems to be no way to block waiting for such an event, it would be desirable to optimize the wait with futex ops (the above idea) and only use the fs for the "final check".Schaerbeek
@R.., yeah, it'll fail and return EAGAIN. At which point the barrier code has to go back to waiting on the original queue. Where else will it wait, after all? Then consider the race where, just as the waker is issuing the FUTEX_WAKE on the private queue and returning from the barrier function, the sleeper is awoken by a signal and goes back to waiting on the original queue - even though it's no longer safe to access that memory.Stagnate
That's why I was thinking of only re-accessing the futex if access() with R_OK on a filesystem object under the control of the manager process succeeded. The manager would change it between mode 444 and mode 000 depending on the state, to control what access() would return. As far as I know, neither fchmod nor access should fail or return incorrect results due to resource issues.Schaerbeek
Yeah, but if you allow access to external objects there are far simpler solutions - e.g., map a file from this common area with just a counter; when it hits zero, wake up all, unlink, and each process unmaps on its own. Or even just use a Unix domain socket - the first process in listens and acts as a master.Stagnate
let us continue this discussion in chatStagnate

Well, I think I can do it with a clumsy approach...

Have the "barrier" be its own process listening on a socket. Implement barrier_wait as:

open connection to barrier process
send message telling barrier process I am waiting
block in read() waiting for reply

Once N threads are waiting, the barrier process tells all of them to proceed. Each waiter then closes its connection to the barrier process and continues.

Implement barrier_destroy as:

open connection to barrier process
send message telling barrier process to go away
close connection

Once all connections are closed and the barrier process has been told to go away, it exits.
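A sketch of the manager's main loop, assuming a hypothetical one-byte protocol ('W' = I am waiting, 'D' = destroy, 'G' = go) over a Unix-domain socket (every name here is made up for illustration):

#include <sys/socket.h>
#include <unistd.h>

void barrier_manager(int listen_fd, int n) {
    int waiters[n];  /* C99 VLA; fine for a sketch */
    int count = 0;
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        char cmd;
        if (fd < 0 || read(fd, &cmd, 1) != 1) {
            if (fd >= 0) close(fd);
            continue;
        }
        if (cmd == 'D') {          /* destroy: assumes no waiters remain */
            close(fd);
            close(listen_fd);
            return;
        }
        waiters[count++] = fd;     /* 'W': the caller blocks in read() */
        if (count == n) {          /* everyone is here: release them all */
            for (int i = 0; i < n; i++) {
                write(waiters[i], "G", 1);
                close(waiters[i]);
            }
            count = 0;
        }
    }
}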

[Edit: Granted, this allocates and destroys a socket as part of the wait and release operations. But I think you can implement the same protocol without doing so; see below.]

First question: Does this protocol actually work? I think it does, but maybe I do not understand the requirements.

Second question: If it does work, can it be simulated without the overhead of an extra process?

I believe the answer is "yes". You can have each thread "take the role of" the barrier process at the appropriate time. You just need a master mutex, held by whichever thread is currently "taking the role" of the barrier process. Details, details... OK, so the barrier_wait might look like:

lock(master_mutex);
unsigned my_generation = generation;
++waiter_count;
if (waiter_count < N)
{
    // cond_wait can wake spuriously, so loop until this generation
    // of the barrier has actually been released
    while (generation == my_generation)
        cond_wait(master_condition_variable, master_mutex);
}
else
{
    ++generation;  // releases every waiter blocked on my_generation
    cond_broadcast(master_condition_variable);
}
--waiter_count;
bool do_release = time_to_die && waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();

Here master_mutex (a mutex), master_condition_variable (a condition variable), waiter_count and generation (unsigned integers), N (another unsigned integer), and time_to_die (a Boolean) are all shared state allocated and initialized by barrier_init. waiter_count and generation are initialized to zero, time_to_die to false, and N to the number of threads the barrier is waiting for. (The generation counter exists only to guard against spurious wakeups from cond_wait.)

Then barrier_destroy would be:

lock(master_mutex);
time_to_die = true;
bool do_release = waiter_count == 0;
unlock(master_mutex);
if (do_release)
    release_resources();

Not sure about all the details concerning signal handling etc... But the basic idea of "last one out turns off the lights" is workable, I think.

Distance answered 4/8, 2011 at 4:37 Comment(9)
"Open connection" is an operation that can fail due to too many open fds in the process or globally on the system. :-( I have a much lighter (no extra process needed) way to do a barrier just having the first waiter create a fifo and subsequent waiters block trying to open it, but this still depends on the ability to successfully create a fifo. I suppose one could go into an infinite loop waiting for it to succeed...Schaerbeek
Second half of my answer does not allocate any resources except during init... Does it work?Distance
Your approach protects against the case where the race is between returning from the barrier and destroying it, but I don't think it helps at all when the caller unmaps the barrier without destroying it. This of course is a typical usage for process-shared barriers where you may want to leave it around for other processes to use.Schaerbeek
@R..: Why would the caller unmap the barrier without destroying it first?Sandlin
Because there are many processes using the barrier. It might belong to a server process you're done communicating with, but which still needs the barrier for later communication with other processes.Schaerbeek
@R: With my solution, map/unmap is an orthogonal problem... All you have to do is figure out how to manage creation/destruction of any shared memory and you are done. Not necessarily trivial, but it has nothing to do with barriers per se.Distance
@R: For example, the process that creates the barrier could use shm_open+ftruncate+mmap to obtain the shared memory and then immediately shm_unlink it. Then it just passes the file descriptor to any process needing access to the barrier. Said process does mmap() and then close() on the descriptor. Master does close() once all processes have been given the fd. Any process does munmap() whenever they want to. Kernel will release shared memory segment once the last process has unmapped it.Distance
If you're going to do that, the first waiter has to be the one to create the shared memory and pass it on to subsequent waiters, much like Michael's approach with an automatic-storage-duration instance context, but with allocation of shared memory. You can't allocate the resource like you described in the barrier init function because (1) it won't have a name when subsequent threads try to access it, and (2) you can't ensure that the permissions would match the permissions on the shared memory the barrier lives in.Schaerbeek
@R: But these are fundamental concerns... somebody has to initialize the barrier. That only happens once. And only after that does it make sense to use the barrier. So there is already a "master" process that must initialize the barrier and pass it to the other processes, or somehow notify them that it is ready to use...Distance
