Looking at the pthread_cond_signal() implementation, we find a comment briefly explaining what the code does:
Load the waiter sequence number, which represents our relative ordering to any waiters. Relaxed MO is sufficient for that because:
- We can pick any position that is allowed by external happens-before constraints. In particular, if another __pthread_cond_wait call happened before us, this waiter must be eligible for being woken by us. The only way to establish such a happens-before is by signaling while having acquired the mutex associated with the condvar and ensuring that the signal's critical section happens after the waiter. Thus, the mutex ensures that we see that waiter's __wseq increase.
- Once we pick a position, we do not need to communicate this to the program via a happens-before that we set up: First, any wake-up could be a spurious wake-up, so the program must not interpret a wake-up as an indication that the waiter happened before a particular signal; second, a program cannot detect whether a waiter has not yet been woken (i.e., it cannot distinguish between a non-woken waiter and one that has been woken but hasn't resumed execution yet), and thus it cannot try to deduce that a signal happened before a particular waiter.
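To make the memory-ordering argument concrete, here is a minimal C11-atomics sketch of the two sides (purely illustrative; condvar_t, get_position(), and sample_sequence() are my names, not glibc's, and the real code additionally packs flag bits into the counter):

#include <stdatomic.h>
#include <stdint.h>

typedef struct { _Atomic uint64_t wseq; } condvar_t; // illustrative stand-in

// Waiter side: obtain a position in the waiter sequence. In glibc this
// happens while holding the mutex associated with the condvar.
uint64_t get_position(condvar_t *cv)
{
    // Relaxed MO is fine for the counter itself; it is the mutex's
    // release/acquire pairing that makes this increment visible to a
    // signaler that acquires the same mutex afterwards.
    return atomic_fetch_add_explicit(&cv->wseq, 1, memory_order_relaxed);
}

// Signaler side: sample the sequence with a relaxed load. Any waiter
// whose position is less than the sampled value is eligible to consume
// this signal.
uint64_t sample_sequence(condvar_t *cv)
{
    return atomic_load_explicit(&cv->wseq, memory_order_relaxed);
}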
As mentioned, there is more information about the algorithm in the pthread_cond_wait() implementation:
This condvar implementation guarantees that all calls to signal and broadcast and all of the three virtually atomic parts of each call to wait (i.e., (1) releasing the mutex and blocking, (2) unblocking, and (3) re-acquiring the mutex) happen in some total order that is consistent with the happens-before relations in the calling program. However, this order does not necessarily result in additional happens-before relations being established (which aligns well with spurious wake-ups being allowed).
All waiters acquire a certain position in a 64b waiter sequence (__wseq). This sequence determines which waiters are allowed to consume signals. A broadcast is equal to sending as many signals as are unblocked waiters. When a signal arrives, it samples the current value of __wseq with a relaxed-MO load (i.e., the position the next waiter would get). (This is sufficient because it is consistent with happens-before; the caller can enforce stronger ordering constraints by calling signal while holding the mutex.) Only waiters with a position less than the __wseq value observed by the signal are eligible to consume this signal.
This would be straightforward to implement if waiters just spun, but we need to let them block using futexes. Futexes give no guarantee of waking in FIFO order, so we cannot reliably wake eligible waiters if we just use a single futex. Also, futex words are 32b in size, but we need to distinguish more than 1<<32 states because we need to represent the order of wake-up (and thus which waiters are eligible to consume signals); blocking on a futex is not atomic with a waiter determining its position in the waiter sequence, so we need the futex word to reliably notify waiters that they should not attempt to block anymore because they have already been signaled in the meantime. While an ABA issue on a 32b value will be rare, ignoring it when we are aware of it is not the right thing to do either.
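The "blocking is not atomic" point comes from how futexes work: the kernel compares the 32b futex word against a caller-supplied value and blocks only if they still match. A minimal wrapper around the Linux futex(2) system call shows the shape of it (illustrative code, not glibc's internal wrapper):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

// Block only if *addr still equals expected when the kernel checks it.
// If the word changed in the meantime (e.g., a signal arrived between
// our load and this call), the kernel returns immediately with EAGAIN.
static long futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

This compare-and-block is exactly what lets the condvar tell a late waiter not to block at all -- but with only 32 bits, a word that wrapped around to the same value would fool the check, hence the ABA concern.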
Therefore, we use a 64b counter to represent the waiter sequence (on architectures which only support 32b atomics, we use a few bits less). To deal with the blocking using futexes, we maintain two groups of waiters:
- Group G1 consists of waiters that are all eligible to consume signals; incoming signals will always signal waiters in this group until all waiters in G1 have been signaled.
- Group G2 consists of waiters that arrive when a G1 is present and still contains waiters that have not been signaled. When all waiters in G1 are signaled and a new signal arrives, the new signal will convert G2 into the new G1 and create a new G2 for future waiters.
We cannot allocate new memory because of process-shared condvars, so we have just two slots of groups that change their role between G1 and G2. Each has a separate futex word, a number of signals available for consumption, a size (number of waiters in the group that have not been signaled), and a reference count.
The group reference count is used to maintain the number of waiters that are using the group's futex. Before a group can change its role, the reference count must show that no waiters are using the futex anymore; this prevents ABA issues on the futex word.
To represent which intervals in the waiter sequence the groups cover (and thus also which group slot contains G1 or G2), we use a 64b counter to designate the start position of G1 (inclusive), and a single bit in the waiter sequence counter to represent which group slot currently contains G2. This allows us to switch group roles atomically wrt. waiters obtaining a position in the waiter sequence. The G1 start position allows waiters to figure out whether they are in a group that has already been completely signaled (i.e., if the current G1 starts at a later position than the waiter's position). Waiters cannot determine whether they are currently in G2 or G1 -- but they do not have to because all they are interested in is whether there are available signals, and they always start in G2 (whose group slot they know because of the bit in the waiter sequence counter). Signalers will simply fill the right group until it is completely signaled and can be closed (they do not switch group roles until they really have to, to decrease the likelihood of having to wait for waiters still holding a reference on the now-closed G1).
Signalers maintain the initial size of G1 to be able to determine where G2 starts (G2 is always open-ended until it becomes G1). They track the remaining size of a group; when waiters cancel waiting (due to PThreads cancellation or timeouts), they will decrease this remaining size as well.
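Condensing the last few paragraphs into code, the role switch can be pictured with the following toy model (heavily simplified: no atomics, futexes, reference counts, or the flag-bit packing in __g1_start; the field names mirror the description, the function itself is hypothetical):

#include <stdint.h>

struct toy_condvar {
    uint64_t wseq;         // next waiter position << 1; LSB = index of G2
    uint64_t g1_start;     // first waiter position belonging to G1
    unsigned g1_orig_size; // size of G1 when it was switched in
    unsigned g_size[2];    // unsignaled waiters per slot (always 0 for G2)
};

// Called when a signal arrives and G1 is already fully signaled:
// the old G1 is closed, G2 becomes the new G1, and a fresh, empty G2
// starts in the slot the old G1 occupied.
static void toy_switch_g1(struct toy_condvar *cv)
{
    unsigned old_g2   = cv->wseq & 1;   // slot that currently holds G2
    uint64_t next_pos = cv->wseq >> 1;  // position the next waiter would get

    cv->g1_start += cv->g1_orig_size;                        // old G1 fully covered
    cv->g1_orig_size = (unsigned)(next_pos - cv->g1_start);  // span of the new G1
    cv->g_size[old_g2] = cv->g1_orig_size;                   // old G2 slot is the new G1
    cv->wseq ^= 1;                                           // flip which slot is G2
}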
To implement condvar destruction requirements (i.e., that pthread_cond_destroy can be called as soon as all waiters have been signaled), waiters increment a reference count before starting to wait and decrement it after they stopped waiting but right before they acquire the mutex associated with the condvar.
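In C11-atomics terms, that protocol looks roughly like this (a sketch under simplifying assumptions: the flag bits that glibc packs into the same word are omitted, and destruction spins here instead of blocking on a futex):

#include <stdatomic.h>

static _Atomic unsigned wrefs; // toy stand-in for the condvar's __wrefs

void waiter_begin(void)
{
    atomic_fetch_add_explicit(&wrefs, 1, memory_order_relaxed);
    // ... release the mutex, block, get woken ...
}

void waiter_end(void)
{
    // Drop the reference right before re-acquiring the associated mutex;
    // once the count reaches zero, destruction may proceed.
    atomic_fetch_sub_explicit(&wrefs, 1, memory_order_release);
}

void destroy_wait(void)
{
    // pthread_cond_destroy must wait until no waiter still references
    // the condvar; glibc blocks on a futex here rather than spinning.
    while (atomic_load_explicit(&wrefs, memory_order_acquire) != 0)
        ;
}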
pthread_cond_t thus consists of the following (bits that are used for flags and are not part of the primary value of each field but necessary to make some things atomic or because there was no space for them elsewhere in the data structure):
__wseq: Waiter sequence counter
- LSB is index of current G2.
- Waiters fetch-add while having acquired the mutex associated with the condvar. Signalers load it and fetch-xor it concurrently.
__g1_start: Starting position of G1 (inclusive)
- LSB is index of current G2.
- Modified by signalers while having acquired the condvar-internal lock and observed concurrently by waiters.
__g1_orig_size: Initial size of G1
- The two least-significant bits represent the condvar-internal lock.
- Only accessed while having acquired the condvar-internal lock.
__wrefs: Waiter reference counter.
- Bit 2 is true if waiters should run futex_wake when they remove the last reference. pthread_cond_destroy uses this as futex word.
- Bit 1 is the clock ID (0 == CLOCK_REALTIME, 1 == CLOCK_MONOTONIC).
- Bit 0 is true iff this is a process-shared condvar.
- Simple reference count used by both waiters and pthread_cond_destroy. (If the format of __wrefs is changed, update nptl_lock_constants.pysym and the pretty printers.)
For each of the two groups, we have:
__g_refs: Futex waiter reference count.
- LSB is true if waiters should run futex_wake when they remove the last reference.
- Reference count used by waiters concurrently with signalers that have acquired the condvar-internal lock.
__g_signals: The number of signals that can still be consumed.
- Used as a futex word by waiters. Used concurrently by waiters and signalers.
- LSB is true iff this group has been completely signaled (i.e., it is closed).
__g_size: Waiters remaining in this group (i.e., which have not been signaled yet).
- Accessed by signalers and waiters that cancel waiting (both do so only when having acquired the condvar-internal lock).
- The size of G2 is always zero because it cannot be determined until the group becomes G1.
- Although this is of unsigned type, we rely on using unsigned overflow rules to make this hold effectively negative values too (in particular, when waiters in G2 cancel waiting).
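Put together, the shape of the structure is approximately this (a simplified rendition of the field list above, not the literal glibc definition, which also contains unions for targets with only 32b atomics plus alignment attributes):

#include <stdint.h>

struct pthread_cond_sketch {
    uint64_t __wseq;          // waiter sequence; LSB = index of current G2
    uint64_t __g1_start;      // start of G1; LSB = index of current G2
    uint32_t __g_refs[2];     // futex-waiter refcount per group slot
    uint32_t __g_size[2];     // unsignaled waiters per group slot
    uint32_t __g1_orig_size;  // initial size of G1; 2 LSBs = internal lock
    uint32_t __wrefs;         // waiter refcount; 3 LSBs = flags
    uint32_t __g_signals[2];  // consumable signals per slot; LSB = closed
};

Note that __g_signals[1] comes last, which matters for the initialization compatibility discussed below.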
A PTHREAD_COND_INITIALIZER condvar has all fields set to zero, which yields a condvar that has G2 starting at position 0 and a G1 that is closed.
Because waiters do not claim ownership of a group right when obtaining a position in __wseq but only reference count the group when using futexes to block, it can happen that a group gets closed before a waiter can increment the reference count. Therefore, waiters have to check whether their group is already closed using __g1_start. They also have to perform this check when spinning while trying to grab a signal from __g_signals. Note that for these checks, using relaxed MO to load __g1_start is sufficient because if a waiter can see a sufficiently large value, it could also have consumed a signal in the waiter's group.
Waiters try to grab a signal from __g_signals without holding a reference count, which can lead to stealing a signal from a more recent group after their own group was already closed. They cannot always detect whether they in fact did, because they do not know when they stole, but they can conservatively add a signal back to the group they stole from; if they did so unnecessarily, all that happens is a spurious wake-up. To make this even less likely, __g1_start contains the index of the current G2 too, which allows waiters to check whether there is aliasing on the group slots; if there isn't, they didn't steal from the current G1, which means that the G1 they stole from must have already been closed and they do not need to fix anything.
It is essential that the last field in pthread_cond_t is __g_signals[1]: The previous condvar used a pointer-sized field in pthread_cond_t, so a PTHREAD_COND_INITIALIZER from that condvar implementation might only initialize 4 bytes to zero instead of the 8 bytes we need (i.e., 44 bytes in total instead of the 48 we need). __g_signals[1] is not accessed before the first group switch (G2 starts at index 0), which will set its value to zero after a harmless fetch-or whose return value is ignored. This effectively completes initialization.
Limitations:
- This condvar isn't designed to allow for more than __PTHREAD_COND_MAX_GROUP_SIZE * (1 << 31) calls to __pthread_cond_wait.
- More than __PTHREAD_COND_MAX_GROUP_SIZE concurrent waiters are not supported.
- Beyond what is allowed as errors by POSIX or documented, we can also return the following errors:
  - EPERM if MUTEX is a recursive mutex and the caller doesn't own it.
  - EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes. Unlike for other errors, this can happen when we re-acquire the mutex; this isn't allowed by POSIX (which requires all errors to virtually happen before we release the mutex or change the condvar state), but there's nothing we can really do.
- When using PTHREAD_MUTEX_PP_* mutexes, we can also return all errors returned by __pthread_tpp_change_priority. We will already have released the mutex in such cases, so the caller cannot expect to own MUTEX.
Other notes:
- Instead of the normal mutex unlock / lock functions, we use __pthread_mutex_unlock_usercnt(m, 0) / __pthread_mutex_cond_lock(m) because those will not change the mutex-internal users count, so that it can be detected when a condvar is still associated with a particular mutex because there is a waiter blocked on this condvar using this mutex.
From that documentation we learn that you can call pthread_cond_signal() and pthread_cond_broadcast() from anywhere. If you call these functions from outside the locked region, then you don't have a very strong guarantee, other than:
- pthread_cond_signal() will wake up at least one thread, but if two calls arrive simultaneously, the same thread could be picked both times.
- pthread_cond_broadcast() will wake up all the waiting threads, no matter what.
However, if you are using a mutex and calling pthread_cond_signal() from within the locked region, then it will wake up one thread per call. (Note, however, that all of your pthread_cond_signal() calls should then be protected this way.)
So the following code would be considered safe:
pthread_mutex_lock(mutex);
...
pthread_cond_signal(cond1); // takes no mutex reference; these calls could
pthread_cond_signal(cond1); // also happen while the mutex is not locked...
pthread_mutex_unlock(mutex);
And the wait, also under the lock:
pthread_mutex_lock(mutex);
...
pthread_cond_wait(cond1, mutex);
...
pthread_mutex_unlock(mutex);
Since both the signal and the wait run while the mutex is locked, they are internally processed sequentially, and thus everything works exactly as expected.
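To see the safe pattern end to end, here is a minimal, self-contained program (my example, not from the glibc sources) in which the signal is sent under the mutex and the waiter re-checks a predicate, as spurious wake-ups require:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond1 = PTHREAD_COND_INITIALIZER;
static int ready = 0; // the predicate guarded by the mutex

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mutex);
    // Always wait in a loop: the wake-up may be spurious, so the
    // predicate, not the wake-up itself, carries the meaning.
    while (!ready)
        pthread_cond_wait(&cond1, &mutex);
    printf("waiter: predicate is true, proceeding\n");
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);

    pthread_mutex_lock(&mutex);
    ready = 1;                    // change the predicate...
    pthread_cond_signal(&cond1);  // ...and signal while holding the mutex
    pthread_mutex_unlock(&mutex);

    pthread_join(t, NULL);
    return 0;
}

Compile with -pthread; the waiter is woken reliably here because the predicate change and the signal both happen under the lock.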
There are limits, though probably nothing we can ever really reach in a normal app. For example, __PTHREAD_COND_MAX_GROUP_SIZE, which represents the maximum number of concurrent waiters, is a crazy large number:
#define __PTHREAD_COND_MAX_GROUP_SIZE ((unsigned) 1 << 29)
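That works out to 1 << 29 = 536,870,912 concurrent waiters, and the total-call limit quoted earlier to 2^29 * 2^31 = 2^60 (roughly 1.15 * 10^18) calls to __pthread_cond_wait() over a condvar's lifetime.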