Compute Capability 7.x (Volta) update:
With the introduction of Independent Thread Scheduling among threads in a warp, CUDA is finally stricter in practice and now matches the documented behavior. From the Programming Guide:
Although __syncthreads() has been consistently documented as synchronizing all threads in the thread block, Pascal and prior architectures could only enforce synchronization at the warp level. In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier. Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block. Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.
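One conventional way to stay within the documented contract on any architecture is to keep every thread alive up to the barrier and guard only the per-thread work. The following is a minimal sketch of that shape, not code from the Programming Guide; the kernel `add_right_neighbour`, the host helper `run_add_right_neighbour`, and the zero padding for out-of-range threads are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = in[i] + the value held by the next thread in
// the same block (wrapping around). Out-of-range threads contribute 0.
__global__ void add_right_neighbour(const int* in, int* out, int n)
{
    extern __shared__ int tile[];
    const int t = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + t;
    const bool in_range = (i < n);

    // No early return: out-of-range threads write padding instead.
    tile[t] = in_range ? in[i] : 0;

    // Executed unconditionally, so every thread of the block reaches it.
    __syncthreads();

    // Only the store is guarded.
    if (in_range)
        out[i] = tile[t] + tile[(t + 1) % blockDim.x];
}

// Hypothetical host wrapper (return codes are not checked in this sketch).
std::vector<int> run_add_right_neighbour(const std::vector<int>& in, int block)
{
    assert(block > 0);
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n, 0);
    if (n == 0) return out;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int grid = (n + block - 1) / block;
    add_right_neighbour<<<grid, block, block * sizeof(int)>>>(d_in, d_out, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `run_add_right_neighbour({1, 2, 3, 4, 5}, 4)` launches two blocks, the second of which has three out-of-range threads that still reach the barrier.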
Below is the previous answer, which rambled about pre-Volta behavior.
Update: This answer may not add anything on top of talonmies' (depending on your understanding of the subject, I suppose), but at the risk of being too verbose I'm presenting the information that helped me understand this better. Also, if you are not interested in how things might work "under the hood" or what might be possible beyond the official documentation, there's nothing to see here. That all said, I still don't recommend making assumptions beyond what is officially documented, especially in an environment that hopes to support multiple or future architectures. I primarily wanted to point out that while this is explicitly called out as bad practice by the CUDA Programming Guide, the actual behavior of __syncthreads() may be somewhat different from how it is described, and to me that is interesting. The last thing I want is to spread misinformation, so I'm open to discussion and revising my answer!
A few important facts
There is no TL;DR for this answer as there is too much potential for misinterpretation, but here are some relevant facts to start:
- __syncthreads() behaves like a barrier for warps in a block rather than for all of the threads in a block, although when used as advised it amounts to the same thing.
- If any thread in a warp executes a PTX bar instruction (e.g. from __syncthreads()), it is as if all the threads in the warp have.
- When a bar.sync is executed (as generated by the intrinsic __syncthreads()), the arrival count for that block and barrier is incremented by the warp size. This is how the previous points are achieved.
- Thread divergence (multiple paths) is handled by serializing the execution of the branches. The order of serialization is a factor that can cause trouble.
- The threads within a warp are not synchronized by __syncthreads(). The instruction will not cause the warp to stall and wait for the threads on divergent paths. Branch execution is serialized, so only when the branches rejoin or the code terminates do the threads in the warp resynchronize. Until then, the branches run in sequence and independently. Again, only one thread in each warp of the block needs to hit __syncthreads() for execution to continue.
These statements are supported by official documentation and other sources.
Interpretation and documentation
Since __syncthreads() acts as a barrier for warps in a block rather than for all of the threads in a block (which is how the Programming Guide describes it), it seems that a simple early exit would be fine as long as at least one thread in each warp hits the barrier. (But that is not to say you can't cause deadlocks with the intrinsic!) This also supposes that __syncthreads() will always generate a simple bar.sync a; PTX instruction and that the semantics of that instruction will not change either, so don't do this in production.
One interesting study that I came across actually investigates what happens when you go against the recommendations of the CUDA Programming Guide, and they found that while it is indeed possible to cause a deadlock by abusing __syncthreads() in conditional blocks, not all use of the intrinsic in conditional code will do so. From Section D.1 in the paper:
The Programming Guide recommends that syncthreads() be used in conditional code only if the condition evaluates identically across the entire thread block. The rest of this section investigates the behavior of syncthreads() when this recommendation is violated. We demonstrate that syncthreads() operates as a barrier for warps, not threads. We show that when threads of a warp are serialized due to branch divergence, any syncthreads() on one path does not wait for threads from the other path, but only waits for other warps running within the same thread block.
This statement is concordant with the bit of the PTX documentation quoted by talonmies. Specifically:
Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp).
It is clear from this why the optional thread count b in the bar.sync a{, b}; instruction must be a multiple of the warp size: whenever a single thread in a warp executes a bar instruction, the arrival count is incremented by the warp size, not by the number of threads in the warp that actually hit the barrier. Threads that terminate early (having followed a different path) are effectively counted as arrived anyway. Now, the next sentence in the quoted passage does say not to use __syncthreads() in conditional code unless "it is known that all threads evaluate the condition identically (the warp does not diverge)." This seems to be an overly strict recommendation (for the current architecture), meant to ensure that the arrival count actually reflects the real number of threads that hit the barrier. If a single thread hitting the barrier increments the arrival count for the entire warp, you might really have a little more flexibility.
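To make the optional thread count concrete, here is a minimal sketch that issues bar.sync a, b; through inline PTX so that only the first two warps of a block synchronize with each other. It assumes a warp size of 32 and a launch with 128 threads per block; the names `bar_sync`, `swap_warp_halves`, and `run_swap_warp_halves` are hypothetical, barrier 1 is chosen only to stay clear of barrier 0 (which __syncthreads() uses), and the branch condition is deliberately warp-uniform so the sketch does not rely on any of the divergent behavior discussed here.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical wrapper around the PTX instruction "bar.sync a, b;".
// thread_count must be a multiple of the warp size.
__device__ __forceinline__ void bar_sync(unsigned barrier, unsigned thread_count)
{
    asm volatile("bar.sync %0, %1;" : : "r"(barrier), "r"(thread_count) : "memory");
}

// Launch with 128 threads per block (assumes warpSize == 32).
__global__ void swap_warp_halves(int* out)
{
    __shared__ int buf[64];
    const int t = threadIdx.x;

    if (t < 64) {                 // warps 0 and 1 take this path in full
        buf[t] = t;
        bar_sync(1, 64);          // two whole warps: 2 * 32 arrivals expected
        out[t] = buf[(t + 32) % 64];   // read what the other warp wrote
    } else {                      // warps 2 and 3 never touch barrier 1
        out[t] = -1;
    }
}

// Hypothetical host wrapper (return codes are not checked in this sketch).
std::vector<int> run_swap_warp_halves()
{
    const int n = 128;
    std::vector<int> out(n, 0);
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    swap_warp_halves<<<1, n>>>(d_out);
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

The count passed to the barrier is 64 because two complete warps participate; a value such as 48 would not be a multiple of the warp size and is exactly the case the PTX documentation rules out.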
There is no ambiguity in the PTX documentation that the bar.sync a; instruction generated by __syncthreads() waits for all threads in the current cooperative thread array (block) to reach barrier a. However, the point is how "all threads" is presently determined: by incrementing the arrival count in multiples of the warp size whenever the barrier is hit (the default when b is not specified). This part is not undefined behavior, at least not with Parallel Thread Execution ISA Version 4.2.
Keep in mind that there may be inactive threads in a warp even without a conditional -- "the last threads of a block whose number of threads is not a multiple of the warp size." (SIMT architecture notes). Yet __syncthreads() is not forbidden in such blocks.
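As a small illustration of that last point, the sketch below launches a single block whose size need not be a multiple of the warp size (48 threads, for instance, leaves the second warp half empty) and still uses __syncthreads() in the ordinary, unconditional way. The kernel `reverse_block` and the host helper `run_reverse_block` are hypothetical names, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: reverses data[0 .. blockDim.x) through shared memory.
__global__ void reverse_block(int* data)
{
    extern __shared__ int tile[];
    const int t = threadIdx.x;

    tile[t] = data[t];

    // Every launched thread reaches the barrier; the unlaunched lanes of the
    // last, partially filled warp simply do not exist as threads.
    __syncthreads();

    data[t] = tile[blockDim.x - 1 - t];
}

// Hypothetical host wrapper: fills data with 0, 1, ..., block - 1, reverses it
// on the device, and returns the result (return codes are not checked).
std::vector<int> run_reverse_block(int block)
{
    assert(block > 0);
    std::vector<int> data(block);
    for (int i = 0; i < block; ++i) data[i] = i;

    int* d_data = nullptr;
    cudaMalloc(&d_data, block * sizeof(int));
    cudaMemcpy(d_data, data.data(), block * sizeof(int), cudaMemcpyHostToDevice);

    reverse_block<<<1, block, block * sizeof(int)>>>(d_data);

    cudaMemcpy(data.data(), d_data, block * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return data;
}
```

For example, `run_reverse_block(48)` uses one full warp plus sixteen threads of a second warp.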
Examples
Early exit version 1:
__global__ void kernel(int N)
{
    // Global thread index (assumes a 1D launch).
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;

    if (tidx >= N)
        return;      // OK for <32 threads to hit this, but if ALL
                     // threads in a warp hit this, THEN you are deadlocked
                     // (assuming there are other warps that sync)

    __syncthreads(); // If at least one thread on this path reaches this, the
                     // arrival count for this barrier is incremented by
                     // the number of threads in a warp, NOT the number of
                     // threads that reach this in the current warp.
}
This will not deadlock if at least one thread per warp hits the sync, but a possible issue is the order in which the divergent code paths are serialized. You can rearrange the above kernel to effectively swap the branches.
Early exit version 2:
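__global__ void kernel(int N)
{
    // Global thread index (assumes a 1D launch).
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;

    if (tidx < N) {
        // do stuff
        __syncthreads();
    }
    // else return;
}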
Still no deadlock if you have at least one thread in the warp hit the barrier, but is the order of branch execution important in this case? I don't think so, but it's probably a bad idea to require a particular execution order.
The paper demonstrates this with an example that is more involved than a trivial early exit, and that also reminds us to be cautious around warp divergence. Here the first half of the warp (thread id tid in [0,15]) writes to some shared memory and executes __syncthreads(), while the other half (thread id tid in [16,31]) also executes __syncthreads() but then reads from the shared memory locations written by the first half of the warp. Ignoring the shared memory test at first, you might expect a deadlock at either barrier.
// incorrect code to demonstrate behavior of __syncthreads
if (tid < 16) {
    shared_array[tid] = tid;
    __syncthreads();
}
else {
    __syncthreads();
    output[tid] = shared_array[tid % 16];
}
There is no deadlock, indicating that __syncthreads() does not synchronize diverged threads within a warp. Divergent code paths are serialized in a warp and it only takes one thread in a code path to make the call to __syncthreads() work at the per-warp level.
However, the shared memory part shows where some unpredictable behavior can enter into this. The second half of the warp does not get the updated values from the first half, because branch divergence serialized execution of the warp and the else block was executed first. So the function doesn't work correctly, but it also shows that __syncthreads() does not synchronize divergent threads in a warp.
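For contrast, here is a sketch of how the same data exchange could be written so that it does not depend on branch ordering at all: the barrier is hoisted out of the divergent branches, so the write, the barrier, and the read happen in that order for every thread. This is my own rearrangement rather than code from the paper; the kernel name `half_warp_exchange`, the host helper `run_half_warp_exchange`, and the -1 sentinel for untouched output slots are assumptions made for illustration.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Launch with one block of 32 threads.
__global__ void half_warp_exchange(int* output)
{
    __shared__ int shared_array[16];
    const int tid = threadIdx.x;

    // First half of the warp produces the data.
    if (tid < 16)
        shared_array[tid] = tid;

    // A single barrier outside any branch: every thread executes it.
    __syncthreads();

    // Second half of the warp consumes the data after the barrier.
    if (tid >= 16)
        output[tid] = shared_array[tid % 16];
}

// Hypothetical host wrapper: output slots that the kernel does not write keep
// the sentinel value -1 (return codes are not checked in this sketch).
std::vector<int> run_half_warp_exchange()
{
    const int n = 32;
    std::vector<int> out(n, -1);
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_out, out.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    half_warp_exchange<<<1, n>>>(d_out);

    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

With this arrangement the second half of the warp reads the values written by the first half regardless of which branch the hardware would have scheduled first.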
Summary
__syncthreads() does not wait for all threads in a warp, and the arrival of a single thread in a warp effectively counts the entire warp as having reached the barrier. (This describes the pre-Volta architectures that were current when this was written.)
It can be dangerous to use __syncthreads() in conditional code because of how divergent thread execution is serialized.
Use the intrinsic in conditional code only if you understand how it works and how branch divergence (which occurs within a warp) is handled.
Note that I didn't say to go ahead and use __syncthreads() in a way inconsistent with how it is documented.