Can I use __syncthreads() after having dropped threads?

Is it safe to use __syncthreads() in a block where I have purposefully dropped threads using return?

The documentation states that __syncthreads() must be called by every thread in the block or else it will lead to a deadlock, but in practice I have never experienced such behavior.

Sample code:

__global__ void kernel(float* data, size_t size) {
    // Drop excess threads if user put too many in kernel call.
    // After the return, there are `size` active threads.
    if (threadIdx.x >= size) {
        return;
    }

    // ... do some work ...

    __syncthreads(); // Is this safe?

    // For the rest of the kernel, we need to drop one excess thread
    // After the return, there are `size - 1` active threads
    if (threadIdx.x + 1 == size) {
        return;
    }

    // ... do more work ...

    __syncthreads(); // Is this safe?
}
Adhibit answered 12/7, 2011 at 15:4 Comment(0)

The answer to the short question is "No". Warp-level branch divergence around a __syncthreads() instruction will cause a deadlock and hang the kernel. Your code example is not guaranteed to be safe or correct. The correct way to structure the code would be like this:

__global__ void kernel(...)
{
    if (tidx < N) {
        // Code stanza #1
    }

    __syncthreads();

    if (tidx < N) {
        // Code stanza #2
    }

    // etc.
}

so that the __syncthreads() instructions are executed unconditionally.
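
Applied to the kernel from the question, a minimal sketch of that restructuring might look like this (the work itself is a placeholder; only the barrier placement matters):

__global__ void kernel(float* data, size_t size) {
    // Guard the work, not the barrier: every thread in the block
    // executes both __syncthreads() calls unconditionally.
    if (threadIdx.x < size) {
        // ... do some work ...
    }

    __syncthreads(); // reached by all threads in the block

    if (threadIdx.x + 1 < size) {
        // ... do more work ...
    }

    __syncthreads(); // reached by all threads in the block
}

This repeats the guard conditions instead of returning early, but every thread in the block now reaches each barrier.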


EDIT: To add a bit of additional information confirming this assertion: __syncthreads() calls are compiled into the PTX bar.sync instruction on all architectures. The PTX 2.0 guide (p. 133) documents bar.sync and includes the following warning:

Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.

So despite any assertions to the contrary, it is not safe to have conditional branching around a __syncthreads() call unless you can be 100% certain that every thread in any given warp follows the same code path and no warp divergence can occur.

Psychodiagnostics answered 12/7, 2011 at 15:45 Comment(5)
Interesting: even the thread count must be a multiple of the warp size. It makes sense of course, but it is not obvious at first.Carillo
@fabrizioM: The thread count they are talking about is only an optional argument to the bar instruction, and it only exists in PTX 2.0 and newer. I don't believe that the compiler currently generates code which specifies the thread count, and I am not even sure that the assembler will honor the argument and do anything with it anyway.Psychodiagnostics
Thank you for the answer. It's a bit annoying because it forces me to use multiple nested ifs. But I guess that's better than a kernel hang. (Funnily enough, it has never hung so far with return.)Adhibit
@Cicada: to avoid nested-ifs, you could repeat the first test: if (threadIdx.x < input.size && threadIdx.x + 1 != input.size)Zoophyte
This answer is wrong, at least as of Volta, please see my answer.Dialytic

Compute Capability 7.x (Volta) update:

With the introduction of Independent Thread Scheduling among the threads in a warp, CUDA has finally become stricter in practice and now matches the documented behavior. From the Programming Guide:

Although __syncthreads() has been consistently documented as synchronizing all threads in the thread block, Pascal and prior architectures could only enforce synchronization at the warp level. In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier. Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block. Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.

Below is the previous answer, which rambled about pre-Volta behavior.


Update: This answer may not add anything on top of talonmies' (depending on your understanding of the subject, I suppose), but at the risk of being too verbose I'm presenting the information that helped me understand this better. Also, if you are not interested in how things might work "under the hood" or what might be possible beyond the official documentation, there's nothing to see here. That all said, I still don't recommend making assumptions beyond what is officially documented, especially in an environment that hopes to support multiple or future architectures. I primarily wanted to point out that while this is explicitly called out as bad practice by the CUDA Programming Guide, the actual behavior of __syncthreads() may be somewhat different from how it is described and to me that is interesting. The last thing I want is to spread misinformation, so I'm open to discussion and revising my answer!


A few important facts

There is no TL;DR for this answer as there is too much potential for misinterpretation, but here are some relevant facts to start:

  • __syncthreads() behaves like a barrier for warps in a block rather than all of the threads in a block, although when used as advised it amounts to the same thing.
  • If any thread in a warp executes a PTX bar instruction (e.g. from __syncthreads()), it is as if all the threads in the warp have.
  • When a bar.sync is executed (as generated by the intrinsic __syncthreads()), the arrival count for that block and barrier is incremented by the warp size. This is how the previous points are achieved.
  • Thread divergence (multiple paths) is handled by serializing the execution of the branches. The order of serialization is a factor that can cause trouble.
  • The threads within a warp are not synchronized by __syncthreads(). The instruction will not cause the warp to stall and wait for the threads on divergent paths. Branch execution is serialized, so only when the branches rejoin or the code terminates do the threads in the warp resynchronize. Until then, the branches run in sequence and independently. Again, only one thread in each warp of the block needs to hit __syncthreads() for execution to continue.

These statements are supported by official documentation and other sources.

Interpretation and documentation

Since __syncthreads() acts as a barrier for warps in a block rather than all of the threads in a block, as it is described in the Programming Guide, it seems that a simple early exit would be fine if at least one thread in each warp hits the barrier. (But that is not to say you can't cause deadlocks with the intrinsic!) This also supposes that __syncthreads() will always generate a simple bar.sync a; PTX instruction and that the semantics of that will not change either, so don't do this in production.

One interesting study that I came across actually investigates what happens when you go against the recommendations of the CUDA Programming Guide, and they found that while it is indeed possible to cause a deadlock by abusing __syncthreads() in conditional blocks, not all use of the intrinsic in conditional code will do so. From Section D.1 in the paper:

The Programming Guide recommends that syncthreads() be used in conditional code only if the condition evaluates identically across the entire thread block. The rest of this section investigates the behavior of syncthreads() when this recommendation is violated. We demonstrate that syncthreads() operates as a barrier for warps, not threads. We show that when threads of a warp are serialized due to branch divergence, any syncthreads() on one path does not wait for threads from the other path, but only waits for other warps running within the same thread block.

This statement is concordant with the bit of the PTX documentation quoted by talonmies. Specifically:

Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp).

It is clear from this why the optional thread count b in the bar.sync a{, b}; instruction must be a multiple of warp size -- whenever a single thread in a warp executes a bar instruction the arrival count is incremented by the warp size, not the number of threads in the warp that actually hit the barrier. Threads that terminate early (followed a different path) were effectively counted as arrived anyway. Now, the next sentence in the quoted passage does then say not to use __syncthreads() in conditional code unless "it is known that all threads evaluate the condition identically (the warp does not diverge)." This seems to be an overly strict recommendation (for current architecture), meant to ensure that the arrival count actually reflects the real number of threads that hit the barrier. If at least one thread hitting the barrier increments the arrival count for the entire warp, you might really have a little more flexibility.

There is no ambiguity in the PTX documentation that the bar.sync a; instruction generated by __syncthreads() waits for all threads in the current cooperative thread array (block) to reach barrier a. However, the point is how "all threads" is presently determined: the arrival count is incremented in multiples of the warp size whenever the barrier is hit (by default, when b is not specified). This part is not undefined behavior, at least not as of Parallel Thread Execution ISA Version 4.2.

Keep in mind that there may be inactive threads in a warp even without a conditional -- "the last threads of a block whose number of threads is not a multiple of the warp size." (SIMT architecture notes). Yet __syncthreads() is not forbidden in such blocks.
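
For example (a trivial sketch; the kernel name and launch configuration are just an illustration), a block of 48 threads leaves half of the second warp inactive, yet the barrier is perfectly legal:

// Hypothetical example: blockDim.x == 48, so warp 1 has only 16 active threads.
__global__ void partial_warp_kernel(float* data) {
    data[threadIdx.x] *= 2.0f;
    __syncthreads();             // fine, even though warp 1 is only half populated
    data[threadIdx.x] += 1.0f;
}

// launched e.g. as: partial_warp_kernel<<<1, 48>>>(d_data);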

Examples

Early exit version 1:

__global__ void kernel(...)
{
    if (tidx >= N)
        return;      // OK for <32 threads to hit this, but if ALL
                     // threads in a warp hit this, THEN you are deadlocked
                     // (assuming there are other warps that sync)

    __syncthreads(); // If at least one thread on this path reaches this, the
                     // arrival count for this barrier is incremented by
                     // the number of threads in a warp, NOT the number of
                     // threads that reach this in the current warp.
}

This will not deadlock if at least one thread per warp hits the sync, but a possible issue is the order in which the divergent code paths are serialized. You can rearrange the kernel above to effectively swap the branches.

Early exit version 2:

__global__ void kernel(...)
{
    if (tidx < N) {
        // do stuff

        __syncthreads();
    }
    // else return;
}

Still no deadlock if at least one thread in each warp hits the barrier, but is the order of branch execution important in this case? I don't think so, but it's probably a bad idea to require a particular execution order.

The paper demonstrates this with a more involved example than a trivial early exit, one that also reminds us to be cautious about warp divergence. Here the first half of the warp (thread id tid in [0,15]) writes to some shared memory and executes __syncthreads(), while the other half (thread id tid in [16,31]) also executes __syncthreads() but then reads from the shared memory locations written by the first half of the warp. Ignoring the shared memory test at first, you might expect a deadlock at either barrier.

// incorrect code to demonstrate behavior of __syncthreads
if (tid < 16) {
  shared_array[tid] = tid;
  __syncthreads();
}
else {
  __syncthreads();
  output[tid] = shared_array[tid % 16];
}

There is no deadlock, indicating that __syncthreads() does not synchronize diverged threads within a warp. Divergent code paths are serialized in a warp and it only takes one thread in a code path to make the call to __syncthreads() work at the per-warp level.

However, the shared memory part shows where some unpredictable behavior can enter into this. The second half of the warp does not get the updated values from the first half, because branch divergence serialized execution of the warp and the else block was executed first. So the function doesn't work as intended, but it also shows that __syncthreads() does not synchronize divergent threads in a warp.
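
A corrected sketch (keeping the variable names from the paper's snippet) hoists the barrier out of the conditional, so the whole block executes it and the writes are visible to the readers:

// corrected: the barrier is executed unconditionally by every thread
if (tid < 16) {
  shared_array[tid] = tid;
}
__syncthreads();                  // no divergence around the barrier
if (tid >= 16) {
  output[tid] = shared_array[tid % 16];
}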

Summary

__syncthreads() does not wait for all threads in a warp, and the arrival of a single thread in a warp effectively counts the entire warp as having reached the barrier (on present architectures).

It can be dangerous to use __syncthreads() in conditional code because of how divergent thread execution is serialized.

Use the intrinsic in conditional code only if you understand how it works and how branch divergence (which occurs within a warp) is handled.

Note that I didn't say to go ahead and use __syncthreads() in a way inconsistent with how it is documented.

Calcar answered 21/5, 2015 at 19:14 Comment(7)
I'm not sure this answer adds anything more than a lot of words to @Psychodiagnostics answer. The fact is that the supported semantics of __syncthreads() are those documented in the CUDA programming guide. Assuming other semantics should be done at your own risk, as future hardware may change the underlying behavior while still satisfying the documented __syncthreads() semantics.Salesman
@Salesman I'm sorry to hear that, especially from you, since your answer was one of the only sources I could find that suggested it might be safe to do a simple early exit. I only sought to dig up some sources that might explain why and how. To some I suppose quoting the one passage from the PTX documentation and referring to the CUDA programming guide's official recommendation is enough... OK. Anyway, you say aspects of hardware might change; do you mean the PTX specification? Because I'm actually saying not to plan for any certain method for serialized execution of divergent code.Calcar
@Salesman BTW, I removed the mention of your name from the post in case it implicitly put your support behind my statements. I also noted that expecting a __syncthreads() to always generate a certain bar.sync instruction is not advisable. Many other edits. If you really think this contributes nothing to the community, I'll delete it. Perhaps it only helped me. Honest thanks for your feedback.Calcar
As a representative of the NVIDIA parallel computing platform, I have a responsibility to provide safe guidance to users. That's all I'm saying. I don't want to belittle all the work you put into this.Salesman
@Salesman I appreciate your position. In fact, it is because of your statements here that I bothered to dig beyond the Programming Guide. Regarding my work on this post, I hadn't even thought of it that way. After posting over 700 answers, mostly in tags where I actually have some expertise, I know better than to expect a unanimous positive response. In this case, the idea was just to share what I had learned about a subject I found interesting in the hope that someone else might too.Calcar
I think it's a valuable answer. However, taking it into account, it seems that the reduction operation example given here sharcnet.ca/help/index.php/CUDA_tips_and_tricks is wrong, because it uses __syncthreads() inside a branch (the while loop body is a kind of branch). How can we do a reduction, then, if we can't use __syncthreads() inside branches?Omniscience
@SergeRogatch Do reduction as explained here: developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/…Simp

In short: yes, it's safe.

The accepted answer may well have been correct when it was written, but at least since Volta it is wrong. The CUDA docs make clear that __syncthreads() must be reached by all non-exited threads, which means a thread can exit early without causing a deadlock.
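
Under that rule, an early return such as the one in the question is fine on Volta and later, because a thread that has exited no longer counts toward the barrier. A minimal sketch (the work is a placeholder, following the question's kernel):

__global__ void kernel(float* data, size_t size) {
    if (threadIdx.x >= size) {
        return;          // exited threads are excluded from the barrier
    }

    // ... do some work ...

    __syncthreads();     // waits only for the non-exited threads (Volta and later)

    // ... do more work ...
}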

Dialytic answered 7/6, 2022 at 20:27 Comment(2)
Why is this being downvoted? Given the wording in the CUDA docs ("[__syncthreads] will not succeed until reached by all non-exited threads in the block.") this appears to be the correct answer.Factitious
Thanks. I had just assumed people can’t read 🤷Dialytic
