CUDA: __syncthreads() inside if statements

I have a question about CUDA synchronization. Specifically, I need some clarification about synchronization in if statements: if I put a __syncthreads() inside the scope of an if statement that only a fraction of the threads in the block hit, what happens? I thought those threads would remain "forever" waiting for the other threads that never reach the synchronization point. So I wrote and ran some sample code to investigate:

__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;

    while( index < n )
    {
        vett[index] = 2;
        if(threadIdx.x < 10)       // only the first 10 threads of each block enter
        {
            vett[index] = 100;
            __syncthreads();       // barrier inside the divergent branch
        }
        __syncthreads();           // barrier reached by all threads

        index += gridSize;
    }
}

Surprisingly enough, I observed that the output was pretty "normal" (64 elements, block size 32):

100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

So I slightly modified my code in the following way:

__global__ void kernel(float* vett, int n)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    int gridSize = blockDim.x*gridDim.x;

    while( index < n )
    {
        vett[index] = 2;
        if(threadIdx.x < 10)       // divergent branch, as before
        {
            vett[index] = 100;
            __syncthreads();
        }
        __syncthreads();
        vett[index] = 3;           // extra write after the barriers
        __syncthreads();

        index += gridSize;
    }
}

And the output was:

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 

Again, I was wrong: I thought that the threads inside the if statement, after modifying the element of the vector, would remain in a wait state and never get out of the if scope. So... could you please clarify what happened? Does a thread that gets past a synchronization point unblock the threads waiting at the barrier? If you need to reproduce my situation, I used CUDA Toolkit 5.0 RC with SDK 4.2. Thanks a lot in advance.

Subjoin answered 20/9, 2012 at 19:27 Comment(1)
Give checkmarks (accepted answers) to people who answer your question. – Swirl

In short, the behavior is undefined. So it may sometimes do what you want, or it may not, or (quite likely) will just hang or crash your kernel.

If you are really curious about how things work internally, you need to remember that threads do not execute independently but as a warp (a group of 32 threads) at a time.

This of course creates a problem with conditional branches where the condition does not evaluate uniformly across the warp. The problem is solved by executing both paths, one after the other, each with the threads that are not supposed to execute that path disabled. IIRC on existing hardware the taken branch is executed first and then the not-taken path, but this behavior is undefined and thus not guaranteed.

This separate execution of paths continues up to a point that the compiler can determine is guaranteed to be reached by all threads of the two separate execution paths (the "reconvergence point" or "synchronization point"). When execution of the first code path reaches this point, it is stopped and the second code path is executed instead. When the second path reaches the synchronization point, all threads are enabled again and execution continues uniformly from there.
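
As a minimal sketch of this mechanism (the path ordering is illustrative, not guaranteed; a, b, and out are hypothetical buffers):

__global__ void diverge(int* out, const int* a, const int* b)
{
    int x;
    if (threadIdx.x % 2 == 0)
        x = a[threadIdx.x];      // path 1: executed with the odd lanes disabled
    else
        x = b[threadIdx.x];      // path 2: executed with the even lanes disabled
    out[threadIdx.x] = x + 1;    // reconvergence point: all lanes enabled again
}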

The situation gets more complicated if another conditional branch is encountered before the synchronization point. This problem is solved with a stack of paths that still need to be executed (luckily the growth of the stack is bounded, as we can have at most 32 different code paths for one warp).

Where the synchronization points are inserted is undefined and even varies slightly between architectures, so again there are no guarantees. The only (unofficial) comment you will get from Nvidia is that the compiler is pretty good at finding optimal synchronization points. However, there are often subtle issues that may move the optimal point further down than you might expect, particularly if threads exit early.

Now, to understand the behavior of the __syncthreads() directive (which translates into a bar.sync instruction in PTX), it is important to realize that this instruction is not executed per thread but for the whole warp at once (regardless of whether any threads are disabled), because only the warps of a block need to be synchronized with each other. The threads of a warp are already executing in sync, so further synchronization will either have no effect (if all threads are enabled) or lead to a deadlock when trying to sync the threads from different conditional code paths.

You can work your way from this description to how your particular piece of code behaves. But keep in mind that all this is undefined, there are no guarantees, and relying on a specific behavior may break your code at any time.

You may want to look at the PTX manual for some more details, particularly for the bar.sync instruction that __syncthreads() compiles to. Henry Wong's "Demystifying GPU Microarchitecture through Microbenchmarking" paper, referenced in the answer below, is also well worth reading. Even though it covers a now-outdated architecture and CUDA version, the sections about conditional branching and __syncthreads() appear to still be generally valid.
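
For illustration, here is the divergent part of your first kernel again, annotated with what plausibly happens under this model for your block size of 32 (a single warp). None of this is guaranteed; it is a reading of the undefined behavior, not a contract:

if(threadIdx.x < 10)
{
    vett[index] = 100;   // executed with lanes 10..31 disabled
    __syncthreads();     // bar.sync is issued by the warp as a whole, so the
                         // barrier is satisfied even though 22 lanes are disabled
}
__syncthreads();         // a second warp-wide bar.sync; with one warp per block
                         // it again completes immediately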

Gurgle answered 20/9, 2012 at 20:28 Comment(0)

The CUDA model is MIMD, but current NVIDIA GPUs implement __syncthreads() at warp granularity instead of thread granularity. This means it is the warps inside a thread block that are synchronized, not necessarily the individual threads. __syncthreads() waits for all warps of the thread block to hit the barrier or exit the program. Refer to Henry Wong's "Demystifying GPU Microarchitecture through Microbenchmarking" paper for further details.
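
As an illustration of the warp granularity (a sketch only: the CUDA model still calls this undefined, so do not rely on it), consider a condition that is uniform within each warp but divergent across warps:

__global__ void warp_uniform(float* v)
{
    int warp = threadIdx.x / warpSize;  // warpSize is 32 on current hardware
    if (warp == 0)
    {
        v[threadIdx.x] = 1.0f;
        __syncthreads();    // warp 0 arrives at the barrier here
    }
    else
    {
        v[threadIdx.x] = 2.0f;
        __syncthreads();    // the remaining warps arrive here; since the
                            // barrier counts arriving warps, the two
                            // divergent barriers happen to pair up
    }
}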

Kinata answered 20/9, 2012 at 20:27 Comment(1)
That paper is indeed a good reference. I had forgotten it also covers conditional branching. – Gurgle

You must not use __syncthreads() unless the statement is always reached by all threads within a thread block. From the programming guide (B.6):

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

Basically, your code is not a well-formed CUDA program.
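
A minimal sketch of the rule (flag is a hypothetical kernel argument, so its value is identical for every thread of every block):

__global__ void barriers(float* v, int flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (flag)                // uniform: evaluates identically across the block
    {
        v[i] *= 2.0f;
        __syncthreads();     // allowed: reached by all threads of the block, or by none
    }

    if (threadIdx.x < 10)    // not uniform: threads of the same block diverge
    {
        v[i] = 0.0f;
        // __syncthreads();  // NOT allowed here: likely hang or undefined behavior
    }
}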

Tingaling answered 20/9, 2012 at 20:7 Comment(3)
Of course it isn't! But I wrote it only for the purpose of inspecting its behaviour. – Subjoin
@biagiop1986: Well... you have a piece of library code and hardware that comes with documentation that says, "you must not do X". Now you're asking us, the public, what happens if you do do X - how are we supposed to know? Ask the vendor! Isn't it enough to know that the program will be ill-formed? – Tingaling
It depends... it's right to say that I should avoid code like that in my programs because it's ill-formed (and, I swear, I will!), but I was curious about the 'how'. Moreover, I have often found explanations of problems here that are much better than the vendor's. So I'll be back here, asking you instead of anyone else, for every coding problem I have in the future. Stack Overflow is the best! Thank you all, by the way. – Subjoin

__syncthreads() is used to synchronize the threads within a block: all threads in the block wait at the barrier until every one of them has reached it before proceeding.

Consider the case where some threads in a block enter the if statement and some do not. The threads that are waiting will be blocked, forever waiting.

Generally, it is not good style to put synchronization inside an if conditional. It is best to avoid it and to redesign your code if you have it. The purpose of synchronization is to make sure all threads proceed together, so why filter them out with an if statement in the first place? One such redesign is sketched below.
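
Applied to the question's kernel: filter the work with the conditional, but keep the barrier at a point every thread of the block reaches:

if (threadIdx.x < 10)
    vett[index] = 100;   // divergent work, but no barrier inside the branch

__syncthreads();         // block-uniform barrier, reached by all threads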

To add on: if synchronization is required across blocks, relaunching the kernel is required.
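
A minimal host-side sketch of that idea (phase1 and phase2 are hypothetical kernels): the boundary between two launches on the same stream acts as a grid-wide barrier, because phase2 cannot start until phase1 has finished.

phase1<<<numBlocks, blockSize>>>(d_vett, n);
phase2<<<numBlocks, blockSize>>>(d_vett, n);  // sees all of phase1's writes
cudaDeviceSynchronize();                      // optional: block the host too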

Crumley answered 21/9, 2012 at 7:28 Comment(0)

Better to avoid __syncthreads() inside an if conditional. You can often rewrite the code with a for loop and a __syncthreads() after the loop.
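
A minimal sketch of that suggestion (the loop bounds are illustrative): let each thread do a different amount of work inside the loop, and synchronize once afterwards, at a point every thread of the block reaches:

__global__ void kernel(float* vett, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        vett[i] = 100;   // threads may run different numbers of iterations

    __syncthreads();     // barrier after the loop, reached uniformly by the block
}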

Nefertiti answered 16/9, 2020 at 14:11 Comment(0)
