I have a question about CUDA synchronizing. Particularly, I need some clarification about synchronizing in if statements. I mean, if I put a __syncthreads() under the scope of an if statement hit by a fraction of the threads inside the block, what happens? I thought that some threads will remain "forever" waiting for the other threads that won't hit the synchronizing point. So, I wrote and executed some sample code to inspect:
__global__ void kernel(float* vett, int n)
{
int index = blockIdx.x*blockDim.x + threadIdx.x;
int gridSize = blockDim.x*gridDim.x;
while( index < n )
{
vett[index] = 2;
if(threadIdx.x < 10)
{
vett[index] = 100;
__syncthreads();
}
__syncthreads();
index += gridSize;
}
}
Surprisingly enough, I observed that the output was a pretty "normal" (64 elements, blocksize 32):
100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
100 100 100 100 100 100 100 100 100 100 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
So I modified slightly my code in the following way:
__global__ void kernel(float* vett, int n)
{
int index = blockIdx.x*blockDim.x + threadIdx.x;
int gridSize = blockDim.x*gridDim.x;
while( index < n )
{
vett[index] = 2;
if(threadIdx.x < 10)
{
vett[index] = 100;
__syncthreads();
}
__syncthreads();
vett[index] = 3;
__syncthreads();
index += gridSize;
}
}
And the output was:
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Again, I was wrong: I thought that the threads inside the if statement, after modifying the element of the vector, would remain in a wait state and never get out of the if scope. So... could you please clarify what happened? Does a thread that gets after a synchronizing point unblock the threads waiting at the barrier? If you need to reproduce my situation, I used CUDA Toolkit 5.0 RC with SDK 4.2. Thanks a lot in advance.