GPU: Quadro RTX 4000
CUDA: 11.7
I have implemented a global lock using atomic operations, like this:
__global__ void floatAddLockExch(float* addr, volatile int* lock) {
    bool goon = true;
    while (goon) {
        if (atomicCAS((int*)lock, 0, 1) == 0) {
            *addr += 1; // critical task
            int lockValue = atomicExch((int*)lock, 0);
            if (lockValue != 1) {
                printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
            }
            //
            // *lock = 0;
            // __threadfence();
            goon = false;
        }
    }
}
I tested this kernel with <<<1, 1024>>>: the output (the value of *addr) was 1024. Tested with <<<2, 1024>>>, the output was 1025. There was no "Error..." output in either case.
Following the approach in Cuda atomics change flag, I implemented the kernel as:
__global__ void floatAddLockFence(float* addr, volatile int* lock) {
    bool goon = true;
    while (goon) {
        if (atomicCAS((int*)lock, 0, 1) == 0) {
            *addr += 1; // critical task
            // int lockValue = atomicExch((int*)lock, 0);
            // if (lockValue != 1) {
            //     printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
            // }
            *lock = 0;
            __threadfence();
            goon = false;
        }
    }
}
The output was 1024 in the <<<1, 1024>>> case and 2048 in the <<<2, 1024>>> case.
Atomic operations on the global variable lock are supposed to be atomic across all blocks, so why did floatAddLockExch fail in the multi-block case? And how did __threadfence() solve the problem?
volatile float* addr worked for floatAddLockExch (see the sketch below for what I mean). Thank you. – Bog
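For clarity, the volatile variant mentioned in the comment above is just floatAddLockExch with addr qualified volatile; a sketch, assuming nothing else changes (the name floatAddLockExchVolatile is only illustrative):

    __global__ void floatAddLockExchVolatile(volatile float* addr, volatile int* lock) {
        bool goon = true;
        while (goon) {
            if (atomicCAS((int*)lock, 0, 1) == 0) {
                *addr += 1; // critical task, now through a volatile pointer
                int lockValue = atomicExch((int*)lock, 0);
                if (lockValue != 1) {
                    printf("Error in <%d, %d> \n", blockIdx.x, threadIdx.x);
                }
                goon = false;
            }
        }
    }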