I've always worked with linear shared memory (load, store, access neighbors) but I've made a simple test in 2D to study bank conflicts which results have confused me.
The next code read data from one dimensional global memory array to shared memory and copy it back from shared memory to global memory.
__global__ void update(int* gIn, int* gOut, int w) {
// shared memory space
__shared__ int shData[16][16];
// map from threadIdx/BlockIdx to data position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// calculate the global id into the one dimensional array
int gid = x + y * w;
// load shared memory
shData[threadIdx.x][threadIdx.y] = gIn[gid];
// synchronize threads not really needed but keep it for convenience
__syncthreads();
// write data back to global memory
gOut[gid] = shData[threadIdx.x][threadIdx.y];
}
The visual profiler reported conflicts in shared memory. The next code avoid thouse conflicts (only show the differences)
// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];
// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];
This behavior has confused me because in Programming Massively Parallel Processors. A Hands-on approach we can read:
matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.
Is this related to shared memory arrangement? or with threads indexes? Maybe am I missing something?
The kernel configuration is as follow:
// kernel configuration
dim3 dimBlock = dim3 ( 16, 16, 1 );
dim3 dimGrid = dim3 ( 64, 64 );
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);
Thanks in advance.