Upload data in shared memory for convolution kernel

#define Mask_width 5 #define Mask_radius Mask_width/2 #define TILE_WIDTH 16 #define w (TILE_WIDTH + Mask_width - 1) #define clamp(x) (min(max((x), 0.0), 1.0)) __global__ void convolution(float *I, const float* __restrict__ M, float *P, int channels, int width, int height) { __shared__ float N_ds[w][w]; int k; for (k = 0; k < channels; k++) { // First batch loading int dest = threadIdx.y * TILE_WIDTH + threadIdx.x, destY = dest / w, destX = dest % w, srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius, srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius, src = (srcY * width + srcX) * channels + k; if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) N_ds[destY][destX] = I[src]; else N_ds[destY][destX] = 0; // Second batch loading dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH; destY = dest / w, destX = dest % w; srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius; srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius; src = (srcY * width + srcX) * channels + k; if (destY < w) { if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width) N_ds[destY][destX] = I[src]; else N_ds[destY][destX] = 0; } __syncthreads(); float accum = 0; int y, x; for (y = 0; y < Mask_width; y++) for (x = 0; x < Mask_width; x++) accum += N_ds[threadIdx.y + y][threadIdx.x + x] * M[y * Mask_width + x]; y = blockIdx.y * TILE_WIDTH + threadIdx.y; x = blockIdx.x * TILE_WIDTH + threadIdx.x; if (y < height && x < width) P[(y * width + x) * channels + k] = clamp(accum); __syncthreads(); } }

Your question is similar in concept to my first question on StackOverflow: Moving a (BS_X+1)(BS_Y+1) global memory matrix by BS_XBS_Y threads.

You are facing the following problem: each thread block of size TILE_WIDTHxTILE_WIDTH should fill a shared memory area of size (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1).

4) Generally, what is the intuitive explanation of having two loadings?

Since the shared memory area (TILE_WIDTH + Mask_width - 1)x(TILE_WIDTH + Mask_width - 1) is larger than the block size TILE_WIDTHxTILE_WIDTH and assuming it is smaller than 2xTILE_WIDTHxTILE_WIDTH, then each thread should move at most two elements from global memory to shared memory. This is the reason why you have a two-stages loading.

1) The destY and destX index. Dividing the output index by the input tile width what does it means?

This concerns the first load stage which is appointed to load TILE_WIDTHxTILE_WIDTH elements from global memory and fills the uppermost part of the shared memory area.

So, the operation

dest = threadIdx.y * TILE_WIDTH + threadIdx.x;

flattens the 2D coordinates of the generic thread while

destX = dest % w;
destY = dest / w;

makes the inverse operation, in that it calculates the 2D coordinates of the generic thread with respect to the shared memory area.

2) The srcY add srcX index. Why destY and destX index take part in srcY add srcX index?

srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;

srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

(blockIdx.x * TILE_WIDTH, blockIdx.y * TILE_WIDTH) would be the coordinates of the global memory location if the block size and the shared memory size were the same. Since you are "borrowing" memory values also from neighboor tiles, then you have to shift the above coordinates by (destX - Mask_radius, destY - Mask_radius).

3) Why in the second loading we use the offset TILE_WIDTH * TILE_WIDTH?

You have this offset because in the first memory stage you have already filled the "first" TILE_WIDTHxTILE_WIDTH locations of the shared memory.

EDIT

The picture below illustrates the correspondence between the flattened thread index dest and the shared memory locations. In the picture, the blue boxes represent the elements of the generic tile while the red boxes the elements of the neighboor tiles. The union of the blue and red boxes correspond to the overall shared memory locations. As you can see, all the 256 threads of a thread block are involved in filling the upper part of the shared memory above the green line, while only 145 are involved in filling the lower part of the shared memory below the green line. Now you should also understand the TILE_WIDTH x TILE_WIDTH offset.

Please, notice that you have at most 2 memory loads per thread due to the particular choice of your parameters. For example, if you have TILE_WIDTH = 8, then the number of threads in a thread block is 64, while the shared memory size is 12x12=144, which means that each thread is in charge to perform at least 2 shared memory writes since 144/64=2.25.

enter image description here

Recommended topics

Hot tags