What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];
This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added on as offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, then you lose coalescing.
The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.
Additional examples:
Not coalesced:
shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];
Not coalesced, but not too bad on GT200 and later:
stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Not coalesced at all:
stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Coalesced, 2D grid, 1D block:
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch +
blockIdx.x * blockDim.x + threadIdx.x];
Coalesced, 2D grid and block:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];