PREAMBLE: Assume I use an NVIDIA GTX480 card in CUDA. The theoretical peak global memory bandwidth for this card is 177.4 GB/s: (384 * 2 * 1848e6) / 8 = 177.4e9 B/s = 177.4 GB/s
The 384 comes from the memory interface width in bits, the 2 from the DDR nature of the memory, and 1848 is the memory clock frequency in MHz (hence the factor of 1e6); the division by 8 converts bits to bytes.
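For reference, the same figure can be recomputed at run time from the device properties. This is just a sketch; the fields used are the standard cudaDeviceProp members:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is in kHz, memoryBusWidth in bits;
    // the factor 2 accounts for the DDR data rate.
    double peakGBs = 2.0 * prop.memoryClockRate * 1e3
                   * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Theoretical peak global memory bandwidth: %.1f GB/s\n", peakGBs);
    return 0;
}
```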
Something similar can be computed for the shared memory: 4 bytes per bank * 32 banks * 0.5 accesses per cycle (Fermi services one shared memory request every two clocks) * 1400 MHz * 15 SMs = 1,344 GB/s
The number above already factors in all 15 SMs. Thus, to reach this maximum shared memory bandwidth, all 15 SMs have to be reading shared memory at the same time.
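The same arithmetic in code, assuming prop.clockRate reports the Fermi shader clock that the shared memory pipeline runs off (that assumption is mine, not something I have verified):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // clockRate is in kHz. The constants encode the assumptions above:
    // 32 banks x 4 bytes per access, serviced at half rate
    // (one 128-byte shared memory access every two clocks).
    double sharedGBs = 32.0 * 4.0 * 0.5 * prop.clockRate * 1e3
                     * prop.multiProcessorCount / 1e9;
    printf("Aggregate shared memory bandwidth: %.0f GB/s\n", sharedGBs);
    return 0;
}
```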
MY QUESTION: To reach the maximum global memory bandwidth, does it suffice to have only one SM reading from global memory, or must all SMs read at the same time? More specifically: imagine I launch a kernel with a single block of 32 threads, so that the one and only warp resides on SM-0, and all the kernel does is read nonstop from global memory in a coalesced fashion. Will I reach 177.4 GB/s? Or do I need to launch at least 15 blocks of 32 threads each, so that the 15 warps on SM-0 through SM-14 read at the same time?
The immediate thing to do would be to run a benchmark and find out empirically, but I would also like to understand why the result comes out the way it does.
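For concreteness, here is a minimal sketch of the benchmark I have in mind; the kernel, buffer size, and launch configurations are my own illustrative choices:

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread streams through the input with a grid-wide stride, so
// consecutive threads in a warp always touch consecutive words
// (coalesced). The accumulation and conditional write keep the
// compiler from optimizing the loads away.
__global__ void readBench(const float *in, float *out, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        acc += in[i];
    if (acc == 12345.0f) out[0] = acc;  // never true for zeroed input
}

int main() {
    const size_t n = 1 << 26;  // 64M floats = 256 MB
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    readBench<<<15, 32>>>(in, out, n);  // one warp per SM
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("Effective bandwidth: %.1f GB/s\n",
           n * sizeof(float) / (ms * 1e-3) / 1e9);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Swapping <<<15, 32>>> for <<<1, 32>>> is exactly the comparison I am asking about.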