CUDA determining threads per block, blocks per grid
I'm new to the CUDA paradigm. My question is about how to determine the number of threads per block and blocks per grid. Does a bit of art and trial and error play into this? In many examples I've found, the numbers chosen for these values appear arbitrary.

I'm considering a problem where I would pass matrices of any size to a method for multiplication, so that each element of C (as in C = A * B) is calculated by a single thread. How would you determine the threads/block and blocks/grid in this case? A rough sketch of what I have in mind is below.
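
Here is a minimal sketch, with one thread per element of C and a 16x16 block size as an arbitrary first guess (dA, dB, and dC are device pointers I assume have already been allocated and filled):

    __global__ void matMul(const float* A, const float* B, float* C,
                           int M, int N, int K)   // A is MxK, B is KxN, C is MxN
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < M && col < N) {                 // guard: the grid may overhang the matrix
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)
                sum += A[row * K + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    // Round the grid up so matrices of any size are covered.
    dim3 block(16, 16);                           // 256 threads, a multiple of 32
    dim3 grid((N + block.x - 1) / block.x,
              (M + block.y - 1) / block.y);
    matMul<<<grid, block>>>(dA, dB, dC, M, N, K);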

Maren answered 8/12, 2010 at 18:58 Comment(0)
H
111

In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.

A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32-bit registers, a finite amount of shared memory, a maximum number of active blocks, and a maximum number of active threads. These limits depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.
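
You can also query these limits at runtime instead of looking them up. A minimal sketch using cudaGetDeviceProperties, printing just a few of the relevant fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // device 0
        printf("SMs:               %d\n", prop.multiProcessorCount);
        printf("warp size:         %d\n", prop.warpSize);
        printf("max threads/block: %d\n", prop.maxThreadsPerBlock);
        printf("max threads/SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        printf("registers/block:   %d\n", prop.regsPerBlock);
        printf("shared mem/block:  %zu\n", prop.sharedMemPerBlock);
        return 0;
    }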

First of all, your thread block size should always be a multiple of 32, because the hardware issues instructions in warps of 32 threads. For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads (two warps) and the extra 14 are simply wasted.

Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, on a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means if you have 128 threads per block, you can fit 16 blocks on your SM before hitting the 2048-thread limit. If you use 256 threads per block, you can only fit 8 blocks, but you're still using all of the available threads and will still have full occupancy. However, using 64 threads per block leaves only 1024 threads active when the 16-block limit is hit, so only 50% occupancy. If shared memory and register usage are not a bottleneck, this should be your main concern (other than your data dimensions).
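
Since CUDA 6.5 the runtime can do this bookkeeping for you. A sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (myKernel stands in for your own kernel):

    int blockSize = 256;                      // candidate block size
    int numBlocks = 0;                        // active blocks per SM at that size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, blockSize, 0);  // 0 bytes of dynamic shared memory
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(numBlocks * blockSize)
                    / prop.maxThreadsPerMultiProcessor;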

On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and then the remaining blocks are placed into a pipeline. Blocks are moved into the SMs for processing as soon as there are enough resources in that SM to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that having smaller blocks (128 instead of 256 in the previous example) may complete faster since a particularly slow block will hog fewer resources, but this is very much dependent on the code.

Regarding registers and shared memory, look at those next, as they may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the Wikipedia page.
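
To see what a kernel actually consumes, compile with nvcc -Xptxas -v, or query it at runtime. A sketch using cudaFuncGetAttributes (again, myKernel stands in for yours):

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers/thread:    %d\n", attr.numRegs);
    printf("static shared/block: %zu\n", attr.sharedSizeBytes);
    printf("max threads/block:   %d\n", attr.maxThreadsPerBlock);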

Haggard answered 16/10, 2012 at 19:11 Comment(5)
Why can't we calculate blocks/threads from the device capability? Suppose I have a device with compute capability 2.1 (GT 520), so it has 48 SMs, 8 blocks each and 1024 threads per block. I can't get it, sorry if my question is silly. - Janetjaneta
Is there a way to determine the maximum number of threads per block programmatically using some CUDA API? - Burford
cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html - Recursion
@Haggard You say "the blocks in your grid are spread out over the SMs to start". Is this something the programmer has to do, or is the GPU responsible for spreading the blocks over the SMs? In the case of the latter: if I run a kernel on 32 blocks while using hardware that has 32 SMs, is there a guarantee that each SM receives one block? - Nightingale
@Nightingale You have no guarantee; the GPU does it by itself. Several kernels can run at the same time (in different streams), so not all SMs may be available at that moment. Notebook GPUs can deactivate some SMs for power management. With a little inline PTX assembly you can read the SM id a kernel is running on (%smid) and also find out the warp id (%warpid) within an SM. SMs are internally partitioned into 4 SMSP units, and each warp runs on a different one. - Frisian
https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html

The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail...
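
The same calculation is also exposed programmatically: cudaOccupancyMaxPotentialBlockSize suggests a block size for a given kernel. A sketch (myKernel and n are placeholders):

    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    // blockSize is a block size that maximizes occupancy for myKernel;
    // minGridSize is the smallest grid that can fully occupy the device.
    int gridSize = (n + blockSize - 1) / blockSize;   // cover n elements
    myKernel<<<gridSize, blockSize>>>(/* args */);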

Madeline answered 9/12, 2010 at 4:54 Comment(1)
404 - Not Found. developer.download.nvidia.com/compute/cuda/3_2_prod/sdk/docs/… seems to work. Also an online version: xmartlabs.github.io/cuda-calculator - Lotuseater
With rare exceptions, you should use a constant number of threads per block. The number of blocks per grid is then determined by the problem size, such as the matrix dimensions in the case of matrix multiplication.

Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works. A good choice for a first attempt is 128 or 256.
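
In practice that means the block size is a fixed constant and only the grid scales with the problem. A minimal 1D sketch (myKernel, d_data, and n are placeholders):

    const int threadsPerBlock = 256;   // fixed, a multiple of 32
    // Ceiling division: enough blocks to cover all n elements.
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);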

Gang answered 8/12, 2010 at 19:20 Comment(1)
Could you give a better explanation of your reasoning for a constant number of threads per block? (Or even link a relevant article.) Much thanks. - Vaseline
You also need to consider shared memory, because threads in the same block can access the same shared memory. If you're designing something that requires a lot of shared memory, then more threads per block might be advantageous.

For example, in terms of context switching, any multiple of 32 works just the same, so for the 1D case, launching 1 block with 64 threads or 2 blocks with 32 threads each makes no difference for global memory accesses. However, if the problem at hand naturally decomposes into one length-64 vector, then the first option will be better than the second (less memory overhead, and every thread can access the same shared memory).
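
To illustrate, a sketch of the length-64 case: one block of 64 threads can stage the whole vector in shared memory, which two 32-thread blocks could not do, since shared memory is per-block (process64 is a made-up example kernel):

    __global__ void process64(const float* in, float* out) {
        __shared__ float buf[64];        // visible to all 64 threads in this block
        int i = threadIdx.x;
        buf[i] = in[i];
        __syncthreads();                 // make every element visible block-wide
        out[i] = buf[i] + buf[63 - i];   // e.g. combine with the mirrored element
    }

    // Launched as a single block of 64 threads:
    // process64<<<1, 64>>>(d_in, d_out);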

Clava answered 8/11, 2011 at 20:3 Comment(0)
There is no silver bullet. The best number of threads per block depends a lot on the characteristics of the specific application being parallelized. CUDA's design guide recommends using a small number of threads per block when a function offloaded to the GPU has several barriers; however, there are experiments showing that for some applications a small number of threads per block increases the overhead of synchronization. In contrast, a larger number of threads per block may decrease the number of synchronizations and improve overall performance.
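
As a concrete illustration of where those barriers come from, here is a sketch of a barrier-heavy kernel (a standard block-level sum reduction, not taken from the article). Every __syncthreads() is a block-wide barrier, so the choice of threads per block directly affects both the cost of each barrier and how many are executed:

    __global__ void blockSum(const float* in, float* out, int n) {
        extern __shared__ float buf[];               // one float per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                             // barrier before reducing
        // Tree reduction; assumes blockDim.x is a power of two.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();                         // one barrier per step
        }
        if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
    }

    // Launch with per-block shared memory sized to the block:
    // blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);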

For an in-depth discussion (too lengthy for StackOverflow) of the impact of the number of threads per block on CUDA kernels, check this journal article; it shows tests of different configurations of the number of threads per block in the NPB (NAS Parallel Benchmarks) suite, a set of CFD (Computational Fluid Dynamics) applications.

Cleveland answered 20/12, 2021 at 19:2 Comment(0)
