Dynamic parallelism - launching many small kernels is very slow

I am trying to use dynamic parallelism to improve an algorithm I have in CUDA. In my original CUDA solution, every thread computes a number that is common to its whole block. What I want to do is first launch a coarse (or low-resolution) kernel in which threads compute the common value just once, as if each thread represented one block. Then each thread creates a small grid of one block (16x16 threads) and launches a child kernel on it, passing in the common value. In theory this should be faster, because it saves many redundant operations. But in practice the solution runs very slowly, and I don't know why.

This is the code, very simplified, just to show the idea:

__global__ void coarse_kernel( parameters ){
    // Each parent thread computes the value that is common to one block...
    int common_val = compute_common_val();

    // ...and launches a single 16x16 child block that consumes it.
    dim3 dimblock(16, 16, 1);
    dim3 dimgrid(1, 1, 1);
    child_kernel <<< dimgrid, dimblock >>> (common_val, parameters);
}

__global__ void child_kernel( int common_val, parameters ){
    // Use the precomputed common value.
    do_computations(common_val, parameters);
}

The number of child kernels is large: one per parent thread, and there are around 400x400 parent threads. From what I understand, the GPU should process all these kernels in parallel, right?

Or are child kernels somehow processed sequentially?

My results show that performance is more than 10 times slower than in my original solution.

Novanovaculite asked 7/1, 2014 at 11:07
There is a cost in launching kernels, either parent or child. If your child kernels do not extract much parallelism and there is not much benefit over their non-parallel counterparts, then the modest benefit may be cancelled out by the child kernel launch overheads. – Wort
@JackOLantern But shouldn't these child kernels all execute in parallel to fill the GPU? Or do they execute one by one? – Novanovaculite
Let me explain in formulas what I meant above. Let t_o be the overhead to execute a child kernel, t_e its execution time, and t_s the time to execute the same code without the help of dynamic parallelism. The speedup arising from the use of dynamic parallelism is t_s / (t_o + t_e). Perhaps (though this cannot be inferred from your code) t_e < t_s but t_e, t_s << t_o, so that t_s / (t_o + t_e) is about t_s / t_o < 1, and you observe a slowdown instead of a speedup. – Wort
I understand, thanks. I will have to check the documentation about the overhead and also look at the limits on the number of child kernels executed in parallel. Most internet samples of dynamic parallelism show a linear recursion (just one child kernel launched from thread 0) or at most a binary branching. I will check whether I can reduce the number of child kernels and make the grids larger. The maximum number of simultaneous streams on Kepler GK110 is 32, I think; I guess I am exceeding that value by a lot. – Novanovaculite

There is a cost in launching kernels, either parent or child. If your child kernels do not extract much parallelism and there is not much benefit over their non-parallel counterparts, then the modest benefit may be cancelled out by the child kernel launch overheads.

In formulas, let t_o be the overhead to execute a child kernel, t_e its execution time, and t_s the time to execute the same code without the help of dynamic parallelism. The speedup arising from the use of dynamic parallelism is t_s / (t_o + t_e). Perhaps (though this cannot be inferred from your code) t_e < t_s but t_e, t_s << t_o, so that t_s / (t_o + t_e) is about t_s / t_o < 1, and you observe a slowdown instead of a speedup.
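
As a purely illustrative example (these numbers are assumptions, not measurements): with t_o = 10 µs, t_e = 2 µs and t_s = 5 µs, the child kernel is indeed faster than its non-parallel counterpart (t_e < t_s), yet the overall speedup is t_s / (t_o + t_e) = 5/12 ≈ 0.42, i.e. a slowdown of more than 2x, because the launch overhead dominates both execution times.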
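
Note also that if the only goal is to compute the common value once per block, dynamic parallelism is not needed at all: a single thread per block can compute it into shared memory and broadcast it to the rest of the block. Below is a minimal sketch of that alternative, keeping the placeholder parameters and the compute_common_val() / do_computations() helpers from your question:

__global__ void fused_kernel( parameters ){
    __shared__ int common_val;                // one copy per block, not per thread

    // A single thread computes the block-wide value once...
    if (threadIdx.x == 0 && threadIdx.y == 0)
        common_val = compute_common_val();

    // ...and the barrier makes it visible to all threads of the block.
    __syncthreads();

    do_computations(common_val, parameters);
}

Launched as one grid of 400x400 blocks of 16x16 threads, this saves the redundant per-thread computation without paying any child-kernel launch overhead.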

Wort answered 8/1, 2014 at 22:10
Jack, do you think that t_o actually blows up when more than 32 kernels are launched? – Novanovaculite
I don't mean that t_o actually blows up. I'm just giving an interpretation of what probably happens in your code, but there are too few details in your snippet above to provide a definite answer. If your child kernels do not have much to do, then the child kernel launch overhead may prevent you from observing any improvement from the use of dynamic parallelism. – Wort
It's all right. I will keep researching this, but I am starting to see the requirements for dynamic parallelism to work properly. Thanks for your help. – Novanovaculite
