Shall I pool CUDA streams?

How lightweight are the operations for creating and destroying CUDA streams? For CPU threads, for example, these operations are heavy, which is why CPU threads are usually pooled. Shall I pool CUDA streams too, or is it cheap enough to create a stream every time I need one and destroy it afterwards?
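In other words, the pattern in question is roughly the following (the dummy kernel is purely illustrative):

```cpp
// Create a stream just for this piece of work, use it, then destroy it.
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);            // how expensive is this call?
    dummyKernel<<<1, 1, 0, stream>>>();   // work issued on the fresh stream
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);            // ...and this one?
    return 0;
}
```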

Jacky answered 17/6, 2018 at 10:25

Guidance from NVIDIA is that you should pool CUDA streams. Here is a comment from the horse's mouth (https://github.com/pytorch/pytorch/issues/9646):

There is a cost to creating, retaining, and destroying CUDA streams in PyTorch master. In particular:

  • Tracking CUDA streams requires atomic refcounting
  • Destroying a CUDA stream can (rarely) cause implicit device synchronization

The refcounting issue has been raised as a concern for expanding stream tracing to allow streaming backwards, for example, and it's clearly best to avoid implicit device synchronization, as it causes an often unexpected performance degradation.

For static frameworks the recommended best practice is to create all the needed streams upfront and destroy them after the work is done. This pattern is not immediately applicable to PyTorch, but a per device stream pool would achieve a similar effect.
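
A minimal sketch of what such a per-device stream pool could look like (an illustration only, not PyTorch's actual implementation; the class name and pool size are made up):

```cpp
// Create a fixed set of streams upfront, hand them out round-robin,
// and destroy them only at shutdown.
#include <cuda_runtime.h>
#include <array>
#include <atomic>
#include <cstddef>

class StreamPool {
public:
    explicit StreamPool(int device) {
        cudaSetDevice(device);
        for (auto& s : streams_) cudaStreamCreate(&s);   // created once, upfront
    }
    ~StreamPool() {
        for (auto& s : streams_) cudaStreamDestroy(s);   // destroyed once, at the end
    }
    cudaStream_t next() {
        // Round-robin hand-out; no per-request create/destroy cost.
        return streams_[counter_.fetch_add(1) % streams_.size()];
    }
private:
    std::array<cudaStream_t, 8> streams_{};
    std::atomic<std::size_t> counter_{0};
};
```

Callers would then request a stream with next() for each piece of work instead of creating and destroying one per operation.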

Whitefish answered 22/10, 2018 at 16:58

Comment: This looks to me like a wrong answer, see forums.developer.nvidia.com/t/… tl;dr: creating streams is lightweight, and it is OK to create a few streams a couple of times per second. – Daydream

It probably doesn't matter whether creating streams is fast or not. Creating them once and reusing them will always be faster than continually creating and destroying them.

Whether amortizing that latency is actually important depends on your application much more than anything else.
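
One rough way to find out is to time both patterns for a workload similar to yours; something like the following (an assumed setup, with an arbitrary no-op kernel and iteration count):

```cpp
// Compare N launches with a fresh stream each time against N launches
// on a single reused stream.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void noop() {}

int main() {
    const int N = 10000;
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    for (int i = 0; i < N; ++i) {
        cudaStream_t s;
        cudaStreamCreate(&s);                // stream created every iteration
        noop<<<1, 1, 0, s>>>();
        cudaStreamDestroy(s);                // and destroyed every iteration
    }
    cudaDeviceSynchronize();
    auto t1 = clock::now();

    cudaStream_t reused;
    cudaStreamCreate(&reused);
    for (int i = 0; i < N; ++i) {
        noop<<<1, 1, 0, reused>>>();         // same stream reused every iteration
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(reused);
    auto t2 = clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("create/destroy per launch: %lld us\n", (long long)us(t0, t1));
    std::printf("single reused stream:      %lld us\n", (long long)us(t1, t2));
    return 0;
}
```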

Saltern answered 17/6, 2018 at 10:25

Comment: It does matter, because code complexity and the time spent isolating this specific performance effect both matter, and both can be avoided if creating streams has negligible overhead. – Daydream
