How lightweight are the operations for creating and destroying CUDA streams? For CPU threads, for example, these operations are heavy, which is why thread pools are commonly used. Should I pool CUDA streams too? Or is it fast enough to create a stream every time I need one and destroy it afterwards?
Guidance from NVIDIA is that you should pool CUDA streams. Here is a comment from the horse's mouth, https://github.com/pytorch/pytorch/issues/9646:
> There is a cost to creating, retaining, and destroying CUDA streams in PyTorch master. In particular:
>
> - Tracking CUDA streams requires atomic refcounting
> - Destroying a CUDA stream can (rarely) cause implicit device synchronization
>
> The refcounting issue has been raised as a concern for expanding stream tracing to allow streaming backwards, for example, and it's clearly best to avoid implicit device synchronization as it causes an often unexpected performance degradation.
>
> For static frameworks the recommended best practice is to create all the needed streams upfront and destroy them after the work is done. This pattern is not immediately applicable to PyTorch, but a per device stream pool would achieve a similar effect.
It probably doesn't matter whether creating streams is fast or not. Creating them once and reusing them will always be faster than continually creating and destroying them.
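For illustration, here is a minimal sketch of the "create upfront, reuse, destroy at shutdown" pattern in CUDA C++. The `StreamPool` class, its size, and the round-robin handout are my own example, not PyTorch's actual pool:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Illustrative stream pool: streams are created once, handed out
// round-robin, and destroyed only when the pool itself is destroyed,
// so any implicit synchronization on destruction is paid exactly once.
class StreamPool {
public:
    explicit StreamPool(std::size_t n) : streams_(n) {
        for (auto &s : streams_)
            cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
    ~StreamPool() {
        for (auto &s : streams_)
            cudaStreamDestroy(s);
    }
    // Hand out the next stream; callers share the fixed set of streams,
    // so no per-use refcounting is needed.
    cudaStream_t next() {
        return streams_[next_++ % streams_.size()];
    }
private:
    std::vector<cudaStream_t> streams_;
    std::size_t next_ = 0;
};
```

A kernel launch would then pass `pool.next()` as its stream argument instead of creating and destroying a fresh stream around every launch.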
Whether amortizing that latency actually matters depends far more on your application than on anything else.
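If you want to check how much it matters for your workload, a rough micro-benchmark along these lines gives a ballpark; the iteration count and the use of `cudaStreamQuery` as a stand-in for "reusing" a stream are arbitrary choices for this sketch:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const int kIters = 1000;  // arbitrary iteration count for averaging

    cudaFree(0);  // force context creation so it isn't counted below

    // Cost of a create/destroy pair per use.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaStreamDestroy(s);
    }
    auto t1 = std::chrono::steady_clock::now();

    // Cost of reusing one stream created up front.
    cudaStream_t s;
    cudaStreamCreate(&s);
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        cudaStreamQuery(s);  // cheap per-use touch of an existing stream
    }
    auto t3 = std::chrono::steady_clock::now();
    cudaStreamDestroy(s);

    printf("create+destroy: %.3f us/iter\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count() / kIters);
    printf("reuse (query):  %.3f us/iter\n",
           std::chrono::duration<double, std::micro>(t3 - t2).count() / kIters);
    return 0;
}
```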