CUDA - how much slower is transferring over PCI-E?

Asked 18/7, 2013 at 16:41 Answered 22/7, 2020 at 6:13

If I transfer a single byte from a CUDA kernel over PCI-E to the host (zero-copy memory), how slow is that compared to transferring something like 200 megabytes?

I know that transferring over PCI-E is slow for a CUDA kernel. What I would like to know is: does it make any difference whether I transfer a single byte or a huge amount of data? Or, since memory transfers are performed in "bulks", is transferring a single byte extremely expensive and wasteful relative to transferring 200 MB?

Northeastward answered 18/7, 2013 at 16:41 Comment(4)

The bandwidth test example which has shipped with CUDA forever is specifically designed to answer this question. – Cutoff 18/7, 2013 at 17:7

I don't have a CUDA GPU right now; can you give me a hint on the results? – Northeastward 18/7, 2013 at 17:31

This has to do with the overhead of launching a transfer request. For example, 200 separate 1 MB requests will be slower than a single 200 MB transfer. – Noncombatant 18/7, 2013 at 18:3

If you have a large amount of data to transfer to the GPU for processing, it is best to look into two concepts: 1) streams and 2) async copy. Here is code for checking the bandwidth that you might want to look into. – Bousquet 19/7, 2013 at 2:28

I hope this picture explains everything. The data was generated by bandwidthTest in the CUDA samples. The hardware environment is PCI-E v2.0, a Tesla M2090 and 2x Xeon E5-2609. Please note that both axes are in log scale.

Given this figure, we can see that launching a transfer request costs a constant amount of time regardless of size. Regression analysis on the data gives an estimated overhead of 4.9 us for H2D (host to device), 3.3 us for D2H (device to host) and 3.0 us for D2D (device to device).

[Figure: bandwidthTest results for H2D, D2H and D2D transfers; both axes in log scale]

Bogosian answered 19/7, 2013 at 12:20 Comment(2)

I don't understand this chart very well. For example, which one takes more time (in total time, not in speed): a transfer of 1 byte or a transfer of 100 bytes? – Selfpropelled 17/8, 2017 at 13:7

@étale-cohomology For 1 byte and 100 bytes, the total times are almost the same, because the constant overhead accounts for most of the total time. – Bogosian 17/10, 2017 at 5:36


A latency plot would be clearer in this case. Small transactions aren't more expensive than big ones; the only problem with them is that they can't saturate the bus. It is therefore possible to transfer bigger messages in almost the same time. That is why transferring one 512 KB message is 120 times faster than transferring 512 separate 1 KB transactions. The saturation point of PCIe depends on the lane count. You can find more details about PCIe features from the CUDA point of view here.

Benz answered 22/7, 2020 at 6:13 Comment(0)
