Is CPU to GPU data transfer slow in TensorFlow?
I've tested CPU-to-GPU data transfer throughput with TensorFlow, and it seems to be significantly lower than in PyTorch: between 2x and 5x slower for large tensors. In TF, I reach maximum speed with 25 MB tensors (~4 GB/s), and it drops to 2 GB/s as tensor size increases. PyTorch's transfer speed grows with tensor size and saturates at 9 GB/s (25 MB tensors). The behavior is consistent on an RTX 2080 Ti and a GTX 1080 Ti, and with TF 2.4 and 2.6.

Am I doing something wrong? Is there a way to match the data throughput of PyTorch? I'm not just looking to hide the latency, e.g. with async queues; I'd like to get the full data bandwidth.

Results for batches of 256x256x3 images in TF (averaged over 100 transfers):

Code: tf.cast(x, dtype=tf.float32)[0, 0]
Batch size 1; Batch time 0.0005; BPS 1851.8; FPS 1851.8; MB/S 364.1
Batch size 2; Batch time 0.0004; BPS 2223.5; FPS 4447.1; MB/S 874.3
Batch size 4; Batch time 0.0006; BPS 1555.2; FPS 6220.6; MB/S 1223.0
Batch size 8; Batch time 0.0006; BPS 1784.8; FPS 14278.7; MB/S 2807.3
Batch size 16; Batch time 0.0013; BPS 755.3; FPS 12084.7; MB/S 2376.0
Batch size 32; Batch time 0.0023; BPS 443.8; FPS 14201.3; MB/S 2792.1
Batch size 64; Batch time 0.0035; BPS 282.5; FPS 18079.5; MB/S 3554.6
Batch size 128; Batch time 0.0061; BPS 163.4; FPS 20916.4; MB/S 4112.3
Batch size 256; Batch time 0.0241; BPS 41.5; FPS 10623.0; MB/S 2088.6
Batch size 512; Batch time 0.0460; BPS 21.7; FPS 11135.8; MB/S 2189.4

The same measurements with PyTorch:

Code: torch.from_numpy(x).to(self.device).type(torch.float32)[0, 0].cpu()
Batch size 1; Batch time 0.0001; BPS 10756.6; FPS 10756.6; MB/S 2114.8
Batch size 1; Batch time 0.0001; BPS 12914.7; FPS 12914.7; MB/S 2539.1
Batch size 2; Batch time 0.0001; BPS 10204.4; FPS 20408.7; MB/S 4012.5
Batch size 4; Batch time 0.0002; BPS 5841.1; FPS 23364.3; MB/S 4593.6
Batch size 8; Batch time 0.0003; BPS 3994.4; FPS 31955.4; MB/S 6282.7
Batch size 16; Batch time 0.0004; BPS 2713.8; FPS 43421.3; MB/S 8537.0
Batch size 32; Batch time 0.0007; BPS 1486.3; FPS 47562.7; MB/S 9351.2
Batch size 64; Batch time 0.0015; BPS 679.3; FPS 43475.9; MB/S 8547.7
Batch size 128; Batch time 0.0028; BPS 359.5; FPS 46017.7; MB/S 9047.5
Batch size 256; Batch time 0.0054; BPS 185.2; FPS 47404.1; MB/S 9320.0
Batch size 512; Batch time 0.0108; BPS 92.9; FPS 47564.5; MB/S 9351.6

The full code to reproduce the measurements is:

import time
import numpy as np
import tensorflow as tf
import torch
import argparse


def parseargs():
    parser = argparse.ArgumentParser(description='Test GPU transfer speed in TensorFlow (default) and PyTorch.')
    parser.add_argument('--pytorch', action='store_true', help='Use PyTorch instead of TensorFlow')
    args = parser.parse_args()
    return args


class TimingModelTF(tf.keras.Model):
    def __init__(self):
        super(TimingModelTF, self).__init__()

    @tf.function
    def call(self, x):
        # Casting on the GPU forces the uint8 batch to be copied host-to-device;
        # the [0, 0] slice keeps the copy of the result back to the host tiny.
        return tf.cast(x, dtype=tf.float32)[0, 0]


class TimingModelTorch(torch.nn.Module):
    def __init__(self):
        super(TimingModelTorch, self).__init__()
        self.device = torch.device('cuda')

    def forward(self, x):
        with torch.no_grad():
            # Copy to the GPU, cast, slice one element and copy it back;
            # .cpu() blocks until the transfer has finished.
            return torch.from_numpy(x).to(self.device).type(torch.float32)[0, 0].cpu()


if __name__ == '__main__':
    args = parseargs()
    width = 256
    height = 256
    channels = 3
    iterations = 100
    model = TimingModelTorch() if args.pytorch else TimingModelTF()

    for batch_size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
        img = np.random.randint(5, size=(batch_size, height, width, channels), dtype=np.uint8)

        # Warm-up: the first call traces the tf.function / initializes CUDA.
        result = model(img)
        result.numpy()

        start = time.time()
        for i in range(iterations):
            result = model(img)
            result.numpy()  # materialize on the host so the transfer is timed
        batch_time = (time.time() - start) / iterations
        print(f'Batch size {batch_size}; Batch time {batch_time:.4f}; BPS {1 / batch_time:.1f}; '
              f'FPS {(1 / batch_time) * batch_size:.1f}; '
              f'MB/S {(((1 / batch_time) * batch_size) * 256 * 256 * 3) / 1000000:.1f}')
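
Run the script with no arguments to benchmark TensorFlow, or pass --pytorch to benchmark PyTorch.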

Rowdyish answered 23/11, 2021 at 13:16 (4 comments):
Probably PyTorch is using pinned buffers, and TensorFlow can still pipeline multiple operations to get close to pinned-buffer performance. — Abiogenesis
I'm not sure I understand. The code does not use pinned (host) memory; it is a numpy array, which is definitely paged. And how would pipelining improve CPU-GPU throughput? My understanding of pinned memory is from developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc — Pacheco
Pinning the array to the GPU, not the CPU, should avoid unnecessary copies in TF. For PyTorch, .cpu() returns the original object without a copy if it is already on the CPU. — Abiogenesis
OK. Pin to GPU = copy all your data to the GPU, keep it there, and use only that data. That does not help by itself; the data does not fit into GPU memory. The question remains: can I get data to the GPU faster than in the posted code? In the code, .cpu() is used to get data back to the host from the device; I don't understand the related comment. — Pacheco
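
For concreteness, here is a minimal sketch of the pinned-memory transfer the comments allude to; pin_memory() and non_blocking=True are standard PyTorch APIs, but whether they close the gap for this particular workload is untested here:

import numpy as np
import torch

device = torch.device('cuda')
x = np.random.randint(5, size=(256, 256, 256, 3), dtype=np.uint8)  # ~50 MB batch

# Copy the batch into page-locked (pinned) host memory; the GPU's DMA engine
# can then read it directly, without an intermediate staging copy.
pinned = torch.from_numpy(x).pin_memory()

# non_blocking=True allows the host-to-device copy to overlap with host code;
# it only takes effect when the source tensor is pinned.
gpu_tensor = pinned.to(device, non_blocking=True).float()
torch.cuda.synchronize()  # wait for the async copy before using/timing the result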

If the TensorFlow function is JIT-compiled, the throughput will increase: certain operations will be fused, and intermediate values will not be written to memory, which reduces pressure on memory bandwidth. To highlight a relevant snippet from the documentation:

Fusion is XLA's single most important optimization. Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance.

In your example, we can accomplish this by adding jit_compile=True to the tf.function decorator applied to the call method.

class TimingModelTF(tf.keras.Model):
    def __init__(self, ):
        super(TimingModelTF, self).__init__()

    @tf.function(jit_compile=True)
    def call(self, x):
        return tf.cast(x, dtype=tf.float32)[0, 0]

Note: For TensorFlow 2.4 and below, change this to experimental_compile=True. Details about the deprecation of that keyword argument can be found here.
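
If the script has to run across TF versions, the keyword can be chosen at runtime; the version check below is an illustrative sketch, not part of the original benchmark:

import tensorflow as tf

# Select the JIT keyword by TF version: jit_compile replaced
# experimental_compile starting with TF 2.5 (per the note above).
major, minor = (int(v) for v in tf.__version__.split('.')[:2])
jit_kwargs = {'jit_compile': True} if (major, minor) >= (2, 5) else {'experimental_compile': True}

@tf.function(**jit_kwargs)
def to_float(x):
    return tf.cast(x, dtype=tf.float32)[0, 0]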

On a GTX 1060, the results for the original test:

Batch size 1; Batch time 0.0005; BPS 2040.5; FPS 2040.5; MB/S 401.2
Batch size 2; Batch time 0.0007; BPS 1521.3; FPS 3042.5; MB/S 598.2
Batch size 4; Batch time 0.0006; BPS 1602.7; FPS 6410.8; MB/S 1260.4
Batch size 8; Batch time 0.0009; BPS 1112.5; FPS 8900.0; MB/S 1749.8
Batch size 16; Batch time 0.0013; BPS 760.9; FPS 12174.9; MB/S 2393.7
Batch size 32; Batch time 0.0020; BPS 498.8; FPS 15962.6; MB/S 3138.4
Batch size 64; Batch time 0.0034; BPS 290.2; FPS 18575.1; MB/S 3652.0
Batch size 128; Batch time 0.0063; BPS 158.0; FPS 20222.4; MB/S 3975.9
Batch size 256; Batch time 0.0297; BPS 33.6; FPS 8607.2; MB/S 1692.3
Batch size 512; Batch time 0.0595; BPS 16.8; FPS 8609.1; MB/S 1692.6

Throughput peaks at around 4 GB/s. The results with the function JIT-compiled:

Batch size 1; Batch time 0.0006; BPS 1610.8; FPS 1610.8; MB/S 316.7
Batch size 2; Batch time 0.0007; BPS 1500.6; FPS 3001.1; MB/S 590.0
Batch size 4; Batch time 0.0006; BPS 1744.3; FPS 6977.1; MB/S 1371.8
Batch size 8; Batch time 0.0009; BPS 1114.2; FPS 8913.9; MB/S 1752.5
Batch size 16; Batch time 0.0013; BPS 788.1; FPS 12609.8; MB/S 2479.2
Batch size 32; Batch time 0.0018; BPS 556.9; FPS 17820.8; MB/S 3503.7
Batch size 64; Batch time 0.0019; BPS 518.5; FPS 33184.4; MB/S 6524.3
Batch size 128; Batch time 0.0054; BPS 186.1; FPS 23818.1; MB/S 4682.8
Batch size 256; Batch time 0.0291; BPS 34.4; FPS 8806.2; MB/S 1731.4
Batch size 512; Batch time 0.0567; BPS 17.6; FPS 9034.3; MB/S 1776.2

Throughput peaks at around 6.5 GB/s; the rate may be higher on bigger or newer GPUs.

For reference, the rate peaked at around 7 GB/s when running the PyTorch test:

Batch size 1; Batch time 0.0001; BPS 13396.1; FPS 13396.1; MB/S 2633.8
Batch size 2; Batch time 0.0001; BPS 9231.2; FPS 18462.5; MB/S 3629.9
Batch size 4; Batch time 0.0002; BPS 5752.5; FPS 23009.9; MB/S 4523.9
Batch size 8; Batch time 0.0003; BPS 3463.8; FPS 27710.1; MB/S 5448.0
Batch size 16; Batch time 0.0005; BPS 2027.8; FPS 32444.5; MB/S 6378.8
Batch size 32; Batch time 0.0010; BPS 1040.9; FPS 33308.6; MB/S 6548.7
Batch size 64; Batch time 0.0019; BPS 533.7; FPS 34155.2; MB/S 6715.2
Batch size 128; Batch time 0.0036; BPS 274.0; FPS 35069.0; MB/S 6894.8
Batch size 256; Batch time 0.0072; BPS 138.4; FPS 35425.8; MB/S 6965.0
Batch size 512; Batch time 0.0145; BPS 69.1; FPS 35391.0; MB/S 6958.2
Chancy answered 1/12, 2021 at 20:16 (2 comments):
This is interesting. I thought this would not have any effect in this case. I'll check it on my machines and validate that it actually helps when the network does something useful. Interestingly, the transfer rate still drops for larger batches (3.6x compared to the peak value). Does this mean that I have to optimize tensor size? Would I have to split larger batches? Batch size 256 is only 50 MB! — Pacheco
There are other optimizations that could be made through parameters of tf.function (tensorflow.org/api_docs/python/tf/function#args) which may further improve performance for certain use cases, but I don't know if they are relevant here. For example, supplying input_signature with the known shapes of the tensors being passed to the function could reduce tracing, but that primarily helps if you're providing multiple tensors with different shapes. If those options don't help, you may need to perform additional optimizations on your end. — Chancy
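
As a sketch of the input_signature suggestion from the comment above (the shape mirrors the benchmark's 256x256x3 uint8 images; illustrative only):

class TimingModelTF(tf.keras.Model):
    @tf.function(
        jit_compile=True,
        # Fixing the signature avoids retracing when only values change;
        # the batch dimension is left dynamic (None).
        input_signature=[tf.TensorSpec(shape=[None, 256, 256, 3], dtype=tf.uint8)])
    def call(self, x):
        return tf.cast(x, dtype=tf.float32)[0, 0]

Note that XLA compiles one executable per concrete input shape, so each distinct batch size still triggers its own compilation even with the dynamic batch dimension in the signature.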
