Efficiently transfer large file (up to 2GB) to CUDA GPU?

I'm working on a GPU-accelerated program that needs to read an entire file of variable size. My question is: what is the optimal number of bytes to read from the file and transfer to the coprocessor (CUDA device) at a time?

These files could be as large as 2GiB, so creating a buffer of that size doesn't seem like the best idea.

Scotism answered 16/3, 2012 at 3:2 Comment(0)

You can cudaMalloc a buffer of the largest size your device allows. After that, copy chunks of your input data of that size from host to device, process them, copy back the results, and continue.

// Your input data on host
int hostBufNum = 5600000;
int* hostBuf   = ...;

// Assume this is the largest device buffer you can allocate
int devBufNum = 1000000;
int* devBuf;

cudaMalloc( &devBuf, sizeof( int ) * devBufNum );

int* hostChunk = hostBuf;
int  hostLeft  = hostBufNum;

while( hostLeft > 0 )
{
    // Never copy more than fits in the device buffer,
    // and never read past the end of the host buffer
    int chunkNum = ( hostLeft < devBufNum ) ? hostLeft : devBufNum;

    cudaMemcpy( devBuf, hostChunk, chunkNum * sizeof( int ), cudaMemcpyHostToDevice );
    doSomethingKernel<<< numBlocks, threadsPerBlock >>>( devBuf, chunkNum );   // your launch configuration
    // ... copy results back to the host here if the kernel produces output ...

    hostChunk += chunkNum;
    hostLeft  -= chunkNum;
}
Melonymelos answered 16/3, 2012 at 3:7 Comment(5)
That part I was already planning, but what size should the chunks of input data be? – Scotism
The size of the largest array you can allocate on the device. – Melonymelos
You could consider using async memcopies of somewhat smaller chunks than will fit in memory (at most half) and processing chunk k in parallel with transferring chunk k-1 back to the host and transferring chunk k+1 from the host to the device. Bidirectional overlap requires a Tesla GPU, but you can overlap one direction even on GeForce. – Grubbs
Also, you can use cuMemGetInfo to get the amount of available memory. forums.nvidia.com/index.php?showtopic=102339 – Donatus
@JasonR.Mick: since about CUDA 3.1 there has been cudaMemGetInfo in the runtime API, which does the same thing but saves having to mix the runtime and driver APIs in host code. – Billbillabong
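
For reference, a minimal sketch of the query those last two comments describe, using the runtime API's cudaMemGetInfo; the 90% headroom factor is just an illustrative choice, not something from the comments:

#include <cuda_runtime.h>
#include <stdio.h>

int main( void )
{
    size_t freeBytes, totalBytes;

    // Ask the runtime how much device memory is currently free
    cudaError_t err = cudaMemGetInfo( &freeBytes, &totalBytes );
    if( err != cudaSuccess )
    {
        fprintf( stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString( err ) );
        return 1;
    }

    // Leave some headroom rather than allocating every last byte (arbitrary 90%)
    size_t devBufBytes = ( size_t )( freeBytes * 0.9 );
    printf( "free: %zu MiB, total: %zu MiB, device buffer: %zu MiB\n",
            freeBytes >> 20, totalBytes >> 20, devBufBytes >> 20 );

    return 0;
}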

If you can split your work up so that independent chunks are processed on the card, you should look into using CUDA streams (cudaStream_t).

If you schedule loads and kernel executions in several streams, one stream can be loading data while another executes a kernel on the card, hiding some of the data-transfer time behind kernel execution.

You need to declare a device buffer of your chunk size times however many streams you use (up to 16 for compute capability 1.x, as far as I know).
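
A minimal sketch of that pattern with two streams; the chunk size, the doSomethingKernel kernel, and the launch configuration are placeholders, and the host buffer is assumed to be pinned (allocated with cudaMallocHost) so the async copies can actually overlap with kernel execution:

#include <cuda_runtime.h>

__global__ void doSomethingKernel( int* data, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        data[i] *= 2;   // placeholder work
}

void processInChunks( int* hostBuf, int hostBufNum )   // hostBuf allocated with cudaMallocHost
{
    const int chunkNum   = 1000000;   // placeholder chunk size (elements)
    const int numStreams = 2;

    cudaStream_t stream[numStreams];
    int*         devBuf[numStreams];

    for( int s = 0; s < numStreams; ++s )
    {
        cudaStreamCreate( &stream[s] );
        cudaMalloc( &devBuf[s], chunkNum * sizeof( int ) );
    }

    int s = 0;
    for( int offset = 0; offset < hostBufNum; offset += chunkNum )
    {
        int thisChunk = ( hostBufNum - offset < chunkNum ) ? hostBufNum - offset : chunkNum;

        // Each chunk's upload, kernel and download go into one stream, so the
        // copies of one chunk can overlap the kernel of another chunk
        cudaMemcpyAsync( devBuf[s], hostBuf + offset, thisChunk * sizeof( int ),
                         cudaMemcpyHostToDevice, stream[s] );
        doSomethingKernel<<< ( thisChunk + 255 ) / 256, 256, 0, stream[s] >>>( devBuf[s], thisChunk );
        cudaMemcpyAsync( hostBuf + offset, devBuf[s], thisChunk * sizeof( int ),
                         cudaMemcpyDeviceToHost, stream[s] );

        s = ( s + 1 ) % numStreams;
    }

    cudaDeviceSynchronize();   // wait for all streams to finish

    for( int t = 0; t < numStreams; ++t )
    {
        cudaStreamDestroy( stream[t] );
        cudaFree( devBuf[t] );
    }
}

Note that cudaMemcpyAsync on ordinary pageable host memory silently falls back to synchronous behavior, which is why the pinned allocation matters here.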

Governorship answered 27/3, 2012 at 1:49 Comment(0)
