UPDATE
I found an error in my code: my render function was still set to run in sub-blocks (a setting from years ago I had forgotten about), so it was calling the GPU read function far more often than I thought. Sorry.
ISSUE
I have recently tried adding OpenCL to an audio synthesiser that would benefit from GPU processing (due to highly parallelized math in the processing). However, I have found that even just trying to read from the GPU once per audio buffer (not even once per sample) is crippling performance and not usable.
CURRENT METHOD
I am using the OpenCL Wrapper project here: https://github.com/ProjectPhysX/OpenCL-Wrapper
Simply creating a small Memory<float> test object of 20-125 floats with it once on project initialization, and then calling test.read_from_device() once per audio buffer while doing nothing else, causes stuttering in the audio.
The OpenCL Wrapper function for this is:
inline void read_from_device(const bool blocking=true, const vector<Event>* event_waitlist=nullptr, Event* event_returned=nullptr) {
	// copies the device buffer back into the host buffer; blocks by default until the transfer completes
	if(host_buffer_exists&&device_buffer_exists) cl_queue.enqueueReadBuffer(device_buffer, blocking, 0ull, capacity(), (void*)host_buffer, event_waitlist, event_returned);
}
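For context, the whole test boils down to roughly the following (a minimal sketch based on the wrapper's README usage pattern; the device selection, the 125-float size, and the process_buffer() callback are placeholders for illustration, not my actual synth code):

#include "opencl.hpp" // ProjectPhysX OpenCL-Wrapper

// done once at synth initialization
Device device(select_device_with_most_flops()); // pick the fastest available OpenCL device
Memory<float> test(device, 125u);               // small host+device buffer of 20-125 floats

// called once per audio buffer, i.e. roughly every 23 ms at 44.1 kHz / 1024 samples
void process_buffer(float* output, const int num_samples) {
	test.read_from_device(); // blocking copy of the whole (tiny) buffer from GPU to host
	// nothing else is done with test[] here, yet the audio already stutters
	(void)output; (void)num_samples;
}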
REQUIREMENTS
Audio typically runs at 44100 samples per second, and buffer sizes up to around 1024 samples per buffer are acceptable. Thus if we process one full buffer at a time on the GPU, we need to read from the GPU smoothly at least 43 times per second, i.e. once every ~23 ms.
43 reads per second is well below the 60-120 fps a GPU typically renders at, so I think this should not be too unrealistic.
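To make the timing budget concrete, here is the arithmetic behind those numbers (a trivial standalone calculation; 44100 Hz and 1024-sample buffers as stated above):

#include <cstdio>

int main() {
	const double sample_rate = 44100.0; // samples per second
	const int    buffer_size = 1024;    // samples per audio buffer
	const double buffers_per_second = sample_rate / buffer_size;          // ~43.07 reads/s
	const double deadline_ms        = 1000.0 * buffer_size / sample_rate; // ~23.2 ms per buffer
	printf("%.2f buffers/s, %.2f ms per buffer\n", buffers_per_second, deadline_ms);
	return 0;
}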
OTHER TESTS
I have read this thread, which suggests I am not alone with this problem: GPU audio processing
In particular, there is this reply:
Sorry, going to disappoint you straight away. I have tried using NVidia CUDA (the native library) for audio processing, using neural networks. It's what my company does for a living, so we're pretty competent. We found that the typical NVidia card has too much latency. They're fast, that's not the problem, but that means they can do many million operations in a millisecond. However, the DMA engine feeding data to the card typically has latencies that are many milliseconds. Not so bad for video, bad for audio - video often is 60 Hz whereas audio can be 48000 Hz.
(Note that he is talking about sending every sample back and forth to the GPU individually, rather than processing one full buffer at a time, which should be more realistic.)
WORKING SYSTEM
There exists currently a company called GPU Audio which claims to be processing audio plugins on the GPU effectively: https://www.gpu.audio/
To run anything audio-related on the GPU, they must also read from the GPU at least once per audio buffer; otherwise, how else could the audio be output? So if GPU Audio is processing anything on the GPU, there must be some way to do this.
I presume they are working with full buffers on the GPU like I describe. However, my current method is not fast enough to keep up. They must be using a faster method.
This study (from the Stack Overflow thread linked above) seems to suggest a data transfer should complete in 1.5 ms or so, which should be more than enough time. But I am clearly not getting anywhere near this performance.
QUESTION
Does anyone have any ideas for how this can be done? Is there an obvious problem with the OpenCL function above? Or can you suggest a known alternative method that can read from the GPU with no more than a few ms of latency, so we can keep up on a per-buffer basis?
Would CUDA perhaps offer faster methods, or could a better OpenCL function be written? I would prefer to stick with OpenCL. I presume there must be some way, as reading from a modern GPU 43 times a second should not be terribly unreasonable.
Thanks for any ideas.