UPDATE
I found an error in my code: my render function was still set to run in sub-blocks (a setting from years ago I had forgotten about), so it was calling the GPU read function far more often than I thought. Sorry.
ISSUE
I have recently tried adding OpenCL to an audio synthesiser that would benefit from GPU processing (due to highly parallelized math in the processing). However, I have found that even just trying to read from the GPU once per audio buffer (not even once per sample) is crippling performance and not usable.
CURRENT METHOD
I am using the OpenCL Wrapper project here: https://github.com/ProjectPhysX/OpenCL-Wrapper
Simply creating a small Memory<float> test object of 20-125 floats with it once on project initialization, and then calling test.read_from_device() once per audio buffer while doing nothing else, causes stuttering in the audio.
The OpenCL Wrapper function for this is:
inline void read_from_device(const bool blocking=true, const vector<Event>* event_waitlist=nullptr, Event* event_returned=nullptr) {
	// copies the device buffer back into the host buffer; blocks by default until the transfer completes
	if(host_buffer_exists&&device_buffer_exists) cl_queue.enqueueReadBuffer(device_buffer, blocking, 0ull, capacity(), (void*)host_buffer, event_waitlist, event_returned);
}
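For context, the whole test boils down to roughly the following (a minimal sketch based on the wrapper's README usage pattern; the device selection, the 125-float size, and the process_buffer() callback are placeholders for illustration, not my actual synth code):

#include "opencl.hpp" // ProjectPhysX OpenCL-Wrapper

// done once at synth initialization
Device device(select_device_with_most_flops()); // pick the fastest available OpenCL device
Memory<float> test(device, 125u);               // small host+device buffer of 20-125 floats

// called once per audio buffer, i.e. roughly every 23 ms at 44.1 kHz / 1024 samples
void process_buffer(float* output, const int num_samples) {
	test.read_from_device(); // blocking copy of the whole (tiny) buffer from GPU to host
	// nothing else is done with test[] here, yet the audio already stutters
	(void)output; (void)num_samples;
}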
REQUIREMENTS
Audio typically runs at 44100 samples per second, and buffer sizes up to around 1024 samples per buffer are acceptable. Thus if we process one full buffer at a time on the GPU, we need to read from the GPU smoothly at least 43 times per second, i.e. once every ~23 ms.
43 reads per second is well below the 60-120 fps a GPU typically renders at, so I think this should not be too unrealistic.
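To make the timing budget concrete, here is the arithmetic behind those numbers (a trivial standalone calculation; 44100 Hz and 1024-sample buffers as stated above):

#include <cstdio>

int main() {
	const double sample_rate = 44100.0; // samples per second
	const int    buffer_size = 1024;    // samples per audio buffer
	const double buffers_per_second = sample_rate / buffer_size;          // ~43.07 reads/s
	const double deadline_ms        = 1000.0 * buffer_size / sample_rate; // ~23.2 ms per buffer
	printf("%.2f buffers/s, %.2f ms per buffer\n", buffers_per_second, deadline_ms);
	return 0;
}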
OTHER TESTS
I have read this thread, which suggests I am not alone with this problem: GPU audio processing
In particular, there is this reply:
Sorry, going to disappoint you straight away. I have tried using NVidia CUDA (the native library) for audio processing, using neural networks. It's what my company does for a living, so we're pretty competent. We found that the typical NVidia card has too much latency. They're fast, that's not the problem, but that means they can do many million operations in a millisecond. However, the DMA engine feeding data to the card typically has latencies that are many milliseconds. Not so bad for video, bad for audio - video often is 60 Hz whereas audio can be 48000 Hz.
(Note that he is talking about sending every sample back and forth to the GPU individually, rather than processing one full buffer at a time, which should be more realistic.)
WORKING SYSTEM
There exists currently a company called GPU Audio which claims to be processing audio plugins on the GPU effectively: https://www.gpu.audio/
To run anything audio-related on the GPU, they must also read from the GPU at least once per audio buffer; otherwise, how else could the audio be output? So if GPU Audio is processing anything on the GPU, there must be some way to do this.
I presume they are working with full buffers on the GPU like I describe. However, my current method is not fast enough to keep up. They must be using a faster method.
This study (from the Stack Overflow thread linked above) seems to suggest a data transfer should complete in 1.5 ms or so, which should be more than enough time. But I am clearly not getting anywhere near this performance.
QUESTION
Does anyone have any ideas for how this can be done? Is there an obvious problem with the OpenCL function above? Or can you suggest a known alternative method that can read from the GPU with no more than a few ms of latency, so we can keep up on a per-buffer basis?
Would CUDA perhaps offer faster methods, or could a better OpenCL function be written? I would prefer to stick with OpenCL. I presume there must be some way, as reading from a modern GPU 43 times a second should not be terribly unreasonable.
Thanks for any ideas.