glGetBufferSubData and glMapBufferRange for GL_SHADER_STORAGE_BUFFER very slow on NVIDIA GTX960M

I've been having some issues with transfering a GPU buffer into CPU for performing sorting operations. The buffer is a GL_SHADER_STORAGE_BUFFER composed of 300.000 float values. The transfer operation with glGetBufferSubData is taking around 10ms, and with glMapBufferRange, it takes more than 100 ms.

The code Im using is the following:

std::vector<GLfloat> viewRow;
unsigned int viewRowBuffer = -1;
int length = -1;

void bindRowBuffer(unsigned int buffer){
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, buffer);
}

void initRowBuffer(unsigned int &buffer, std::vector<GLfloat> &row, int lengthIn){
    // Generate and initialize buffer
    length = lengthIn;
    row.resize(length);
    memset(&row[0], 0, length*sizeof(float));
    glGenBuffers(1, &buffer);
    bindRowBuffer(buffer);
    glBufferStorage(GL_SHADER_STORAGE_BUFFER, row.size() * sizeof(float), &row[0], GL_DYNAMIC_STORAGE_BIT | GL_MAP_READ_BIT | GL_MAP_WRITE_BIT);

    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
}

void cleanRowBuffer(unsigned int buffer) {
    float zero = 0.0;
    glClearNamedBufferData(buffer, GL_R32F, GL_RED, GL_FLOAT, &zero);
}

void readGPUbuffer(unsigned int buffer, std::vector<GLfloat> &row) {
    glGetBufferSubData(GL_SHADER_STORAGE_BUFFER,0,length *sizeof(float),&row[0]);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
}

void readGPUMapBuffer(unsigned int buffer, std::vector<GLfloat> &row) {
    float* data = (float*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, length*sizeof(float), GL_MAP_READ_BIT); glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
     memcpy(&row[0], data, length *sizeof(float));
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}

The main is doing:

    bindRowBuffer(viewRowBuffer);
    cleanRowBuffer(viewRowBuffer);
    countPixs.bind();
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, gPatch);
    countPixs.setInt("gPatch", 0);
    countPixs.run(SCR_WIDTH/8, SCR_HEIGHT/8, 1);
    countPixs.unbind();
    readGPUbuffer(viewRowBuffer, viewRow);

Where countPixs is a compute shader, but I'm possitive the problem is not there because if I comment the run command, the read takes exactly the same amount of time.

The weird thing is that if I execute a getbuffer of only 1 float:

glGetBufferSubData(GL_SHADER_STORAGE_BUFFER,0, 1 *sizeof(float),&row[0]);

It takes exactly the same time... so I'm guessing there is something wrong all-the-way... maybe related to the GL_SHADER_STORAGE_BUFFER?

This is likely to be an GPU-CPU synchronization/round trip caused delay. I.e., once you map your buffer, the previous GL command(s) that touched the buffer have to be completed immediately, causing a pipeline stall. Note that drivers are lazy: it is very probable GL commands have not even started executing yet.

If you can: glBufferStorage(..., GL_MAP_PERSISTENT_BIT) and map the buffer persistently. This avoids completely re-mapping and allocating any GPU memory, and you can keep the mapped pointer over draw calls with some caveats:

You likely also need GPU fences to detect/wait when the data is actually available from GPU. (Unless you like reading garbage.)
The mapped buffer can't be resized. (since you already use glBufferStorage() you are ok)
It is probably good idea to combine GL_MAP_PERSISTENT_BIT with GL_MAP_COHERENT_BIT

After reading GL 4.5 docs bit more I found out that glFenceSync is mandatory in order to guarantee the data has arrived from the GPU, even with GL_MAP_COHERENT_BIT:

If GL_MAP_COHERENT_BIT is set and the server does a write, the app must call glFenceSync with GL_SYNC_GPU_COMMANDS_COMPLETE (or glFinish). Then the CPU will see the writes after the sync is complete.

Recommended topics

Hot tags