3

10

I have an OpenGL buffer that I need to pass directly to ffmpeg for NVENC-based H264 encoding.

My current approach is to call glReadPixels to read the pixels out of the framebuffer and then pass that pointer to ffmpeg so that it can encode the frame into H264 packets for RTSP. However, this is wasteful: I have to copy the bytes out of GPU RAM into CPU RAM, only to copy them back to the GPU for encoding.

Coed answered 16/4, 2018 at 16:58 Comment(0)

26

If you look at the date of posting versus the date of this answer, you'll notice I spent a lot of time working on this. (It was my full-time job for the past 4 weeks.)

Since I had such a difficult time getting this to work, I will write up a short guide to hopefully help whoever finds this.

Outline

The basic flow I have is OGL framebuffer object color attachment (texture) → nvenc (NVIDIA encoder)

Things to note

1) The NVIDIA encoder can accept YUV or RGB type images.
2) FFMPEG 4.0 and earlier cannot pass RGB images to nvenc.
3) FFMPEG has since been updated to accept RGB as input, following the issues I raised.

There are a few different pieces to know about:
1) AVHWDeviceContext: think of this as ffmpeg's device abstraction layer.
2) AVHWFramesContext: think of this as ffmpeg's hardware frame abstraction layer.
3) cuMemcpy2D: the call needed to copy a cuda-mapped OGL texture into a cuda buffer created by ffmpeg.

Comprehensiveness

This guide is in addition to standard software encoding guidelines. This is NOT complete code, and should only be used in addition to the standard flow.

Code details

Setup

You will first need to get your GPU name. To do this I found some code (I cannot remember where I got it from) that makes a few cuda calls and returns the GPU name:

int getDeviceName(std::string& gpuName)
{
    // Query the cuda driver for the name of the GPU used for hardware encoding with ffmpeg.
    // ck() is an error-checking wrapper around cuda driver calls; its definition is not part of this snippet.
    int iGpu = 0; // ordinal of the GPU to use
    ck(cuInit(0));
    int nGpu = 0;
    ck(cuDeviceGetCount(&nGpu));
    if (iGpu < 0 || iGpu >= nGpu)
    {
        std::cout << "GPU ordinal out of range. Should be within [" << 0 << ", "
                  << nGpu - 1 << "]" << std::endl;
        return 1;
    }
    CUdevice cuDevice = 0;
    ck(cuDeviceGet(&cuDevice, iGpu));
    char szDeviceName[80];
    ck(cuDeviceGetName(szDeviceName, sizeof(szDeviceName), cuDevice));
    gpuName = szDeviceName;
    epLog::msg(epMSG_STATUS, "epVideoEncode:H264Encoder", "...using device \"%s\"", szDeviceName);

    return 0;
}
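
The ck(...) wrapper used above is not defined in the snippet. A minimal sketch of what such a helper could look like is shown here. FakeResult, checkResult, CK and the two fake calls are hypothetical names for illustration only; FakeResult stands in for CUresult so that the example compiles without the cuda headers.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for CUresult; 0 plays the role of the success code.
using FakeResult = int;
constexpr FakeResult kSuccess = 0;

// Throws with the failing expression, file and line when a call does not report success.
inline void checkResult(FakeResult res, const char* expr, const char* file, int line)
{
    assert(expr != nullptr && file != nullptr);
    if (res != kSuccess) {
        std::ostringstream msg;
        msg << file << ":" << line << ": " << expr << " failed with code " << res;
        throw std::runtime_error(msg.str());
    }
}

// Wraps a driver-style call so that every failure is reported where it happens.
#define CK(call) checkResult((call), #call, __FILE__, __LINE__)

// Hypothetical driver-style calls used only to exercise the macro.
inline FakeResult fakeCallOk() { return kSuccess; }
inline FakeResult fakeCallFail() { return 100; }

// Returns true when CK lets the successful call through and throws for the failing one.
inline bool demoCheck()
{
    CK(fakeCallOk());
    try {
        CK(fakeCallFail());
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

With the real driver API the same shape applies, with CUresult in place of FakeResult and CUDA_SUCCESS in place of kSuccess.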

Next you will need to set up your hwdevice and hwframe contexts:

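    getDeviceName(gpuName);
    // Note: for AV_HWDEVICE_TYPE_CUDA, ffmpeg interprets the device string as a device index (e.g. "0")
    ret = av_hwdevice_ctx_create(&m_avBufferRefDevice, AV_HWDEVICE_TYPE_CUDA, gpuName.c_str(), NULL, 0); // last argument is an int flags field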
    if (ret < 0) 
    {
        return -1;
    }

    //Example of casts needed to get down to the cuda context
    AVHWDeviceContext* hwDevContext = (AVHWDeviceContext*)(m_avBufferRefDevice->data);
    AVCUDADeviceContext* cudaDevCtx = (AVCUDADeviceContext*)(hwDevContext->hwctx);
    m_cuContext = &(cudaDevCtx->cuda_ctx);

    //Create the hwframe_context
    //  This is an abstraction of a cuda buffer for us. This enables us to, with one call, setup the cuda buffer and ready it for input
    m_avBufferRefFrame = av_hwframe_ctx_alloc(m_avBufferRefDevice);

    //Setup some values before initialization 
    AVHWFramesContext* frameCtxPtr = (AVHWFramesContext*)(m_avBufferRefFrame->data);
    frameCtxPtr->width = width;
    frameCtxPtr->height = height;
    frameCtxPtr->sw_format = AV_PIX_FMT_0BGR32; // There are only certain supported types here, we need to conform to these types
    frameCtxPtr->format = AV_PIX_FMT_CUDA;
    frameCtxPtr->device_ref = m_avBufferRefDevice;
    frameCtxPtr->device_ctx = (AVHWDeviceContext*)m_avBufferRefDevice->data;

    //Initialization - This must be done to actually allocate the cuda buffer.
    //  NOTE: This call will only work for our input format if the FFMPEG library is newer than version 4.0.
    ret = av_hwframe_ctx_init(m_avBufferRefFrame);
    if (ret < 0) {
        return -1;
    }

    //Cast the OGL texture/buffer to cuda ptr
    CUresult res;
    CUcontext oldCtx;
    m_inputTexture = texture;
    res = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
    res = cuCtxPushCurrent(*m_cuContext);
    res = cuGraphicsGLRegisterImage(&cuInpTexRes, m_inputTexture, GL_TEXTURE_2D, CU_GRAPHICS_REGISTER_FLAGS_READ_ONLY);
    res = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL

    //Assign some hardware accel specific data to AVCodecContext
    //  The codec context owns (and later frees) these references, so give it new ones via av_buffer_ref()
    c->hw_device_ctx = av_buffer_ref(m_avBufferRefDevice); //This must be done BEFORE avcodec_open2()
    c->pix_fmt = AV_PIX_FMT_CUDA; //Since this is a cuda buffer, although it's really opengl with a cuda ptr
    c->hw_frames_ctx = av_buffer_ref(m_avBufferRefFrame);
    c->codec_type = AVMEDIA_TYPE_VIDEO;
    c->sw_pix_fmt = AV_PIX_FMT_0BGR32;

    // Setup some cuda stuff for memcpy-ing later
    m_memCpyStruct.srcXInBytes = 0;
    m_memCpyStruct.srcY = 0;
    m_memCpyStruct.srcMemoryType = CUmemorytype::CU_MEMORYTYPE_ARRAY;

    m_memCpyStruct.dstXInBytes = 0;
    m_memCpyStruct.dstY = 0;
    m_memCpyStruct.dstMemoryType = CUmemorytype::CU_MEMORYTYPE_DEVICE;

Keep in mind, although there is a lot done above, the code shown is IN ADDITION to the standard software encoding code. Make sure to include all those calls/object initialization as well.
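
A note on the device string passed to av_hwdevice_ctx_create above: as far as I can tell from ffmpeg's cuda hwcontext code, the string is converted to a device index with strtol rather than matched against a GPU name. Assuming that is the case, a name string parses to 0 and selects the first GPU, which would explain why passing the name works on single-GPU machines. The helper parseDeviceIndex below is hypothetical and only models that parsing step:

```cpp
#include <cassert>
#include <cstdlib>

// Models parsing a device string as a numeric index.
// Base 0 lets strtol auto-detect decimal, octal or hex; a string with no leading digits yields 0.
inline long parseDeviceIndex(const char* device)
{
    assert(device != nullptr);
    return std::strtol(device, nullptr, 0);
}
```

For example, parseDeviceIndex("GeForce GTX 1080") gives 0 and parseDeviceIndex("1") gives 1, so on a multi-GPU machine it is probably safer to pass the ordinal as a string.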

Unlike the software version, all that is needed for the input AVFrame object is to get the buffer AFTER your alloc call:

// allocate RGB video frame buffer
    ret = av_hwframe_get_buffer(m_avBufferRefFrame, rgb_frame, 0);  // 0 is for flags, not used at the moment

Notice that it takes the hwframe_context as an argument; this is how it knows what device, size, format, etc. to allocate for on the GPU.

Call to encode each frame

Now we are set up and ready to encode. Before each encode we need to copy the frame from the texture to a cuda buffer. We do this by mapping a cuda array to the texture, then copying that array to a CUdeviceptr (which was allocated by the av_hwframe_get_buffer call above):

//Perform cuda mem copy for input buffer
CUresult cuRes;
CUarray mappedArray;
CUcontext oldCtx;

//Get context
cuRes = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
cuRes = cuCtxPushCurrent(*m_cuContext);

//Get Texture
cuRes = cuGraphicsResourceSetMapFlags(cuInpTexRes, CU_GRAPHICS_MAP_RESOURCE_FLAGS_READ_ONLY);
cuRes = cuGraphicsMapResources(1, &cuInpTexRes, 0);

//Map texture to cuda array
cuRes = cuGraphicsSubResourceGetMappedArray(&mappedArray, cuInpTexRes, 0, 0); // Nvidia says its good practice to remap each iteration as OGL can move things around

//Setup for memcopy
m_memCpyStruct.srcArray = mappedArray;
m_memCpyStruct.dstDevice = (CUdeviceptr)rgb_frame->data[0]; // Make sure to copy devptr as it could change, upon resize
m_memCpyStruct.dstPitch = rgb_frame->linesize[0];   // Linesize is generated by hwframe_context
m_memCpyStruct.WidthInBytes = rgb_frame->width * 4; //* 4 needed for each pixel
m_memCpyStruct.Height = rgb_frame->height;          //Vanilla height for frame

//Do memcpy
cuRes = cuMemcpy2D(&m_memCpyStruct);

//Release texture (after the copy: the mapped array is only valid while the resource is mapped)
cuRes = cuGraphicsUnmapResources(1, &cuInpTexRes, 0);

//release context
cuRes = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL

Now we can simply call send_frame and it all works!

        ret = avcodec_send_frame(c, rgb_frame);
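
The reason the copy above sets both WidthInBytes and dstPitch is that the destination rows may be padded: each row holds width * 4 bytes of pixel data, but consecutive rows start linesize[0] bytes apart. The following is a CPU-side sketch of that idea, not the cuda call itself. copy2D and copyIntoPadded are hypothetical names, and the source here is a plain packed buffer, whereas in the real code it is a cuda array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `height` rows of `widthInBytes` bytes each.
// Source rows start srcPitch bytes apart, destination rows start dstPitch bytes apart.
inline void copy2D(std::uint8_t* dst, std::size_t dstPitch,
                   const std::uint8_t* src, std::size_t srcPitch,
                   std::size_t widthInBytes, std::size_t height)
{
    assert(widthInBytes <= dstPitch && widthInBytes <= srcPitch);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthInBytes);
    }
}

// Builds a tightly packed 4-bytes-per-pixel test image and copies it into a
// zero-initialized destination whose rows are padded out to dstPitch bytes.
inline std::vector<std::uint8_t> copyIntoPadded(std::size_t width, std::size_t height,
                                                std::size_t dstPitch)
{
    const std::size_t widthInBytes = width * 4;
    std::vector<std::uint8_t> src(widthInBytes * height);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<std::uint8_t>(i % 251 + 1); // non-zero, so padding stays distinguishable
    }
    std::vector<std::uint8_t> dst(dstPitch * height, 0);
    copy2D(dst.data(), dstPitch, src.data(), widthInBytes, widthInBytes, height);
    return dst;
}
```

If WidthInBytes were set to the pitch instead of width * 4, the copy would read past the end of each source row, which is why the two values are kept separate.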

Note: I left most of my code out, since it is not for the public. I may have some details incorrect; this is how I was able to make sense of all the data I gathered over the past month, so feel free to correct anything that is wrong. Also, fun fact: during all this my computer crashed and I lost all my initial investigation (everything I didn't check into source control), which includes all the various example code I found around the internet. So if you see something and it's yours, please call it out. This can help others come to the conclusion that I came to.

Shoutout

Big shout out to BtbN at https://webchat.freenode.net/ #ffmpeg, I wouldn't have gotten any of this without their help.

Coed answered 14/5, 2018 at 22:40 Comment(2)

Thank you for taking the time to write out the process! Did you see any significant performance difference using this method? – Sada 10/8, 2019 at 21:19

Yes. The hardware method was about 10 times faster or so for our use. – Coed 11/8, 2019 at 22:4

0

First thing to check is that it may be "bad" but is it running fast enough anyway? It's always nice to be more efficient but if it works, don't break it.

If there really is a performance problem...

1 Use FFMPEG software encoding only, without hardware assistance. Then you'll only be copying from GPU to CPU once. (If the video encoder is on the GPU and you're sending packets out via RTSP, there's a second GPU to CPU after encoding.)

2 Look for an NVIDIA (I assume that's the GPU given you talk about nvenc) GL extension to texture formats and/or commands that will perform on GPU H264 encoding directly to OpenGL buffers.

Dniester answered 16/4, 2018 at 22:25 Comment(1)

Your setup worked very well, with a huge performance gain, until I updated to ffmpeg 5.1. Now the same code throws an exception in avcodec_open2 without any helpful message. – Sherlocke 25/10, 2022 at 12:31

-2

Manually setting the device_ref and/or device_ctx fields of the AVHWFramesContext is not necessary, as they are already set from the reference passed to av_hwframe_ctx_alloc(). More importantly, the way it is done here breaks the reference counting that AVBufferRef provides. New references to these buffers should be created with av_buffer_ref().
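
To illustrate the ownership rule, here is a small model that uses std::shared_ptr in place of AVBufferRef. This is only an analogy: DeviceBuffer, BufferRef, Consumer and countWhileShared are hypothetical names, and copying the shared_ptr plays the role of av_buffer_ref().

```cpp
#include <cassert>
#include <memory>

// Stand-in for the underlying reference-counted buffer.
struct DeviceBuffer { int id = 0; };

// std::shared_ptr plays the role of AVBufferRef in this model.
using BufferRef = std::shared_ptr<DeviceBuffer>;

// Stand-in for an object (such as a codec context) that owns its reference
// and releases it when it is destroyed.
struct Consumer {
    BufferRef hwDevice;
};

// Returns the reference count seen while both the caller and a consumer hold the buffer.
inline long countWhileShared(const BufferRef& mine)
{
    assert(mine != nullptr);
    Consumer c;
    c.hwDevice = mine;        // analogous to av_buffer_ref(): a NEW reference, count goes up
    return mine.use_count();  // the consumer's reference is dropped again when c goes out of scope
}
```

Because the consumer holds its own reference, it can release it without invalidating the caller's copy. Assigning the same AVBufferRef pointer to two owners skips the increment, so the buffer may be freed while one owner still uses it.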

Lorenza answered 8/5 at 13:13 Comment(0)