If you compare the date of the question with the date of this answer, you'll notice I spent a lot of time working on this. (It was my full-time job for the past 4 weeks.)
Since I had such a difficult time getting this to work, I will write up a short guide to hopefully help out whoever finds this.
Outline
The basic flow I have is: OpenGL (OGL) framebuffer object color attachment (texture) → nvenc (NVIDIA encoder)
Things to note
Some things to note:
1) The NVIDIA encoder can accept YUV or RGB type images.
2) FFmpeg 4.0 and earlier cannot pass RGB images to nvenc.
3) FFmpeg was later updated to accept RGB as input, following the issues I raised.
There are a few different things to know about:
1) AVHWDeviceContext - Think of this as FFmpeg's device abstraction layer.
2) AVHWFramesContext - Think of this as FFmpeg's hardware frame abstraction layer.
3) cuMemcpy2D - The required method to copy a CUDA-mapped OGL texture into a CUDA buffer created by FFmpeg.
Comprehensiveness
This guide is in addition to standard software encoding guidelines. This is NOT complete code, and should only be used in addition to the standard flow.
Code details
Setup
You will first need to get your GPU name. To do this I found some code (I cannot remember where I got it from) that makes a few CUDA calls and returns the GPU name:
// Queries the name of GPU 0 through the CUDA driver API.
// Returns 0 on success, 1 if the GPU ordinal is out of range.
// ck() wraps each CUDA call in an error check; its definition is not part of this snippet.
int getDeviceName(std::string& gpuName)
{
    int iGpu = 0;
    ck(cuInit(0));
    int nGpu = 0;
    ck(cuDeviceGetCount(&nGpu));
    if (iGpu < 0 || iGpu >= nGpu)
    {
        std::cout << "GPU ordinal out of range. Should be within [" << 0 << ", "
                  << nGpu - 1 << "]" << std::endl;
        return 1;
    }
    CUdevice cuDevice = 0;
    ck(cuDeviceGet(&cuDevice, iGpu));
    char szDeviceName[80];
    ck(cuDeviceGetName(szDeviceName, sizeof(szDeviceName), cuDevice));
    gpuName = szDeviceName;
    // epLog is not part of CUDA or FFmpeg; substitute your own logging call
    epLog::msg(epMSG_STATUS, "epVideoEncode:H264Encoder", "...using device \"%s\"", szDeviceName);
    return 0;
}
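The ck(...) wrapper used above is left undefined in the snippet. A minimal sketch of what such a helper could look like follows. It assumes the wrapper only needs to stop on a non-zero result code and report the failing expression. The names ResultCode and checkResult are hypothetical, and a plain int stands in for CUresult so that the example compiles without the CUDA headers.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a driver-API result code. With the real CUDA
// headers this would be CUresult, where success is CUDA_SUCCESS (0).
using ResultCode = int;

// Throws if `result` is non-zero, recording the failing expression and
// where it was called from.
inline void checkResult(ResultCode result, const char* expr, const char* file, int line)
{
    if (result != 0)
    {
        std::ostringstream msg;
        msg << expr << " failed with code " << result << " at " << file << ":" << line;
        throw std::runtime_error(msg.str());
    }
}

// Wraps a call so that the expression text and location are captured.
#define ck(call) checkResult((call), #call, __FILE__, __LINE__)
```

Throwing is only one possible design; a version that logs and returns a bool would work just as well with the statement-style usage shown in getDeviceName.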
Next you will need to set up your hwdevice and hwframe contexts:
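getDeviceName(gpuName);
// The last two arguments are an options dictionary (none) and flags (0).
// NOTE: as far as I can tell, FFmpeg's CUDA backend parses the device string as a
// device index (e.g. "0"), so a device name will most likely end up selecting device 0.
ret = av_hwdevice_ctx_create(&m_avBufferRefDevice, AV_HWDEVICE_TYPE_CUDA, gpuName.c_str(), NULL, 0);
if (ret < 0)
{
    return -1;
}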
//Example of casts needed to get down to the cuda context
AVHWDeviceContext* hwDevContext = (AVHWDeviceContext*)(m_avBufferRefDevice->data);
AVCUDADeviceContext* cudaDevCtx = (AVCUDADeviceContext*)(hwDevContext->hwctx);
m_cuContext = &(cudaDevCtx->cuda_ctx);
//Create the hwframe_context
// This is an abstraction of a cuda buffer for us. This enables us to, with one call, set up the cuda buffer and ready it for input
m_avBufferRefFrame = av_hwframe_ctx_alloc(m_avBufferRefDevice);
if (!m_avBufferRefFrame)
{
    return -1;
}
//Set up some values before initialization
AVHWFramesContext* frameCtxPtr = (AVHWFramesContext*)(m_avBufferRefFrame->data);
frameCtxPtr->width = width;
frameCtxPtr->height = height;
frameCtxPtr->sw_format = AV_PIX_FMT_0BGR32; // Only certain formats are supported here, so we need to conform to one of them
frameCtxPtr->format = AV_PIX_FMT_CUDA;
// device_ref and device_ctx are already filled in by av_hwframe_ctx_alloc(), so they are not reassigned here
//Initialization - This must be done to actually allocate the cuda buffer.
// NOTE: This call will only work for our input format if the FFmpeg library is newer than 4.0.
ret = av_hwframe_ctx_init(m_avBufferRefFrame);
if (ret < 0)
{
    return -1;
}
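//Register the OGL texture with cuda so it can later be mapped to a cuda array
// cuInpTexRes is a CUgraphicsResource that has to outlive this setup code (e.g. a member variable)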
CUresult res;
CUcontext oldCtx;
m_inputTexture = texture;
res = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
res = cuCtxPushCurrent(*m_cuContext);
res = cuGraphicsGLRegisterImage(&cuInpTexRes, m_inputTexture, GL_TEXTURE_2D, CU_GRAPHICS_REGISTER_FLAGS_READ_ONLY);
res = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
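//Assign some hardware accel specific data to AVCodecContext
// The codec context unrefs these when it is freed, so give it its own references via av_buffer_ref()
// and keep (and later av_buffer_unref) ours separately
c->hw_device_ctx = av_buffer_ref(m_avBufferRefDevice); //This must be done BEFORE avcodec_open2()
c->pix_fmt = AV_PIX_FMT_CUDA; //Since this is a cuda buffer, although it's really opengl with a cuda ptr
c->hw_frames_ctx = av_buffer_ref(m_avBufferRefFrame);
c->codec_type = AVMEDIA_TYPE_VIDEO;
c->sw_pix_fmt = AV_PIX_FMT_0BGR32;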
// Setup some cuda stuff for memcpy-ing later
m_memCpyStruct.srcXInBytes = 0;
m_memCpyStruct.srcY = 0;
m_memCpyStruct.srcMemoryType = CUmemorytype::CU_MEMORYTYPE_ARRAY;
m_memCpyStruct.dstXInBytes = 0;
m_memCpyStruct.dstY = 0;
m_memCpyStruct.dstMemoryType = CUmemorytype::CU_MEMORYTYPE_DEVICE;
Keep in mind, although there is a lot done above, the code shown is IN ADDITION to the standard software encoding code. Make sure to include all those calls/object initialization as well.
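The setup code wraps its CUDA calls in a cuCtxPushCurrent / cuCtxPopCurrent pair, and the per-frame code does the same. One way to make sure the pop always runs, even on an early return, is a small RAII guard. The sketch below is an assumption and not part of the original code: ScopedCall is a hypothetical helper, and it takes generic callables so that it compiles without the CUDA headers.

```cpp
#include <functional>
#include <utility>

// Runs `enter` on construction and `leave` on destruction, so the matching
// "pop" happens on every exit path once construction has succeeded.
class ScopedCall
{
public:
    ScopedCall(const std::function<void()>& enter, std::function<void()> leave)
        : m_leave(std::move(leave))
    {
        enter();
    }

    ~ScopedCall()
    {
        if (m_leave)
        {
            m_leave();
        }
    }

    // Non-copyable: exactly one "leave" per "enter".
    ScopedCall(const ScopedCall&) = delete;
    ScopedCall& operator=(const ScopedCall&) = delete;

private:
    std::function<void()> m_leave;
};

// With the CUDA driver API this might be used as (not compiled here):
//
//   ScopedCall guard(
//       [&] { cuCtxPushCurrent(*m_cuContext); },
//       [&] { CUcontext popped; cuCtxPopCurrent(&popped); });
//   // ... CUDA calls that need the context ...
```

Error handling for the push itself is left out; in real code the result of cuCtxPushCurrent should be checked before relying on the guard.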
Unlike the software version, all that is needed for the input AVFrame object is to get the buffer after your av_frame_alloc() call:
// allocate RGB video frame buffer
ret = av_hwframe_get_buffer(m_avBufferRefFrame, rgb_frame, 0); // 0 is for flags, not used at the moment
Notice it takes the hwframe_context as an argument; this is how it knows what device, size, format, etc. to allocate for on the GPU.
Call to encode each frame
Now we are set up and ready to encode. Before each encode we need to copy the frame from the texture to a CUDA buffer. We do this by mapping the texture to a CUDA array, then copying that array to a CUdeviceptr (which was allocated by the av_hwframe_get_buffer call above):
//Perform cuda mem copy for input buffer
CUresult cuRes;
CUarray mappedArray;
CUcontext oldCtx;
//Get context
cuRes = cuCtxPopCurrent(&oldCtx); // THIS IS ALLOWED TO FAIL
cuRes = cuCtxPushCurrent(*m_cuContext);
//Get Texture
cuRes = cuGraphicsResourceSetMapFlags(cuInpTexRes, CU_GRAPHICS_MAP_RESOURCE_FLAGS_READ_ONLY);
cuRes = cuGraphicsMapResources(1, &cuInpTexRes, 0);
//Map texture to cuda array
cuRes = cuGraphicsSubResourceGetMappedArray(&mappedArray, cuInpTexRes, 0, 0); // Nvidia says it's good practice to remap each iteration as OGL can move things around
//Setup for memcopy
m_memCpyStruct.srcArray = mappedArray;
m_memCpyStruct.dstDevice = (CUdeviceptr)rgb_frame->data[0]; // Make sure to copy devptr as it could change upon resize
m_memCpyStruct.dstPitch = rgb_frame->linesize[0]; // Linesize is generated by hwframe_context
m_memCpyStruct.WidthInBytes = rgb_frame->width * 4; // 4 bytes per pixel
m_memCpyStruct.Height = rgb_frame->height; //Vanilla height for frame
//Do memcpy while the texture is still mapped; the array is only valid between map and unmap
cuRes = cuMemcpy2D(&m_memCpyStruct);
//Release texture
cuRes = cuGraphicsUnmapResources(1, &cuInpTexRes, 0);
//release context
cuRes = cuCtxPopCurrent(&oldCtx);
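To make the roles of WidthInBytes, dstPitch and Height in the copy above more concrete, here is a CPU-only sketch of a pitched 2D copy. It is an illustration under assumptions, not what cuMemcpy2D does internally: copy2D is a hypothetical function, both buffers live in host memory, and the source is given a pitch of its own, whereas the real source above is a CUarray.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `height` rows of `widthInBytes` bytes each. Source and destination
// advance by their own pitch per row, so any padding at the end of a
// destination row is skipped rather than overwritten.
void copy2D(std::uint8_t* dst, std::size_t dstPitch,
            const std::uint8_t* src, std::size_t srcPitch,
            std::size_t widthInBytes, std::size_t height)
{
    // A row must fit inside its pitch, otherwise rows would overlap.
    assert(dstPitch >= widthInBytes && srcPitch >= widthInBytes);
    for (std::size_t row = 0; row < height; ++row)
    {
        std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthInBytes);
    }
}
```

For a frame that is 3 pixels wide at 4 bytes per pixel, widthInBytes is 12, while the destination pitch may be larger (16, say) because the allocator can pad each row. That is why the code above takes the pitch from rgb_frame->linesize[0] instead of computing it from the width.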
Now we can simply call send_frame and it all works!
ret = avcodec_send_frame(c, rgb_frame);
Note: I left most of my code out, since it is not for the public. I may have some details incorrect; this is how I was able to make sense of all the data I gathered over the past month, so feel free to correct anything that is wrong. Also, fun fact: during all this my computer crashed and I lost all my initial investigation (everything I didn't check into source control), which includes all the various example code I found around the internet. So if you see something and it's yours, please call it out. This can help others come to the conclusion that I came to.
Shoutout
Big shout out to BtbN at https://webchat.freenode.net/ #ffmpeg, I wouldn't have gotten any of this without their help.