CUDA和FFMPEG硬件解码视频流
来源:互联网 发布:iphone7 二次曝光软件 编辑:程序博客网 时间:2024/06/04 23:00
本文主要讲述了通过FFMPEG获取H264格式的RTSP流数据(也可以获取本地视频文件),并通过CUDA进行硬件解码的过程。其他博客给出的教程要么只是给出了伪代码,非常的模糊,要么是基于D3D进行显示,使得给出的源码非常复杂,而无法看出CUDA解码的核心框架,而本文将其他非核心部分剥离出去,视频播放部分通过opencv调用cv::mat显示。
当然本博客的工作也参考了其他博客的内容,CSDN上原创的东西比较难找,大部分都是转载的,所以大家还是积极的贡献力量吧。
本文将分为以下两个部分:
1.CUDA硬件解码核心原理和框架解释;
2.解码核心功能代码的实现
CUDA硬件解码核心原理和框架
做过FFMPEG解码开发的同学肯定都对以下函数比较熟悉avcodec_decode_video2(),该函数实现可以解码从视频流中获取的数据包AVPACKET转化为AV_FRAME,AV_FRAME中包含了解码后的数据。通过CUDA硬件进行解码,最核心的思想就通过回调函数形式来调用CUDA硬件解码接口,对该函数替换,将CPU解码功能转移到GPU中去。
博客给出了一个很好的基础性框架,本文也是借鉴了该博客,该博客中修改了原始的VideoSource,将视频流的获取改为了ffmpeg,而CUDA解码部分框架如下图
1.1 VideoSource
VideoSourceData中包含了CUvideoparser和FrameQueue,通过上图可以看出,CUvideoparser是在VideoDecoder基础上实现了接口的封装,而VideoSource则是通过CUvideoparser进行解码。FrameQueue是存储硬件解码后图像的队列,注意硬件解码完的图像是存放在GPU显存里面了,而VideoDecoder中函数mapFrame,可完成从显存到内存的映射。
1.2 VideoParser
VideoParser中最重要的是三个回调函数,static int CUDAAPI HandleVideoSequence(void *pUserData, CUVIDEOFORMAT *pFormat), HandlePictureDecode(void *pUserData, CUVIDPICPARAMS *pPicParams),HandlePictureDisplay(void *pUserData, CUVIDPARSERDISPINFO *pPicParams),实现对视频格式变换、视频解码、解码后显示等处理功能。HandleVideoSequence主要负责视频格式进行校验,没有实现其他功能,解码函数HandlePictureDecode调用的就是VideoDecoder的解码函数(CUDA的接口),显示函数HandlePictureDisplay完成了解码后GPU图像进入FrameQueue。
1.3 VideoDecoder
该类是最核心的硬件解码功能类,CUVIDDECODECREATEINFO oVideoDecodeCreateInfo_是创建解码信息结构体,CUvideodecoder oDecoder_是最内核的CUDA硬件解码器,VideoParser的解码功能实际上是在CUvideodecoder解码内核上封装实现的(层层封装导致源码有点复杂,所以想看懂实现机制需要有点耐心)。
2 核心解码模块的实现
示例中NvDecodeD3D9.cpp实现了D3D环境的创建,CUDA模块的初始化,其中取视频帧图像显示的函数如下,该函数实现了从解码图像队列取出图像(实际上是显存指针),完成格式转换(NV12到ARGB),最后映射到D3D的Texture进行显示等功能,代码中我给出了关键部位的解释。
bool copyDecodedFrameToTexture(unsigned int &nRepeats, int bUseInterop, int *pbIsProgressive){ CUVIDPARSERDISPINFO oDisplayInfo; if (g_pFrameQueue->dequeue(&oDisplayInfo)) { CCtxAutoLock lck(g_CtxLock); // Push the current CUDA context (only if we are using CUDA decoding path) CUresult result = cuCtxPushCurrent(g_oContext); //创建解码图像的显存指针,注意存储的是NV12格式的 CUdeviceptr pDecodedFrame[3] = { 0, 0, 0 }; //用于解码图像后进行格式转换 CUdeviceptr pInteropFrame[3] = { 0, 0, 0 }; *pbIsProgressive = oDisplayInfo.progressive_frame; g_bIsProgressive = oDisplayInfo.progressive_frame ? true : false; int num_fields = 1; if (g_bUseVsync) { num_fields = std::min(2 + oDisplayInfo.repeat_first_field, 3); } nRepeats = num_fields; CUVIDPROCPARAMS oVideoProcessingParameters; memset(&oVideoProcessingParameters, 0, sizeof(CUVIDPROCPARAMS)); oVideoProcessingParameters.progressive_frame = oDisplayInfo.progressive_frame; oVideoProcessingParameters.top_field_first = oDisplayInfo.top_field_first; oVideoProcessingParameters.unpaired_field = (oDisplayInfo.repeat_first_field < 0); for (int active_field = 0; active_field < num_fields; active_field++) { unsigned int nDecodedPitch = 0; unsigned int nWidth = 0; unsigned int nHeight = 0; oVideoProcessingParameters.second_field = active_field; // map decoded video frame to CUDA surfae // 调用Videodecoder中映射功能,找到解码后图像的显存地址,并得到Pitch关键参数 if (g_pVideoDecoder->mapFrame(oDisplayInfo.picture_index, &pDecodedFrame[active_field], &nDecodedPitch, &oVideoProcessingParameters) != CUDA_SUCCESS) { // release the frame, so it can be re-used in decoder g_pFrameQueue->releaseFrame(&oDisplayInfo); // Detach from the Current thread checkCudaErrors(cuCtxPopCurrent(NULL)); return false; } nWidth = g_pVideoDecoder->targetWidth(); nHeight = g_pVideoDecoder->targetHeight(); // map DirectX texture to CUDA surface size_t nTexturePitch = 0; // If we are Encoding and this is the 1st Frame, we make sure we allocate system memory for readbacks if (g_bReadback && g_bFirstFrame && g_ReadbackSID) { CUresult result; checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[0], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2))); checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[1], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2))); checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[2], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2))); checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[3], (nDecodedPitch * nHeight + nDecodedPitch*nHeight/2))); checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[4], (nDecodedPitch * nHeight + nDecodedPitch*nHeight / 2))); checkCudaErrors(result = cuMemAllocHost((void **)&g_pFrameYUV[5], (nDecodedPitch * nHeight + nDecodedPitch*nHeight / 2))); g_bFirstFrame = false; if (result != CUDA_SUCCESS) { printf("cuMemAllocHost returned %d\n", (int)result); checkCudaErrors(result); } } // If streams are enabled, we can perform the readback to the host while the kernel is executing if (g_bReadback && g_ReadbackSID) { CUresult result = cuMemcpyDtoHAsync(g_pFrameYUV[active_field], pDecodedFrame[active_field], (nDecodedPitch * nHeight * 3 / 2), g_ReadbackSID); if (result != CUDA_SUCCESS) { printf("cuMemAllocHost returned %d\n", (int)result); checkCudaErrors(result); } }#if ENABLE_DEBUG_OUT printf("%s = %02d, PicIndex = %02d, OutputPTS = %08d\n", (oDisplayInfo.progressive_frame ? "Frame" : "Field"), g_DecodeFrameCount, oDisplayInfo.picture_index, oDisplayInfo.timestamp);#endif if (g_pImageDX) { // map the texture surface g_pImageDX->map(&pInteropFrame[active_field], &nTexturePitch, active_field); } else { pInteropFrame[active_field] = g_pInteropFrame[active_field]; nTexturePitch = g_pVideoDecoder->targetWidth() * 2; } // perform post processing on the CUDA surface (performs colors space conversion and post processing) // comment this out if we inclue the line of code seen above //调用CUDA功能模块,完成从NV12格式到ARGB格式的转换,该功能模块比较复杂,后面我将给出一个简单的实现方式 cudaPostProcessFrame(&pDecodedFrame[active_field], nDecodedPitch, &pInteropFrame[active_field], nTexturePitch, g_pCudaModule->getModule(), g_kernelNV12toARGB, g_KernelSID); if (g_pImageDX) { // unmap the texture surface g_pImageDX->unmap(active_field); } // unmap video frame // unmapFrame() synchronizes with the VideoDecode API (ensures the frame has finished decoding) g_pVideoDecoder->unmapFrame(pDecodedFrame[active_field]); g_DecodeFrameCount++; if (g_bWriteFile) { checkCudaErrors(cuStreamSynchronize(g_ReadbackSID)); SaveFrameAsYUV(g_pFrameYUV[active_field + 3], g_pFrameYUV[active_field], nWidth, nHeight, nDecodedPitch); } } // Detach from the Current thread checkCudaErrors(cuCtxPopCurrent(NULL)); // release the frame, so it can be re-used in decoder g_pFrameQueue->releaseFrame(&oDisplayInfo);}else{ // Frame Queue has no frames, we don't compute FPS until we start return false;}// check if decoding has come to an end.// if yes, signal the app to shut down.if (!g_pVideoSource->isStarted() && g_pFrameQueue->isEndOfDecode() && g_pFrameQueue->isEmpty()){ // Let's free the Frame Data if (g_ReadbackSID && g_pFrameYUV) { cuMemFreeHost((void *)g_pFrameYUV[0]); cuMemFreeHost((void *)g_pFrameYUV[1]); cuMemFreeHost((void *)g_pFrameYUV[2]); cuMemFreeHost((void *)g_pFrameYUV[3]); cuMemFreeHost((void *)g_pFrameYUV[4]); cuMemFreeHost((void *)g_pFrameYUV[5]); g_pFrameYUV[0] = NULL; g_pFrameYUV[1] = NULL; g_pFrameYUV[2] = NULL; g_pFrameYUV[3] = NULL; g_pFrameYUV[4] = NULL; g_pFrameYUV[5] = NULL; } // Let's just stop, and allow the user to quit, so they can at least see the results g_pVideoSource->stop(); // If we want to loop reload the video file and restart if (g_bLoop && !g_bAutoQuit) { HRESULT hr = reinitCudaResources(); if (SUCCEEDED(hr)) { g_FrameCount = 0; g_DecodeFrameCount = 0; g_pVideoSource->start(); } } if (g_bAutoQuit) { g_bDone = true; }}return true;}
以上功能模块与D3D掺和在一起,难以找到解码后取图像数据功能的核心模块,下面我给出基于opencv Mat的取图像数据方法,代码如下:
bool GetGpuDecodeFrame(shared_ptr ptr_video_stream, Mat &frame)
{
CUVIDPARSERDISPINFO oDisplayInfo;
if (ptr_video_stream->p_cuda_frame_queue){ if (ptr_video_stream->p_cuda_frame_queue->FrameNumInQueue()>0 && ptr_video_stream->p_cuda_frame_queue->dequeue(&oDisplayInfo)) { CCtxAutoLock lck(cuvideo_ctx_lock_); CUresult result = cuCtxPushCurrent(cuda_context_); CUdeviceptr pDecodedFrame = 0; int num_fields = 1; CUVIDPROCPARAMS oVideoProcessingParameters; memset(&oVideoProcessingParameters, 0, sizeof(CUVIDPROCPARAMS)); oVideoProcessingParameters.progressive_frame = oDisplayInfo.progressive_frame; oVideoProcessingParameters.top_field_first = oDisplayInfo.top_field_first; oVideoProcessingParameters.unpaired_field = (oDisplayInfo.repeat_first_field < 0); oVideoProcessingParameters.second_field = 0; unsigned int nDecodedPitch = 0; unsigned int nWidth = 0; unsigned int nHeight = 0; //找到图像数据GPU显存地址 if (ptr_video_stream->p_cuda_video_decoder->mapFrame(oDisplayInfo.picture_index, &pDecodedFrame, &nDecodedPitch, &oVideoProcessingParameters) != CUDA_SUCCESS) { // release the frame, so it can be re-used in decoder ptr_video_stream->p_cuda_frame_queue->releaseFrame(&oDisplayInfo); // Detach from the Current thread checkCudaErrors(cuCtxPopCurrent(NULL)); return false; } nWidth = ptr_video_stream->p_cuda_video_decoder->targetWidth(); nHeight = ptr_video_stream->p_cuda_video_decoder->targetHeight(); Mat raw_frame = Mat::zeros(cvSize(nWidth, nHeight), CV_8UC3); //直接对显存图像数据进行NV12到RGB格式的转换,并将转换后的数据拷贝到内存 VasGpuBoost::ColorConvert::Get()->ConvertD2HYUV422pToRGB24((uchar*)pDecodedFrame, raw_frame.data, nWidth, nHeight, nDecodedPitch); resize(raw_frame, frame, cvSize(ptr_video_stream->decode_param_.dst_width(), ptr_video_stream->decode_param_.dst_height())); raw_frame.release(); ptr_video_stream->p_cuda_video_decoder->unmapFrame(pDecodedFrame); checkCudaErrors(cuCtxPopCurrent(NULL)); ptr_video_stream->p_cuda_frame_queue->releaseFrame(&oDisplayInfo); pts = 0; No = 0; return true; } else return false;}else return false;}
解码用到的cuda核心函数如下,需要强调的Ptich的大小并不是图像宽度,而是解码后图像存放数据行的宽度,通常情况下要比图像宽度要大,实际上格式转换过程参考了博客。常见图像格式转换
__global__ void DevYuv420iToRgb(const uchar* yuv_data, uchar *rgb_data, const int width, const int height, const int pitch, const uchar *table_r, const uchar *table_g, const uchar *table_b){ int i = threadIdx.x + blockIdx.x * blockDim.x; int64 size = pitch*height; int64 compute_size = width*height; int x, y; x = i % width; y = i / width; CUDA_KERNEL_LOOP(i, compute_size) { int y_offset = pitch * y + x; int nv_index = y / 2 * pitch + x - x % 2; int v_offset = size + nv_index; int u_offset = v_offset + 1; int Y = *(yuv_data + y_offset); int U = *(yuv_data + u_offset); int V = *(yuv_data + v_offset); *(rgb_data + 3 * i) = table_r[(Y << 8) + V]; *(rgb_data + 3 * i + 1) = table_g[(Y << 16) + (U << 8) + V]; *(rgb_data + 3 * i + 2) = table_b[(Y << 8) + U]; }}
小结
本文旨在揭示CUDA的硬件框架,通过对比实验发现硬件解码还是强大的,GTX970能够做到720p视频大约800fps的速度。我也是基于此框架,实现了一套基于cuda的多路视频硬件解码C++接口,输出opencv mat格式图像,做后续视频分析。
由于时间关系,行文较为仓促,错误或者讲的不清楚的地方,大家可以给我留言。
- CUDA和FFMPEG硬件解码视频流
- 嵌入式Linux下基于FFmpeg的视频硬件编解码
- 嵌入式Linux下基于FFmpeg的视频硬件编解码
- 嵌入式Linux下基于FFmpeg的视频硬件编解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFMPEG 硬件解码
- FFmpeg解码视频和SDL显示
- 使用FFmpeg+GDI解码和播放视频
- 视频硬件编解码
- FFmpeg实时解码MediaRecorder视频流
- 利用ffmpeg 解码ps网络流视频
- 滚动图片
- 重新编译python报错
- LCS最长公共子序列
- record-mr-hive
- 文章标题
- CUDA和FFMPEG硬件解码视频流
- Linux系统线上问题定位
- Arduino重置-复位
- 最大公约数,最小公倍数
- Layered windows and UpdateLayeredWindow 分层窗口
- iOS 币行(BTC 、LTC、LTH、ETC) 简介
- 623. Add One Row to Tree
- 设计模式六大原则---单一职责原则
- FFmpeg工具介绍