Notes on Parallelizing OpenCV with OpenCL (1)

Source: Internet | Editor: 程序博客网 | Time: 2024/06/03 21:04

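I have spent more than a year parallelizing OpenCV and finally have something to show for it. The software is about to be released, so it is time for a summary. My writing is not great and I am lazy, so these are only casual notes, mostly a record for myself.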

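Probably the most important type in OpenCV is 8UC1. Grayscale images are usually stored in this type, and most algorithms support only this data type, so optimizing for 8UC1 became the top priority of the whole effort. Taking matrix add as an example, the simplest way to write the kernel is: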

__kernel void matrix_add(__global uchar* src1, __global uchar* src2, __global uchar* dst,
                         int rows, int cols, int src1_step, int src2_step, int dst_step)
{
    int x = get_global_id(0); // column index
    int y = get_global_id(1); // row index
    // Skip work-items that fall outside the image.
    if (x < cols && y < rows)
        dst[mad24(y, dst_step, x)] = src1[mad24(y, src1_step, x)] + src2[mad24(y, src2_step, x)];
}
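The index arithmetic is easier to check on the CPU. The following is a minimal C sketch, not part of the original code, that emulates the scalar kernel with two loops standing in for the 2-D NDRange. The function name `matrix_add_ref` is made up for illustration, and `mad24(a, b, c)` is written out as `a * b + c`. Each step is assumed to be the row pitch in bytes, which can be larger than `cols` when rows are padded; that is why the kernel indexes with the step rather than with `cols`.

```c
#include <assert.h>
#include <stdint.h>

/* CPU emulation of the scalar kernel (hypothetical helper, for illustration).
 * Steps are row pitches in bytes and may exceed cols because of padding. */
void matrix_add_ref(const uint8_t *src1, const uint8_t *src2, uint8_t *dst,
                    int rows, int cols,
                    int src1_step, int src2_step, int dst_step)
{
    assert(src1_step >= cols && src2_step >= cols && dst_step >= cols);
    for (int y = 0; y < rows; ++y) {        /* stands in for get_global_id(1) */
        for (int x = 0; x < cols; ++x) {    /* stands in for get_global_id(0) */
            /* The sum is stored back into a byte, so it wraps modulo 256,
             * the same as the kernel's assignment to a uchar. */
            dst[y * dst_step + x] =
                (uint8_t)(src1[y * src1_step + x] + src2[y * src2_step + x]);
        }
    }
}
```

Note that the sum wraps on overflow here, as it does in the kernel, whereas OpenCV's `cv::add` saturates 8-bit results; if saturation is wanted, OpenCL's `add_sat` built-in would be the natural replacement for `+`.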

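This version is very concise but not efficient: it does not make full use of the GPU's memory bandwidth and wastes a lot of it. To use the bandwidth properly, each work-item should write at least 4 bytes at a time, so the kernel becomes: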

__kernel void matrix_add(__global uchar* src1, __global uchar* src2, __global uchar* dst,
                         int rows, int cols, int src1_step, int src2_step, int dst_step)
{
    int x = get_global_id(0) << 2; // each work-item handles 4 consecutive bytes
    int y = get_global_id(1);      // row index
    if (x < cols && y < rows)
        *(__global uchar4*)(dst + mad24(y, dst_step, x)) =
            *(__global uchar4*)(src1 + mad24(y, src1_step, x)) +
            *(__global uchar4*)(src2 + mad24(y, src2_step, x));
}

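This makes it roughly four times faster, but it can only be written this way when there is no ROI to deal with. The `uchar4` casts assume that every row starts at a 4-byte-aligned address and that writing up to 3 bytes past `cols` is harmless, and neither holds in general for a ROI. With a ROI, things get much more complicated.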
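To give a feel for the extra work a ROI requires, here is a C sketch of one possible approach. It is an illustration under stated assumptions, not the solution actually used in the project. Each emulated work-item still covers 4 bytes, but full groups are moved with `memcpy` (which, like OpenCL's `vload4`/`vstore4` on `uchar`, does not require 4-byte alignment), and the last partial group in a row is handled byte by byte so nothing outside the ROI is written. The names `add4_item` and `matrix_add4_ref` are hypothetical, and the pointers are assumed to address the first pixel of the ROI.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One emulated work-item: handles columns [4*gx, 4*gx + 4) of row y.
 * Hypothetical helper; src1/src2/dst point at the ROI's first pixel. */
void add4_item(const uint8_t *src1, const uint8_t *src2, uint8_t *dst,
               int gx, int y, int rows, int cols,
               int src1_step, int src2_step, int dst_step)
{
    int x = gx << 2;
    if (x >= cols || y >= rows)
        return;

    const uint8_t *p1 = src1 + y * src1_step + x;
    const uint8_t *p2 = src2 + y * src2_step + x;
    uint8_t *pd = dst + y * dst_step + x;

    if (x + 4 <= cols) {
        /* Full group: unaligned 4-byte load, add, 4-byte store. */
        uint8_t a[4], b[4], r[4];
        memcpy(a, p1, 4);
        memcpy(b, p2, 4);
        for (int i = 0; i < 4; ++i)
            r[i] = (uint8_t)(a[i] + b[i]);   /* wraps like uchar4 + uchar4 */
        memcpy(pd, r, 4);
    } else {
        /* Partial group at the right edge: stay inside the ROI. */
        for (int i = 0; x + i < cols; ++i)
            pd[i] = (uint8_t)(p1[i] + p2[i]);
    }
}

/* Emulates the NDRange: ceil(cols / 4) work-items per row. */
void matrix_add4_ref(const uint8_t *src1, const uint8_t *src2, uint8_t *dst,
                     int rows, int cols,
                     int src1_step, int src2_step, int dst_step)
{
    assert(rows >= 0 && cols >= 0);
    for (int y = 0; y < rows; ++y)
        for (int gx = 0; gx < (cols + 3) / 4; ++gx)
            add4_item(src1, src2, dst, gx, y, rows, cols,
                      src1_step, src2_step, dst_step);
}
```

This sketch only addresses correctness. Unaligned vector accesses may be slower than aligned ones on some devices, so how much of the fourfold gain survives would have to be measured on the target hardware.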