Data Packing Techniques


Reposted from http://dsalli0927.blog.163.com/blog/static/888076072008715535584/

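Memory accesses on the C6000 are expensive. To raise the C6000's data throughput, a single load/store instruction should access more than one data item. When a program operates on a run of consecutive short (16-bit) values, it can use a word (32-bit int) access to fetch two shorts at once and then process them with the matching C6000 instruction, for example the _add2() intrinsic, which performs two 16-bit additions at the same time. This reduces the number of memory accesses. Similarly, on the C64x, a run of consecutive int (32-bit) values can be read and written with double-word memory accesses. This kind of optimization is called data packing.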

For example, using one word access in place of two 16-bit short accesses:

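void vecsum4(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;

    /* N must be even, and all three arrays must be word aligned. */
    #pragma MUST_ITERATE(10)
    for (i = 0; i < N; i += 2)   /* each iteration handles two shorts */
        _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));
}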

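Notes:

#pragma MUST_ITERATE(10) tells the compiler that the loop that follows executes at least 10 times. This information is critical for software pipelining.

Intrinsics such as _amem4 specify the number of bytes in each memory access and state whether the starting address must be aligned. _amem4(&sum[i]) tells the compiler that this is a 4-byte, word-aligned access starting at &sum[i]. _amem4_const(&in1[i]) adds the const qualifier: it indicates that in1 is a constant array whose values do not change in this routine.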
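To make the semantics of the packed add concrete off-target, here is a minimal host-side sketch in portable C. The names add2_emul, load_word, store_word and vecsum_packed are hypothetical stand-ins introduced for illustration, not TI APIs, and the sketch assumes that _add2 adds the two 16-bit halves independently, with no carry between them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for _add2: adds the low halves and the high
   halves separately, so a carry out of the low half never reaches
   the high half. */
static uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = (uint32_t)(((a >> 16) + (b >> 16)) << 16);
    return hi | lo;
}

/* Hypothetical stand-ins for _amem4_const / _amem4: memcpy keeps the
   32-bit access well defined in standard C. */
static uint32_t load_word(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

static void store_word(int16_t *p, uint32_t w)
{
    memcpy(p, &w, sizeof w);
}

/* Packed vector sum: one word load per input for every two shorts.
   n must be even. */
static void vecsum_packed(int16_t *sum, const int16_t *in1,
                          const int16_t *in2, unsigned n)
{
    unsigned i;
    assert((n & 1u) == 0);
    for (i = 0; i < n; i += 2)
        store_word(&sum[i], add2_emul(load_word(&in1[i]), load_word(&in2[i])));
}
```

Calling vecsum_packed(s, a, b, 4) on three 4-element int16_t arrays gives the same element-wise sums as a scalar loop, with 16-bit wraparound on overflow. Because the two lanes are independent, the result does not depend on byte order.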

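The example above assumes an even number of elements. If the count is odd, a few tricks can be applied.

For example, the array length can be padded so that an even number of elements is still processed. If the routine has to handle arbitrary counts, or arrays that may start on a short (halfword) boundary rather than a word boundary, a better approach is to examine the incoming arguments inside the routine and take a different code path for each case:

Example: a general-purpose vector sum routine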

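void vecsum5(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;

    /* test to see if sum, in2 and in1 are aligned to a word boundary */
    if (((int)sum | (int)in2 | (int)in1) & 0x02)
    {
        /* at least one array is only halfword aligned: add one short at a time */
        #pragma MUST_ITERATE(20)
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        /* all arrays are word aligned: add two shorts per iteration */
        #pragma MUST_ITERATE(10)
        for (i = 0; i + 1 < N; i += 2)
            _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));

        /* odd N: add the last element on its own */
        if (N & 0x01) sum[i] = in1[i] + in2[i];
    }
}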
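The alignment test above casts pointers to int, which works on the C6000 but is not portable. As a hedged host-side sketch, the same check can be written with uintptr_t. The helper name all_word_aligned is hypothetical, and it tests both low address bits (mask 0x3) because, unlike the C6000 code, it does not assume the pointers are already halfword aligned.

```c
#include <stdint.h>

/* Hypothetical helper: returns nonzero when all three addresses are
   multiples of 4, i.e. safe for aligned 32-bit word accesses. */
static int all_word_aligned(const void *a, const void *b, const void *c)
{
    uintptr_t bits = (uintptr_t)a | (uintptr_t)b | (uintptr_t)c;
    return (bits & 0x3u) == 0;
}
```

ORing the addresses first means a single mask test covers all three pointers: any low bit set in any address survives into the combined value.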

/////////////////////////////////////////////////////////////////////////

The following example shows a kernel that benefits from the packed compare and expand intrinsics. The Clear Below Threshold kernel scans an image of 8-bit unsigned pixels and sets every pixel that is less than or equal to a given threshold to 0.

Clear Below Threshold Kernel

void clear_below_thresh(unsigned char *restrict image, int count, unsigned char threshold)
{
    int i;

    for (i = 0; i < count; i++)
    {
        if (image[i] <= threshold) image[i] = 0;
    }
}

Vectorization techniques are applied to the code (as described in Packed-Data Processing on the C64x), giving the result shown in the following example. The _cmpgtu4() intrinsic compares each pixel against the threshold, and the _xpnd4() intrinsic turns the comparison results into a mask for setting pixels to 0. Note that the new code has the restriction that the input image must be double-word aligned and must contain a multiple of 8 pixels. These restrictions are reasonable, as common image sizes are a multiple of 8 pixels.
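To show how the compare and expand steps combine, here is a minimal host-side sketch in portable C that handles one 32-bit word (four pixels). The functions cmpgtu4_emul, xpnd4_emul and clear_below_word are hypothetical emulations written for illustration, not TI APIs. They assume that _cmpgtu4 sets result bit n when unsigned byte n of the first operand is greater than byte n of the second, and that _xpnd4 expands bit n into byte n of the result as 0xFF or 0x00.

```c
#include <stdint.h>

/* Hypothetical emulation of _cmpgtu4: one result bit per byte lane,
   set when the unsigned byte of a is greater than the byte of b. */
static uint32_t cmpgtu4_emul(uint32_t a, uint32_t b)
{
    uint32_t bits = 0;
    int lane;
    for (lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        if (x > y)
            bits |= 1u << lane;
    }
    return bits;
}

/* Hypothetical emulation of _xpnd4: bit n becomes byte n, all ones
   when the bit is set and all zeros otherwise. */
static uint32_t xpnd4_emul(uint32_t bits)
{
    uint32_t mask = 0;
    int lane;
    for (lane = 0; lane < 4; lane++)
        if (bits & (1u << lane))
            mask |= 0xFFu << (8 * lane);
    return mask;
}

/* Zero every pixel in the word that is <= threshold. */
static uint32_t clear_below_word(uint32_t pixels, uint8_t threshold)
{
    uint32_t t = threshold * 0x01010101u;   /* replicate into 4 bytes */
    return pixels & xpnd4_emul(cmpgtu4_emul(pixels, t));
}
```

The multiplication by 0x01010101 is a portable substitute for the _pack2/_packl4 replication used in the TI code. ANDing with the expanded mask removes the per-pixel branch, which is what lets the loop software-pipeline well on the target.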

Clear Below Threshold Kernel, Using _cmpgtu4 and _xpnd4 Intrinsics

void clear_below_thresh(unsigned char *restrict image, int count, unsigned char threshold)
{
    int i;
    unsigned t3_t2_t1_t0;               /* Threshold (replicated) */
    unsigned p7_p6_p5_p4, p3_p2_p1_p0;  /* Pixels */
    unsigned c7_c6_c5_c4, c3_c2_c1_c0;  /* Comparison results */
    unsigned x7_x6_x5_x4, x3_x2_x1_x0;  /* Expanded masks */

    /* Replicate the threshold value four times in a single word */
    unsigned temp = _pack2(threshold, threshold);

    t3_t2_t1_t0 = _packl4(temp, temp);

    for (i = 0; i < count; i += 8)
    {
        /* Load 8 pixels from input image (one double-word). */
        p7_p6_p5_p4 = _hi(_amemd8(&image[i]));
        p3_p2_p1_p0 = _lo(_amemd8(&image[i]));

        /* Compare each of the pixels to the threshold. */
        c7_c6_c5_c4 = _cmpgtu4(p7_p6_p5_p4, t3_t2_t1_t0);
        c3_c2_c1_c0 = _cmpgtu4(p3_p2_p1_p0, t3_t2_t1_t0);

        /* Expand the comparison results to generate a bitmask. */
        x7_x6_x5_x4 = _xpnd4(c7_c6_c5_c4);
        x3_x2_x1_x0 = _xpnd4(c3_c2_c1_c0);

        /* Apply mask to the pixels. Pixels that were less than or */
        /* equal to the threshold will be forced to 0 because the */
        /* corresponding mask bits will be all 0s. The pixels that */
        /* were greater will not be modified, because their mask */
        /* bits will be all 1s. */
        p7_p6_p5_p4 = p7_p6_p5_p4 & x7_x6_x5_x4;
        p3_p2_p1_p0 = p3_p2_p1_p0 & x3_x2_x1_x0;

        /* Store the thresholded pixels back to the image. */
        _amemd8(&image[i]) = _itod(p7_p6_p5_p4, p3_p2_p1_p0);
    }
}