A Core-Assisted Acceleration Strategy for GPUs Based on Flexible Data Compression

A brief report on the paper A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps, which presents a comprehensive design and evaluation.
Off-chip memory bandwidth is one of the main bottlenecks of GPU execution; when execution stalls on memory, the GPU's computational resources sit idle. The paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework, which uses otherwise unutilized on-chip resources to address this idleness.


CABA employs hardware that is available on-chip but underutilized, and offers flexibility in the choice of (hardware-based) compression algorithm for different applications; when an application cannot benefit from compression, CABA can simply disable it. Unutilized compute resources arise from compute, memory, and data-dependence stalls. Unutilized on-chip memory is bounded by the available registers and shared memory, the hard limits on the number of threads and thread blocks per core, and the number of thread blocks in the occupancy. CABA's helper threads must be low overhead and need to be treated differently from regular threads: to keep overhead low, a helper thread should be easy to manage (to enable, trigger, and kill), and it should be flexible enough to adapt to the runtime behavior of the regular program while communicating with the original threads. A minimal sketch of this lifecycle appears below.
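The following C++ sketch illustrates this management model under stated assumptions: AssistWarp, AssistEvent, and on_event are names invented here for exposition, since the paper realizes this lifecycle in hardware rather than as a software API.

    #include <cstdio>

    enum class AssistEvent { Trigger, Kill };

    struct AssistWarp {
        int  parent_warp_id;   // shares context/warp ID with a regular warp
        bool active = false;   // enabled or disabled at low overhead
    };

    // A trigger event (e.g., a cache fill or writeback caused by the
    // parent warp) activates the assist warp; a kill event deactivates
    // it, e.g., when the application does not benefit from compression.
    void on_event(AssistWarp& aw, AssistEvent e) {
        aw.active = (e == AssistEvent::Trigger);
    }

    int main() {
        AssistWarp aw{/*parent_warp_id=*/3};
        on_event(aw, AssistEvent::Trigger);
        std::printf("warp %d active=%d\n", aw.parent_warp_id, aw.active);
        on_event(aw, AssistEvent::Kill);
        std::printf("warp %d active=%d\n", aw.parent_warp_id, aw.active);
        return 0;
    }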


Assist warps execute code, such as data compression, to speed up application execution, and share the same context as the regular warp to simplify scheduling and data communication. They compress cache blocks before the blocks are written to memory, and decompress cache blocks before the blocks are placed into the cache, as sketched below.
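A sketch of these two interception points, assuming placeholder compress_line/decompress_line routines that simply copy bytes; a real assist warp would run a compression algorithm such as the BDI scheme described later.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    constexpr int kLineBytes = 32;
    struct CacheLine { uint8_t bytes[kLineBytes]; };

    // Placeholder for the assist-warp compression subroutine; a real one
    // would return fewer than kLineBytes for compressible lines.
    int compress_line(const CacheLine& in, uint8_t* out) {
        std::memcpy(out, in.bytes, kLineBytes);
        return kLineBytes;
    }

    // Placeholder for the assist-warp decompression subroutine.
    void decompress_line(const uint8_t* in, int /*len*/, CacheLine* out) {
        std::memcpy(out->bytes, in, kLineBytes);
    }

    // Writeback path: compress the block before it is written to memory.
    int on_writeback(const CacheLine& line, uint8_t* mem_buf) {
        return compress_line(line, mem_buf);
    }

    // Fill path: decompress the block before it is placed into the cache.
    void on_fill(const uint8_t* mem_buf, int len, CacheLine* line) {
        decompress_line(mem_buf, len, line);
    }

    int main() {
        CacheLine line{}, restored{};
        uint8_t mem[kLineBytes];
        int len = on_writeback(line, mem);
        on_fill(mem, len, &restored);
        std::printf("stored %d bytes\n", len);
        return 0;
    }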


The CABA framework is a hardware/software co-design: a pure-software approach would incur high overhead, while a pure-hardware approach would make register allocation and data communication more difficult. At the hardware level, sequences of instructions are dynamically inserted into the execution stream. The authors track and manage these instructions at the granularity of a warp, calling them assist warps. An assist warp does not own a separate context; it shares both a context and a warp ID with a regular warp. Different actions require the helper thread to use different numbers of registers, which have short lifetimes. The assist warp's subroutines can be written either with CUDA extensions using PTX instructions or by the microarchitecture in the internal GPU instruction format. Three main hardware structures are added: the Assist Warp Store, the Assist Warp Controller, and the Assist Warp Buffer; a conceptual model follows.
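A conceptual C++ model of these three structures; the capacities, field layouts, and function names are assumptions for illustration, since the real structures are microarchitectural rather than software objects.

    #include <cstdint>
    #include <cstdio>

    constexpr int kMaxRoutines = 8;    // assumed capacities
    constexpr int kMaxInsts    = 32;

    struct AssistWarpStore {           // holds assist-warp subroutines
        uint32_t code[kMaxRoutines][kMaxInsts];
        int      length[kMaxRoutines];
    };

    struct AssistWarpBuffer {          // stages instructions for issue
        uint32_t pending[kMaxInsts];
        int      count = 0;
    };

    // The Assist Warp Controller reacts to a trigger event by reading a
    // routine out of the store and queuing its instructions, which are
    // then dynamically inserted into the regular execution stream.
    void controller_dispatch(const AssistWarpStore& store, int routine,
                             AssistWarpBuffer& buf) {
        for (int i = 0; i < store.length[routine]; ++i)
            buf.pending[buf.count++] = store.code[routine][i];
    }

    int main() {
        AssistWarpStore store{};
        store.length[0] = 2;           // a two-instruction dummy routine
        store.code[0][0] = 0xA0; store.code[0][1] = 0xA1;
        AssistWarpBuffer buf;
        controller_dispatch(store, 0, buf);
        std::printf("queued %d instructions\n", buf.count);
        return 0;
    }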


To compress the data, Base-Delta-Immediate (BDI) compression is used. BDI represents a cache line with low dynamic range using a common base (or multiple bases) and an array of deltas. The authors view a cache line as a set of fixed-size values, so decompression is simply a masked vector addition of the deltas to the appropriate bases. A simplified single-base encoding is sketched below.
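A minimal sketch of one BDI encoding (a single 8-byte base with 1-byte deltas, one of several base/delta size combinations in BDI); the function names are assumed for illustration.

    #include <cstdint>
    #include <cstdio>

    // Try to encode 4 x 64-bit values as one 8-byte base plus 4 x 1-byte
    // deltas. Returns false if any delta does not fit in one byte.
    bool bdi_compress(const uint64_t line[4], uint64_t* base,
                      int8_t deltas[4]) {
        *base = line[0];                         // first value is the base
        for (int i = 0; i < 4; ++i) {
            int64_t d = (int64_t)(line[i] - *base);
            if (d < -128 || d > 127) return false;
            deltas[i] = (int8_t)d;
        }
        return true;                             // 32 bytes -> 12 bytes
    }

    // Decompression is a vector addition of the deltas to the base.
    void bdi_decompress(uint64_t base, const int8_t deltas[4],
                        uint64_t line[4]) {
        for (int i = 0; i < 4; ++i)
            line[i] = base + (int64_t)deltas[i];
    }

    int main() {
        uint64_t line[4] = {0x1000, 0x1008, 0x1010, 0x1018};  // low range
        uint64_t base, out[4];
        int8_t deltas[4];
        if (bdi_compress(line, &base, deltas)) {
            bdi_decompress(base, deltas, out);
            std::printf("base=%llx out[3]=%llx\n",
                        (unsigned long long)base,
                        (unsigned long long)out[3]);
        }
        return 0;
    }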


Using CABA for memory compression improves system performance by 41.7% on average on a set of bandwidth-sensitive GPU applications.