A GPU Core-Assisted Acceleration Strategy Based on Flexible Data Compression
A brief report on the paper "A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps". The paper presents a comprehensive design and evaluation.
Off-chip memory bandwidth is one of the main bottlenecks of GPU execution, and it leaves GPU computational resources idle. The paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework, which uses otherwise unutilized on-chip resources to address this idleness.
CABA employs hardware that is available on-chip but underutilized, and it offers flexibility in the choice of (hardware-based) compression algorithm for different applications; when an application cannot benefit from compression, CABA can simply disable it. Unutilized compute resources arise from compute, memory, and data-dependence stalls. On-chip memory is also left unutilized, because occupancy is constrained by the available registers and shared memory, by the hard limits on the number of threads and thread blocks per core, and by the number of thread blocks actually resident. CABA's helper threads must be low overhead and therefore need to be treated differently from regular threads: they should be easy to manage (to enable, trigger, and kill), and flexible enough to adapt to the runtime behavior of the regular program while communicating with the original threads.
Assist warps compress data, executing code that speeds up application execution, and they share the same context as the regular warp to simplify scheduling and data communication. Assist warps compress cache blocks before they are written to memory, and decompress them before they are placed into the cache.
The CABA framework is based on hardware/software co-design: a pure software approach would incur high overhead, while a pure hardware approach would make register allocation and data communication more difficult. At the hardware level, sequences of instructions are dynamically inserted into the execution stream. The authors track and manage these instructions at the granularity of a warp, called an assist warp. An assist warp does not own a separate context; it shares both a context and a warp ID with a regular warp. Different actions require different numbers of registers in the helper thread, and these registers have short lifetimes. The subroutine of an assist warp can be written either with CUDA extensions using PTX instructions or by the microarchitecture in the internal GPU instruction format. There are three main hardware additions: the Assist Warp Store, the Assist Warp Controller, and the Assist Warp Buffer.
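To make the division of labor among these structures concrete, the toy C++ model below sketches one plausible control flow: the controller watches for trigger events, looks up the registered assist warp routine in the store, and uses the buffer as staging space. The event names and interfaces are illustrative assumptions; in real hardware the routine is a stored instruction sequence injected into the pipeline of the parent warp, not a host-side callback.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <vector>

// Toy software model of the three CABA hardware structures (sketch only).
using CacheBlock    = std::vector<uint8_t>;
using AssistRoutine = std::function<void(CacheBlock&)>;

enum class TriggerEvent { CacheFill, CacheWriteback };  // illustrative events

struct AssistWarpStore {
    // Holds the assist warp subroutine registered for each trigger event.
    std::unordered_map<TriggerEvent, AssistRoutine> routines;
};

struct AssistWarpBuffer {
    // Staging space for blocks an assist warp is currently operating on.
    std::queue<CacheBlock> staged;
};

struct AssistWarpController {
    AssistWarpStore&  store;
    AssistWarpBuffer& buffer;

    // On a trigger event, fetch the matching routine and run it on the block,
    // using the buffer as temporary staging space before "issue".
    void on_event(TriggerEvent e, CacheBlock block) {
        auto it = store.routines.find(e);
        if (it == store.routines.end()) return;   // no assist warp registered
        buffer.staged.push(std::move(block));
        it->second(buffer.staged.back());         // "issue" the assist warp
        buffer.staged.pop();                      // block leaves the buffer
    }
};
```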
To compress the data, Base-Delta-Immediate (BDI) compression is used. BDI represents a cache line with low dynamic range using a common base (or multiple bases) and an array of deltas. The authors view a cache line as a set of fixed-size values, so decompression is simply a masked vector addition of the deltas to the appropriate bases.
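As a concrete illustration, the sketch below implements one BDI encoding (8-byte base, 1-byte deltas) for a small 32-byte line. The line size, the single base, and the omission of the zero/immediate base are simplifying assumptions, not the full scheme from the paper.

```cpp
#include <cstdint>

// One BDI encoding: a 32-byte line viewed as four 8-byte values, stored as a
// single 8-byte base plus four signed 1-byte deltas (12 bytes total). The real
// scheme also tries other base/delta widths and a second (zero) base for
// "immediate" values, which this sketch omits.
struct Bdi8_1 {
    uint64_t base;
    int8_t   delta[4];
};

// Compress: succeeds only if every value lies within a signed 8-bit delta of
// the base taken from the first word of the line.
bool bdi_compress(const uint64_t line[4], Bdi8_1& out) {
    out.base = line[0];
    for (int i = 0; i < 4; ++i) {
        int64_t d = static_cast<int64_t>(line[i] - out.base);  // wrap-around delta
        if (d < INT8_MIN || d > INT8_MAX) return false;        // delta too wide
        out.delta[i] = static_cast<int8_t>(d);
    }
    return true;
}

// Decompress: add each delta back to the base (a masked vector add in hardware).
void bdi_decompress(const Bdi8_1& in, uint64_t line[4]) {
    for (int i = 0; i < 4; ++i)
        line[i] = in.base + static_cast<uint64_t>(static_cast<int64_t>(in.delta[i]));
}
```

Under these assumptions a compressible line shrinks from 32 bytes to 12, which is where the off-chip bandwidth savings come from.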
Using CABA for memory compression improves system performance by 41.7% on average across a set of bandwidth-sensitive GPU applications.