Parallel Reduction --- (4) Free Loops
Source: Internet. Editor: 程序博客网. Time: 2024/05/20 04:48
Abstract
This post discusses how to free (unroll) unnecessary loops in CUDA code.
1. Free Loops on CPU
When talking about “unrolling” or “freeing loops”, the first thing that comes to my mind is “#pragma unroll”, a compiler optimization directive. According to Wikipedia, loop unrolling is a technique that attempts to optimize a program’s execution speed at the expense of its binary size. For example, take the original code:
for (int i = 0; i < 4; i++) {
    cout << "hello world" << endl;
}
Unrolling removes the loop-control instructions (the counter increment and the end-of-loop test) in pursuit of higher performance, at the unavoidable cost of a larger binary. The code then looks like this:
cout << "hello world" << endl;
cout << "hello world" << endl;
cout << "hello world" << endl;
cout << "hello world" << endl;
Remember that there is no free lunch: loop unrolling is a classic space-time tradeoff.
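The same idea applies to loops that do real work. The following is a minimal sketch, not taken from the original post: the function names `sum_rolled` and `sum_unrolled4` are hypothetical, and the unroll factor of 4 is an arbitrary choice for illustration. It shows a summation loop unrolled by hand, plus the remainder loop needed when the length is not a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward loop: one loop-control check per element.
long long sum_rolled(const std::vector<int>& v) {
    long long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        total += v[i];
    }
    return total;
}

// Hand-unrolled by a factor of 4 (hypothetical example):
// one loop-control check per four elements, at the cost of a longer body.
long long sum_unrolled4(const std::vector<int>& v) {
    long long total = 0;
    const std::size_t n = v.size();
    const std::size_t limit = n - n % 4;  // largest multiple of 4 not exceeding n
    std::size_t i = 0;
    for (; i < limit; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    // Remainder loop: handles the last n % 4 elements.
    for (; i < n; ++i) {
        total += v[i];
    }
    assert(i == n);  // sanity check: every element was visited
    return total;
}
```

Both functions return the same value for any input. Whether the unrolled version is actually faster depends on the compiler and the target, since optimizing compilers often unroll such loops on their own.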
2. Free Loops on GPU
The key part of the reduction code in CUDA is shown below.
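// reduction
// The strides are hard-coded, so data must hold 1024 elements per block.
if (tid < 512) { data[tid] += data[tid + 512]; }
__syncthreads();
if (tid < 256) { data[tid] += data[tid + 256]; }
__syncthreads();
if (tid < 128) { data[tid] += data[tid + 128]; }
__syncthreads();
if (tid < 64) { data[tid] += data[tid + 64]; }
__syncthreads();
// Last warp: no __syncthreads() between the remaining steps.
// This relies on the 32 threads of a warp executing in lockstep. For that to
// be safe, data should be accessed through a volatile pointer, and on GPUs
// with independent thread scheduling (Volta and later) a __syncwarp() is
// needed between steps.
if (tid < 32) {
    data[tid] += data[tid + 32];
    data[tid] += data[tid + 16];
    data[tid] += data[tid + 8];
    data[tid] += data[tid + 4];
    data[tid] += data[tid + 2];
    data[tid] += data[tid + 1];
}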
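The stride pattern of the fragment above can be checked on the host. The following is a rough sketch under stated assumptions, not the author's code: `reduce_step` and `reduce_block_unrolled` are hypothetical names, the element type is assumed to be `int`, and the GPU threads are simulated by a sequential loop over `tid`. It only illustrates which elements are combined at each unrolled step and says nothing about performance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One reduction step: each of the first `active` simulated threads adds the
// element `stride` positions ahead into its own slot. Visiting tid in
// ascending order reads data[tid + stride] before the simulated thread
// tid + stride overwrites it, which mimics "all threads read, then write".
static void reduce_step(std::vector<int>& data, std::size_t active,
                        std::size_t stride) {
    for (std::size_t tid = 0; tid < active; ++tid) {
        data[tid] += data[tid + stride];
    }
}

// Host-side imitation of the unrolled reduction for one block of 1024
// elements (hypothetical helper). Returns the block sum from data[0].
int reduce_block_unrolled(std::vector<int> data) {
    assert(data.size() == 1024);  // the strides below are hard-coded for 1024

    // Steps guarded by "tid < stride" in the kernel fragment.
    reduce_step(data, 512, 512);
    reduce_step(data, 256, 256);
    reduce_step(data, 128, 128);
    reduce_step(data, 64, 64);

    // Last-warp steps: all 32 threads run every step, as in the fragment.
    // Slots other than data[0] may end up holding stale sums; that is
    // harmless because only data[0] is read afterwards.
    reduce_step(data, 32, 32);
    reduce_step(data, 32, 16);
    reduce_step(data, 32, 8);
    reduce_step(data, 32, 4);
    reduce_step(data, 32, 2);
    reduce_step(data, 32, 1);

    return data[0];
}
```

Calling `reduce_block_unrolled` on a vector holding 1, 2, ..., 1024 should give 524800, the same as a plain sequential sum. Because the strides are written out by hand, the sketch (like the kernel fragment) only works for this one block size.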
3. Experimental Results
Freeing (unrolling) the loops yields a further performance improvement. That’s really cool, isn’t it?
4. More details
The CUDA-based reduction code can be viewed on GitHub.