Parallel Reduction --- (2) Remove Unnecessary Modular Arithmetic
Source: Internet  Published by: 一角书屋知乎  Editor: 程序博客网  Time: 2024/05/29 18:05
Abstract
This post tries to improve the performance of the previous reduction algorithm. Specifically, it discusses removing unnecessary modular arithmetic.
1. Modular arithmetic
According to Wikipedia, the modulo operation is costly. To avoid this cost, can we implement the algorithm more efficiently?
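To make the cost concrete, here is a host-side C++ sketch that simulates a modulo-based reduction one thread at a time. It assumes the previous implementation tested `tid % (2 * i) == 0` with one thread per element; that form and the helper name `simulateModuloReduction` are assumptions for illustration, not the author's code. The sketch counts how many `%` operations are evaluated versus how many additions do useful work.

```cpp
#include <cassert>
#include <vector>

// Hypothetical counters returned by the simulation.
struct ModuloStats {
    long long sum;         // final value left in data[0]
    long long modOps;      // number of tid % (2 * i) evaluations
    long long usefulAdds;  // number of additions actually performed
};

// Serially simulates an ASSUMED modulo-based reduction step:
//   if (tid % (2 * i) == 0) data[tid] += data[tid + i];
// with one "thread" per element. Within a step, passing threads write to
// multiples of 2*i and read odd multiples of i, so no thread reads a slot
// another thread writes in the same step; running them in order therefore
// gives the same result as running them in parallel.
ModuloStats simulateModuloReduction(std::vector<long long> data) {
    const int n = static_cast<int>(data.size());
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two size assumed
    ModuloStats s{0, 0, 0};
    for (int i = 1; i < n; i *= 2) {
        for (int tid = 0; tid < n; ++tid) {
            ++s.modOps;  // every thread evaluates the modulo test
            if (tid % (2 * i) == 0) {
                data[tid] += data[tid + i];
                ++s.usefulAdds;
            }
        }
        // A real kernel would call __syncthreads() here.
    }
    s.sum = data[0];
    return s;
}
```

With 1024 ones as input, the sketch reports 10240 modulo evaluations (10 strides times 1024 threads) but only 1023 useful additions: most threads compute a `%` at each stride only to stay idle.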
2. Another Way to Implement
In this case, the answer is yes. We reorganize the key code as follows.
    // reduction
    for (int i = 1; i < 1024; i *= 2) {
        int ntid = 2 * i * tid;   // first index of this thread's pair
        if (ntid < 1024) {
            data[ntid] += data[ntid + i];
        }
        __syncthreads();          // finish this stride first
    }
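The loop above can be checked on the host. The following C++ sketch runs each "thread" of a stride in turn and sums a 1024-element array. The function name `simulateStridedReduction` and the choice of n / 2 threads (one per pair) are assumptions, since the post does not state the block size. It also records how many threads pass the `ntid < n` test at each stride.

```cpp
#include <cassert>
#include <vector>

// Serially simulates the reorganized reduction loop:
//   int ntid = 2 * i * tid;
//   if (ntid < n) data[ntid] += data[ntid + i];
// ASSUMPTION: n / 2 threads, one per pair at the first stride.
// Returns the total left in data[0]; if activePerStep is non-null, appends
// the number of threads that did an addition at each stride.
long long simulateStridedReduction(std::vector<long long> data,
                                   std::vector<int>* activePerStep) {
    const int n = static_cast<int>(data.size());
    assert(n > 1 && (n & (n - 1)) == 0);  // power-of-two size assumed
    const int numThreads = n / 2;
    for (int i = 1; i < n; i *= 2) {
        int active = 0;
        for (int tid = 0; tid < numThreads; ++tid) {
            int ntid = 2 * i * tid;  // index computed by multiplication, no modulo
            if (ntid < n) {
                data[ntid] += data[ntid + i];
                ++active;
            }
        }
        if (activePerStep != nullptr) activePerStep->push_back(active);
        // A real kernel would call __syncthreads() here.
    }
    return data[0];
}
```

Summing 0 through 1023 gives 523776, and the active thread counts per stride are 512, 256, ..., 1. At stride i the active threads are tid 0 through 1024 / (2i) - 1, a contiguous range, and no thread evaluates a `%`.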
An operation diagram illustrating the reorganized loop is presented as follows.
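For a small input, the pairs combined at each stride can also be listed directly. This C++ sketch uses a hypothetical helper, `traceStridedReduction`, to enumerate which thread adds which pair under the indexing `ntid = 2 * i * tid`; it only lists threads that pass the `ntid < n` test.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: for an n-element array, lists which thread adds which
// pair at each stride under ntid = 2 * i * tid. Only threads with ntid < n
// (the ones that do work in the kernel) are listed.
// Each entry reads "i=<stride> tid=<thread>: data[a] += data[b]".
std::vector<std::string> traceStridedReduction(int n) {
    assert(n > 1 && (n & (n - 1)) == 0);  // power-of-two size assumed
    std::vector<std::string> lines;
    for (int i = 1; i < n; i *= 2) {
        for (int tid = 0; 2 * i * tid < n; ++tid) {
            int ntid = 2 * i * tid;
            lines.push_back("i=" + std::to_string(i) + " tid=" + std::to_string(tid) +
                            ": data[" + std::to_string(ntid) + "] += data[" +
                            std::to_string(ntid + i) + "]");
        }
    }
    return lines;
}
```

For n = 8 it yields seven entries: at i = 1, threads 0 to 3 combine (0,1), (2,3), (4,5), (6,7); at i = 2, threads 0 and 1 combine (0,2) and (4,6); at i = 4, thread 0 combines (0,4), leaving the total in data[0].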
3. Experimental Results
In our experiments, the CUDA kernel took 12.304 ms in total, which is faster than the previous implementation.
4. More
The source code is available on GitHub.