CUDA Parallel Reduction (Neighbored Pairs)
Source: Internet  Editor: 程序博客网  Time: 2024/06/01 15:12
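#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

// Error-checking macro: print the location and the CUDA error string, then exit.
#define CHECK(call)                                                \
{                                                                  \
    const cudaError_t status = call;                               \
    if (status != cudaSuccess)                                     \
    {                                                              \
        printf("File: %s, function: %s, line: %d\n", __FILE__,     \
               __FUNCTION__, __LINE__);                            \
        printf("%s\n", cudaGetErrorString(status));                \
        exit(1);                                                   \
    }                                                              \
}

// Kernel: neighbored-pair reduction inside each block, done in place on global memory.
__global__ void Kernel(int *d_data, int *d_local_sum, int N)
{
    int tid = threadIdx.x;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Pointer to the first element owned by this block.
    int *data = d_data + blockIdx.x * blockDim.x;

    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        // Bounds are checked here rather than with an early return, so that every
        // thread of the block reaches __syncthreads() and no read leaves the
        // block's segment or the array.
        if (tid % (2 * stride) == 0 && tid + stride < blockDim.x && index + stride < N)
        {
            data[tid] += data[tid + stride];
        }
        __syncthreads();
    }
    // Thread 0 writes the partial sum of this block.
    if (tid == 0)
    {
        d_local_sum[blockIdx.x] = data[0];
    }
}

// Main function
int main()
{
    // Basic parameters
    CHECK(cudaSetDevice(0));
    const int N = 65536;
    int local_length = 8;   // threads per block
    int total_sum = 0;
    dim3 grid(((N + local_length - 1) / local_length), 1);
    dim3 block(local_length, 1);
    int *h_data = nullptr;
    int *h_local_sum = nullptr;
    int *d_data = nullptr;
    int *d_local_sum = nullptr;

    // Allocate host and device memory and initialize the array
    h_data = (int*)malloc(N * sizeof(int));
    h_local_sum = (int*)malloc(int(grid.x) * sizeof(int));
    CHECK(cudaMalloc((void**)&d_data, N * sizeof(int)));
    CHECK(cudaMalloc((void**)&d_local_sum, int(grid.x) * sizeof(int)));
    for (int i = 0; i < N; i++)
        h_data[i] = int(10 * sin(0.02 * 3.14 * i)); // keep values small so the final sum stays within int range

    // Copy the data to the device
    CHECK(cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice));

    // Launch the kernel and check for launch errors
    Kernel<<<grid, block>>>(d_data, d_local_sum, N);
    CHECK(cudaGetLastError());

    // Copy the per-block partial sums back to the host
    CHECK(cudaMemcpy(h_local_sum, d_local_sum, int(grid.x) * sizeof(int),
                     cudaMemcpyDeviceToHost));

    // Synchronize, release device memory, and reset the device
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_data));
    CHECK(cudaFree(d_local_sum));
    CHECK(cudaDeviceReset());

    // Finish the reduction on the host by adding up the partial sums
    for (int i = 0; i < int(grid.x); i++)
    {
        total_sum += h_local_sum[i];
    }
    printf("%d \n", total_sum);

    free(h_data);
    free(h_local_sum);
    getchar();
    return 0;
}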
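To see which elements each step of the kernel combines, the stride loop can be modeled on the host. The following is a minimal sketch, not part of the original post: the function names `reduce_block_neighbored` and `reduce_neighbored` are hypothetical, the inner `tid` loop stands in for the threads of one block, and `block_size` plays the role of `blockDim.x`. Its total can be compared with a plain sequential sum as a sanity check of the GPU result.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side model of one block of the neighbored-pair reduction (hypothetical helper).
// `data` points to the block's first element, `block_size` models blockDim.x,
// and `valid` is the number of elements of this block that lie inside the array.
int reduce_block_neighbored(int *data, int block_size, int valid)
{
    for (int stride = 1; stride < block_size; stride *= 2)
    {
        // Only "threads" whose id is a multiple of 2*stride do work in this step.
        for (int tid = 0; tid < block_size; tid += 2 * stride)
        {
            // Same bounds guard as on the device: never read past the valid range.
            if (tid + stride < valid)
            {
                data[tid] += data[tid + stride];
            }
        }
    }
    return valid > 0 ? data[0] : 0;
}

// Reduce the whole array block by block, then add the partial sums,
// mirroring the kernel launch followed by the final loop on the host.
// `values` is taken by value because the reduction overwrites its input.
int reduce_neighbored(std::vector<int> values, int block_size)
{
    assert(block_size > 0);
    const int n = static_cast<int>(values.size());
    int total = 0;
    for (int start = 0; start < n; start += block_size)
    {
        const int valid = std::min(block_size, n - start);
        total += reduce_block_neighbored(values.data() + start, block_size, valid);
    }
    return total;
}
```

Calling `reduce_neighbored(values, 8)` with the same contents as `h_data` should give the same total as the program above. In this model, block sizes that are not a power of two or that leave a partial last block also work, because the guard skips any partner that would fall outside the valid range.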