Thrust快速入门教程（三） —— Algorithms

来源：互联网发布：知柏地黄丸治疗失眠吗编辑：程序博客网时间：2024/05/17 02:44

　　Thrust提供了丰富的常用并行算法。这算法的功能与STL中的非常相似，于是我们使用了相同的名称（例如thrust::sort与std::sort）。

　　所有的Thrust算法均提供了host端（主机端）和device端（设备端）的实现。当传入迭代器指向主机端时，将会调用主机端方法，当使用迭代器指向设备端时将调用设备端实现。

　　在Thrust的算法中，的迭代器参数指向必须在同一存储位置，或者主机端或者设备端。除了thrust::copy，它可以任意的拷贝主机端和设备端的数据。

Transformations

　　Transformations会对指定输入范围内的元素进行特定操作，并将其结果存储到指定位置。如thrust::fill

thrust::fill(D.begin(), D.begin() + 7, 9); // 将vector D的前7个元素设置为9

　　此外transformations还包括thrust::sequence、thrust::replace、thrust::transform（衍生于C++的for_each）。完整的列表请参考文档。

　　下面的代码演示了几个transformation的用法。其中thrust::negate和thrust::modulus与C++中的仿函数定义是一样的。Thrust在thrust/functional.h中也提供常用的仿函数，如plus与multiplies等。

#include <thrust/device_vector.h>#include <thrust/transform.h>#include <thrust/sequence.h>#include <thrust/copy.h>#include <thrust/fill.h>#include <thrust/replace.h>#include <thrust/functional.h>#include <iostream>int main(void){  // allocate three device_vectors with 10 elements  thrust::device_vector<int> X(10);  thrust::device_vector<int> Y(10);  thrust::device_vector<int> Z(10);  // initialize X to 0,1,2,3, ....  thrust::sequence(X.begin(), X.end());  // compute Y = -X  thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());  // fill Z with twos  thrust::fill(Z.begin(), Z.end(), 2);  // compute Y = X mod 2  thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());  // replace all the ones in Y with tens  thrust::replace(Y.begin(), Y.end(), 1, 10);  // print Y  thrust::copy(Y.begin(), Y.end(), std::ostream_iterator<int>(std::cout, "\n"));  return 0;    }

　　thrust/fuctuional.h中的函数提供了大部分内置算数和比较运算，但是我们想提供更多出色的功能。
　　
　　比如，计算y < - a * x + y，x、y为向量，a为常数。这其实就是我们所熟知的由BLAS提供的SAXPY（a*x+y）运算。如果我们在thrust中实现SAXPY我们有几个选择。
　　　　
　　一个是，我们需要使用两个transformations（一个加和一个乘法），还需要一个vector用于存储a的值。另一更佳选择是使用一个单独的由用户自己定义的transformation，这才是我们真正先要的。下面用源代码解释说明这两种方法。

struct saxpy_functor{  const float a;  saxpy_functor(float _a) : a(_a) {}  __host__ __device__  float operator()(const float& x, const float& y) const  {     return a * x + y;  }};void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y){  // Y <- A * X + Y  thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A));}void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y){  thrust::device_vector<float> temp(X.size());  // temp <- A  thrust::fill(temp.begin(), temp.end(), A);  // temp <- A * X  thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>());  // Y <- A * X + Y  thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>());}

　　saxpy_fast和saxpy_slow都是有效的SAXPY实现，但saxpy_fast会比saxpy_slow更快。忽略temp vector的分配与代数运算的开销，内存读写的开销如下：

fast_saxpy： 2N次读取和 N次写入
slow_saxpy：4N次读取和3N次写入

　　因为SAXPY受到存储器约束（它的性能受限于存储器带宽，而不是浮点计算性能），大量的读写操作使得saxpy_slow开销更加昂贵。而saxpy_fast执行速度与BLAS优化版本中的实现一样快。存储器约束算法通常使用kernel fusion（即融合，将多个操作组合为单一的kernel）的方法以最小化内存的读写交换。

　　thrust::transform只支持一个或者两个输入参数的transformations（例如 f(x) -> y 和 f(x, y) -> z）。当transformation使用多于两个输入参数的时候需要使用其他方法了。

　　Arbitrary_transformation展示了使用thrust::zip_interator和thrust::for_each的解决方案。

Reductions

　　归约算法使用二元运算（operator）将输入序列规约为一个值。例如，需要获得一数列的和，可以通过“加运算”规约得到；数列的最大值，可以通过两个参数的“最大值运算”规约得到。

　　数列求和的规约操作可以由thrust::reduce实现：

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());

　　开始的两个参数定义了需要规约数组的范围，第三和第四个参数分别提供了归约的初始值和相关的operator。事实上，归约求和是很常见的，我们可以选择没有初始值和operator的版本。

int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());int sum = thrust::reduce(D.begin(), D.end(), (int) 0);int sum = thrust::reduce(D.begin(), D.end());

　　虽然thrust::reduce能够有效地满足大部分的规约操作，但是Thrust库提供了另外的一些函数以便使用（类似于STL）。例如，thrust::count能够返回给定序列中特定值的数量。

#include <thrust/count.h>#include <thrust/device_vector.h>...// put three 1s in a device_vectorthrust::device_vector<int> vec(5,0); //vec包含5个值为0的元素vec[1] = 1;vec[3] = 1;vec[4] = 1;// count the 1s // 计算1的数量int result = thrust::count(vec.begin(), vec.end(), 1);// result is three // result = 3

　　其他规约操作，包括thrust::count_if、thrust::min_element、thrust::max_element、thrust::is_sorted、thrust::inner_product等，详细请参考文档。

　　前面SAXPY例子使用的transformation kernel展示了如何通过融合来减少内存交换。我们也可以使用thrust::transform_reduce实现与归约算法的融合。下面的例子用来计算向量的模：

#include <thrust/transform_reduce.h>#include <thrust/functional.h>#include <thrust/device_vector.h>#include <thrust/host_vector.h>#include <cmath>// square<T> computes the square of a number f(x) -> x*xtemplate <typename T>struct square{  __host__ __device__  T operator()(const T& x) const  {     return x * x;  }};int main(void){  // initialize host array  float x[4] = {1.0, 2.0, 3.0, 4.0};  // transfer to device  thrust::device_vector<float> d_x(x, x + 4); //利用x来初始化d_x  // setup arguments  square<float>        unary_op;   //一元运算  thrust::plus<float> binary_op;  //二元运算  float init = 0;  // compute norm  float norm = std::sqrt( thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op) );  std::cout << norm << std::endl;  return 0;}

　　这里平方是一元运算，将输入序列的每个元素平方。接着，平方和使用标准的加法规约得到。
　　当然我们也可以像慢版本的SAXPY transformation实现这个功能：首先用自己定义的平方运算（或者直接使用仿函数multiplies）计算出平方后的值，最后使用归约求和。但是显然这样做会带来不必要的浪费，速度降低。通过融合规约和平方操作，我们就可以获得与自己编写内核一样的高性能。

Prefix-Sums

　　前缀和，也叫scan操作，在stream compaction、基数排序中具有重要的作用。下面的源码将举例说明使用默认加法的inclusive scan：

#include <thrust/scan.h>int data[6] = {1, 0, 2, 2, 1, 3};thrust::inclusive_scan(data, data + 6, data);  // in-place scan ，in-place指的是输入输出的位置一样的// data is now {1, 1, 3, 5, 6, 9}

　　Inclusive scan的每个输出元素为输入数列的相应部分和。例如，data[2] = data[0] + data[1] + data[2]。Exclusive scan类似，但是右移一个位置：

#include <thrust/scan.h>int data[6] = {1, 0, 2, 2, 1, 3};thrust::exclusive_scan(data, data + 6, data); // in-place scan// data is now {0, 1, 1, 3, 5, 6}

　　现在为data[2] = data[0] + data[1]。由例子可见，inclusive_sacn与exclusive_scan允许原址操作。

　　Thrust也提供了函数transform_inclusive_scan与transform_exclusive_scan可以实现在scan操作前对输入数列进行一元运算。完整的scan说明请参见文档。
　　

Reordering

　　Thrust通过下面的算法为partitioning和stream compaction提供支持：

copy_if：保留Predicate的结果中为真的元素。
partition：根据Predicate的结果重新排列元素(真值对应的元素在前，其余元素在后)。
remove/remove_if：删除Predicate的结果中为真的元素。
unique：移除连续重复的元素。

注：Predicate是一个一元函数，用来判断真或假。例如，函数is_positive()用来判断是否是正数。

#include <thrust/partition.h>...struct is_even{  __host__ __device__  bool operator()(const int &x)  {    return (x % 2) == 0;  }};...int A[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};const int N = sizeof(A)/sizeof(int);thrust::partition(A, A + N,is_even()); //偶数在前，奇数在后// A is now {2, 4, 6, 8, 10, 1, 3, 5, 7, 9}

　　完整的函数和例子参见文档。

Sorting

　　Thrust提供一系列函数，可以根据既定的准则来分类数据和重排数据。thrust::sort和thrust::stable_sort函数对应STL中的sort和stable_sort。

#include <thrust/sort.h>...const int N = 6;int A[N] = {1, 4, 2, 8, 5, 7};thrust::sort(A, A + N);// A is now {1, 2, 4, 5, 7, 8}

　　另外，Thrust提供thrust::sort_by_key和thrust::stable_sort_by_key。

#include <thrust/sort.h>...const int N = 6;int    keys[N] = {  1,   4,   2,   8,   5,   7};char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};thrust::sort_by_key(keys, keys + N, values);// keys is now   {  1,   2,   4,   5,   7,   8}// values is now {'a', 'c', 'b', 'e', 'f', 'd'}

　　跟STL类似，sorting函数同样接受用户自定义对比操作。

#include <thrust/sort.h>#include <thrust/functional.h>...const int N = 6;int A[N] = {1, 4, 2, 8, 5, 7};thrust::stable_sort(A, A + N, thrust::greater<int>());// A is now {8, 7, 5, 4, 2, 1}

参考：

Thrust快速入门教程（三）
thrust: Modules
【C++ STL】深入解析神秘的 — 仿函数 - 小田的专栏 - 博客频道 - CSDN.NET
GPU Gems 3 - Chapter 39. Parallel Prefix Sum (Scan) with CUDA

0 0