BSGP in Plain Words: Bulk-Synchronous GPU Programming


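This is joint work by Prof. Kun Zhou and Prof. Qiming Hou, published at SIGGRAPH 2008. Time to study it. What matters is the idea: is it any good, why, and how does it work? More details on BSGP can be found on Kun Zhou's homepage: http://www.kunzhou.net/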

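1. Introduction to BSGP

BSGP is a new GPU programming language based on the BSP (bulk synchronous parallel) model. A BSGP program looks like a sequential C program: the programmer writes only a small amount of parallel-specific code, so it is easy to read, write, and maintain. This simplicity does not cost performance, because the compiler takes care of translating BSGP into kernels and of allocating temporary streams in an optimized way. Several case studies show that, compared with CUDA, BSGP matches or beats the performance while greatly reducing code complexity and programming time.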

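1.1 Traditional Programming Models VS. BSGP

Existing GPU programming languages such as Brook and CUDA are all built on the stream processing model (the GPU itself works as a stream processor). Stream processing is a data-centric model: data elements are organized into streams, and kernel functions operate on input streams in parallel to produce output streams. In this stream/kernel style the program is tightly coupled to the underlying data dependencies. Programming in today's stream models is hard...

Benefits of BSGP. First, the learning curve is lower, because the programs look like C. (Two concepts: a barrier is the synchronization construct in BSGP code, described in detail below; a superstep is the code between two barriers. The compiler translates each superstep into one kernel, which is then executed in parallel by many threads. Note that the supersteps themselves, delimited by barriers, run one after another.) Second, BSGP frees the programmer from tedious temporary stream management. Third, data dependencies are handled implicitly: local variables are visible and shared across supersteps. Finally, BSGP makes parallel primitives (e.g. reduce/scan/sort) simple to use, so code reuse is very convenient.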

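1.2 Issues the Compiler Design Has to Address

The BSGP programming model does not map directly onto the GPU's stream processing model, so a compiler is needed to translate BSGP programs into stream programs. Designing this compiler raises the following issues: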

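a. Barrier synchronization

A GPU runs thousands upon thousands of threads. Whichever thread finds a free processing unit gets to run, and resources that hold thread context, such as registers, are reclaimed as soon as a thread finishes. Synchronizing the physical processing units does not affect threads that are not currently executing, so it cannot serve as a barrier for all logical threads. Waiting for every thread to finish does qualify as a barrier, but by then the contexts of many threads have already been destroyed. Since hardware synchronization cannot do the job, it has to be done in software: the compiler automatically adds context-saving code to implement the barrier.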

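b. How to generate efficient stream code

As mentioned above, local variables are visible and shared across supersteps. The compiler therefore has to analyze the data dependencies between supersteps and automatically allocate temporary streams to save and pass these variables along. With limited video memory, keeping the total number of temporary streams small matters a lot. The authors solve this with a graph optimization scheme (covered later); because supersteps execute sequentially, the optimal allocation can be found in polynomial time. Compared with managing temporary streams by hand in CUDA, this approach uses video memory more efficiently.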

1.3 Other Features:

a. Thread manipulation emulation: fork and kill. Very useful, yet other languages have not introduced them.
b. Thread communication intrinsics: remote variable access intrinsics allow efficient communication between threads.
c. Collective primitive operations: the authors provide a library of primitives that implements reduce, scan, sort, and so on.

2. BSGP VS. Stream Processing

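Both BSGP and stream processing follow the SPMD (single program, multiple data) style.
size: the total number of threads, denoted size.
rank: each thread has a rank (0...size-1), though I think "id" would be a better name.
barrier: a synchronization construct. In ordinary stream processing, the only barrier is waiting for a kernel launch to terminate. CUDA-capable hardware does support local synchronization, but synchronizing all threads inside a single kernel is not possible.
Collective operation: an operation that must be executed by all threads at the same time. A typical collective operation (e.g. scan(x)) usually contains a barrier internally, which is rarely seen in ordinary stream processing.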
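To make size, rank, and barrier concrete, here is a minimal sequential emulation in Python. It is only a sketch of the model, not BSGP syntax: run_supersteps, write_square, read_right_neighbour, and the shared dictionary are hypothetical names made up for illustration. Each superstep is run for every rank, and finishing the loop over ranks plays the role of the barrier.

```python
def run_supersteps(size, supersteps):
    """Run each superstep for all ranks; the end of the inner loop acts as the barrier."""
    shared = {}                                       # stands in for GPU global memory
    threads = [{"rank": r, "size": size} for r in range(size)]
    for step in supersteps:                           # supersteps run one after another
        for t in threads:                             # emulates the threads of one kernel launch
            step(t, shared)
        # barrier: every thread has finished this superstep before the next one starts
    return threads


def write_square(t, shared):
    # superstep 1: each thread publishes a value computed from its own rank
    shared[t["rank"]] = t["rank"] ** 2


def read_right_neighbour(t, shared):
    # superstep 2: safe only because the barrier guarantees that all writes are done
    t["neighbour"] = shared[(t["rank"] + 1) % t["size"]]


result = [t["neighbour"] for t in run_supersteps(4, [write_square, read_right_neighbour])]
print(result)  # [1, 4, 9, 0]
```

Without the barrier between the two steps, a thread could read its neighbour's slot before the neighbour has written it, which is exactly why a stream program has to end one kernel launch before the next stage begins.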

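2.1 Let the Code Do the Talking

In the original figure, the left side is the BSGP program and the right side is my reading of it. In total n*3 = 3*3 = 9 threads are spawned, and the red numbers inside the triangles are the logical thread ranks. thread.sortby(v) in the program reassigns the thread ranks in ascending order of the key v, as a stable sort. Below it is the CUDA implementation, which uses the sort routine from CUDPP (with the face ID and the vertex ID combined into the key):

The program has three kernels: before_sort prepares the keys, after_sort unpacks the sorting result and fills pf[], and make_head computes the head pointers. Findfaces() launches these kernels, calls the sort primitive, and maintains the temporary streams.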
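Since the listings themselves are not reproduced here, the following Python sketch shows the idea they implement, under my own assumptions: the three-triangle mesh, the sortby helper, and the variable names f, v, and order are hypothetical, while pf and head follow the names mentioned above. One logical thread is created per (face, corner) pair, the ranks are stably reordered by vertex ID, and a head pointer marks where each vertex's run of faces starts.

```python
def sortby(keys):
    """Stable rank reassignment: new rank i is taken by the thread with old rank order[i]."""
    return sorted(range(len(keys)), key=lambda r: keys[r])  # Python's sorted() is stable


faces = [(0, 1, 2), (1, 2, 3), (0, 2, 3)]        # hypothetical mesh: 3 triangles, 4 vertices
n = len(faces)

# spawn n*3 logical threads; thread i handles corner i % 3 of face i // 3
f = [i // 3 for i in range(n * 3)]               # face ID held by each thread
v = [faces[i // 3][i % 3] for i in range(n * 3)] # vertex ID held by each thread

order = sortby(v)                                # emulates thread.sortby(v)
pf = [f[r] for r in order]                       # faces grouped by vertex
sorted_v = [v[r] for r in order]

# a thread writes a head pointer when its key differs from the previous rank's key
head = {}
for rank, key in enumerate(sorted_v):
    if rank == 0 or sorted_v[rank - 1] != key:
        head[key] = rank

print(pf)    # [0, 2, 0, 1, 0, 1, 2, 1, 2]
print(head)  # {0: 0, 1: 2, 2: 4, 3: 7}
```

The faces touching vertex k are then pf[head[k]] up to the next head pointer; for vertex 2 that is pf[4:7], i.e. faces 0, 1, and 2.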

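2.2 Explicit VS. Implicit Data Flow
In the CUDA code, data flow is specified explicitly through parameter passing, and temporary streams have to be allocated to hold the parameter values. In the BSGP code, local variables stay visible across barriers and no explicit parameter passing is needed. The actual data flow only comes into existence when the compiler generates the stream code, and the compiler allocates and frees temporary streams as needed.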

2.3 Efficient Code Reuse
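In the CUDA code, cudppSort can be reused (it is a function that wraps many kernels). In figure (a) below, sort key preparation and local sorting are two separate stages, so a temporary stream is needed to pass the keys.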

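[Figure: sort key preparation and local sorting, (a) as two separate stages, (b) bundled into one kernel]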

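In (b), before_sort and local_sort are bundled into a single kernel, so the temporary stream is no longer needed. But this is not great either: it means pulling local_sort out of cudppSort, which violates the principles of information hiding and code reuse. With collective functions, however, both goals can be met: one temporary stream is saved, and the code is still reused.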

2.4 Source Code Maintenance
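Everyone knows what maintaining CUDA code is like. Ugh! In BSGP, a multi-superstep algorithm can be treated as a collective function, and using a collective function such as thread.sortby(x) is no different from making an ordinary function call. Any resulting change to the final stream program is left to the compiler.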

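3. Compiling BSGP Programs

3.1 Compiler Design

First, the first issue from 1.2: barrier synchronization. The stream processing model completely decouples physical processing units from logical threads, and whatever synchronization exists is done by the hardware, outside software control. So the compiler is designed to generate extra context-saving code when it compiles a BSGP program for the stream processors, with each thread keeping its own copy of the context. For good performance, this context-saving code must be at least as good as hand-written temporary stream management. Note that most of the time actually goes to storing and loading data (bounded by memory bandwidth) rather than to the generated code, so the compiler aims to minimize the number of values (values, not variables) that have to be saved. Its strategy: save a value if and only if it is defined in superstep X and used in superstep X+Y (Y > 0). Further optimizations such as dead code elimination and conditional constant propagation reduce the number of saved values even more.
The second goal is to reduce memory consumption. The number of allocated temporary streams is minimized under two constraints: 1) every saved value is assigned a temporary stream; 2) values that may be in use at the same time must not be assigned to the same stream.
The third is locality. Where the data lives matters too, so barrier(RANK_REASSIGNED) is introduced: physical thread IDs are fixed within a kernel, but the threads can still be regrouped logically, and that happens at the barrier.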

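3.2 The Compilation Algorithm

A BSGP program is converted into a stream program in the following steps:
1. Inline all calls to functions containing barriers.
Expand the function calls in place, e.g. expand thread.sortby.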

2. Perform optimizations to reduce data dependencies.
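Values that are never used, and constants, are not stored, which cuts the cost of saving context. Dead code elimination is also run here to delete source code that does not contribute to the program's output; this may eliminate some cross-superstep data dependencies.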

3. Separate CPU code and GPU code. Generate kernels and kernel launching code.
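The program is split at the barriers into a sequence of supersteps, and one kernel is generated for each superstep (containing that superstep's code). The interleaved CPU code is placed before the kernel launch code.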

4. Convert references to CPU variables to kernel parameters.
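Work out every parameter that must be passed from the CPU to the GPU; only CPU variables that the GPU code uses at least once are passed. This produces the kernel prototypes and the launch code, and parameter accesses in the BSGP source are converted into dedicated instructions at this point.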

5. Find all values that need to be saved, i.e., values used outside the defining superstep.
Suppose a value is defined in superstep X. Enumerate its later uses: if it is used in some later superstep X+Y, save it; otherwise do not.

6. Generate code to save and load the values found in Step 5.
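The analysis result from the previous step is used to generate the actual value saving and loading code. Each value is saved at the end of the superstep that defines it and loaded at the beginning of each superstep that uses it.
Supporting rank-reassigning barriers may require changes to the value loading code. Each thread is required to assign its previous rank to thread.oldrank after the barrier, and the compiler then places the value loading code after this assignment (thread.oldrank is used when addressing the temporary stream).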

7. Generate temporary stream allocations.
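Finally, temporary streams are generated and assigned to the saved values. The next section describes the graph optimization algorithm used to reduce peak memory consumption.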
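Step 5 is the heart of the context-saving analysis, so here is a small Python sketch of it. It is my own simplification, not the compiler's actual code: values_to_save and the (defs, uses) set representation are hypothetical, and each value is assumed to be defined exactly once, as in SSA form.

```python
def values_to_save(supersteps):
    """supersteps: list of (defs, uses) pairs of sets, in execution order.

    Returns {value: (defining superstep, last using superstep)} for every value
    that is used in a later superstep than the one that defines it.
    """
    defined_in = {}
    live = {}
    for idx, (defs, uses) in enumerate(supersteps):
        for val in uses:
            d = defined_in.get(val)
            if d is not None and d < idx:   # used outside its defining superstep
                live[val] = (d, idx)        # idx only grows, so this ends as the last use
        for val in defs:
            defined_in[val] = idx
    return live


program = [
    ({"a", "b", "t"}, {"t"}),   # superstep 0 defines a, b, t and uses t locally
    ({"c"}, {"a"}),             # superstep 1 defines c and uses a
    (set(), {"b", "c"}),        # superstep 2 uses b and c
]
print(values_to_save(program))
```

Here t never leaves superstep 0, so it is not saved, while a, b, and c each need a slot in some temporary stream: a from superstep 0 to 1, b from 0 to 2, and c from 1 to 2. Step 6 would then emit a store at the end of the defining superstep and a load at the start of each using superstep.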

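3.3 Reducing Peak Memory Consumption

This uses a classic graph theory problem: minimum flow. Allocating as few temporary streams as possible is equivalent to covering all the "value" edges in Figure 4 with the minimum number of paths, each running from source to sink. Figure 4(c) shows that two temporary streams are enough: one red and one blue.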
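To get a feel for the allocation problem without the figure, here is a Python sketch that treats each saved value as a live range over the superstep sequence and greedily reuses streams. This is a simplified stand-in, not the paper's minimum-flow formulation. allocate_streams and the (def_step, last_use_step) representation are hypothetical, and I conservatively assume that a stream can be reused only by a value defined in a strictly later superstep than the previous value's last use.

```python
import heapq


def allocate_streams(live_ranges):
    """live_ranges: {value: (def_step, last_use_step)}. Returns {value: stream id}."""
    assignment = {}
    busy = []        # min-heap of (last_use_step, stream id) for streams holding a value
    next_id = 0
    # visit values in order of their defining superstep (name breaks ties)
    for val, (d, u) in sorted(live_ranges.items(), key=lambda kv: (kv[1][0], kv[0])):
        if busy and busy[0][0] < d:
            _, sid = heapq.heappop(busy)   # earliest-finishing stream is free again
        else:
            sid = next_id                  # nothing is free: open a new stream
            next_id += 1
        assignment[val] = sid
        heapq.heappush(busy, (u, sid))
    return assignment


ranges = {"a": (0, 1), "b": (0, 2), "c": (2, 3)}
print(allocate_streams(ranges))  # {'a': 0, 'b': 1, 'c': 0}
```

In this example a and b are live at the same time and need separate streams, while c can take over a's stream, so two streams cover three values. The sequential order of supersteps is what makes the ranges simple intervals and keeps the problem tractable.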

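3.4 Implementation

The authors implemented the BSGP compiler on top of the CUDA GPU stream environment. Compilation takes the following steps:
1. The source code is compiled to static single assignment form (SSA).
The source code is first preprocessed into a .i file.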

2. The algorithm in Section 3.2 is carried out on each spawn block’s SSA form.
Run the compilation algorithm described in 3.2.

3. Generated kernels are translated to CUDA assembly code, on which the CUDA assembler is applied. The resulting binary code is inserted into the CPU code as a constant array, and CUDA API calls are generated to load the binary code.
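The generated kernels are translated into CUDA assembly code (PTX?), which the CUDA assembler then processes. The resulting binary code is embedded back into the CPU code as a constant array, and CUDA API calls are generated to load it.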

4. The object file or executable is generated from the CPU code by a conventional CPU compiler.
A conventional CPU compiler generates the object file or executable from the CPU code.

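The authors chose the SSA representation mainly to simplify data dependence analysis. Since an SSA variable is assigned exactly once, the notion of a "value" used earlier is equivalent to an SSA "variable", which greatly simplifies the analysis. Working on SSA also makes it possible to generate optimized assembly code without invoking CUDA's built-in high-level compiler, which reduces compilation time. CUDA-specific optimizations such as register allocation and instruction scheduling are still applied, because most of them happen in the assembler rather than in the CUDA compiler.
CUDA also provides local communication and cached memory reads. Making good use of these features is very important for implementing parallel primitives such as scan. They do not conflict with BSGP: the programmer is given intrinsics similar to CUDA's original high-level API, and these are optimized when they are inlined as assembly. CUDA provides a built-in profiler for kernels; to use it with BSGP programs, the compiler generates a log file that maps the generated kernels to locations in the BSGP source.
Although the current implementation is based on the CUDA stream environment, the compilation algorithm does not depend on CUDA-specific features and could also be used on other platforms such as ATI's.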

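Limitation. The BSGP compilation algorithm cannot handle flow control across barriers. The original BSP model allows barriers to be placed inside flow control structures, on the assumption that every barrier is reached either by all threads or by none. That feature suits applications that are parallel from start to finish. In stream processing, however, a control processor is available to handle uniform flow control, so the need for such barriers is greatly reduced. The limitation comes from stream processing itself: a barrier can only be reached when a kernel launch terminates.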

--------------------------------------------------------------------------
That is all for the compiler part. What follows is the language syntax, which is short and simple, posted here for reference.
--------------------------------------------------------------------------

4. BSGP Language Constructs

spawn and barrier (see Listing 1)
Cooperation with CPU using require (see Listing 1)
Emulating thread manipulation (fork and kill)
Reducing barriers using par
Communication intrinsics (thread.get and thread.put)
Primitive operations

A BSGP Primitive Operations
Currently, we provide three kinds of BSGP primitives. The implementation of these primitives is described in the supplementary material.

Supported data parallel primitives include:
• reduce(op, x): Collective reduction of x using operator op. The returned value is the reduction result. op has to be associative, such as max, min and +;
• scan(op, x): Collective forward exclusive scan of x using associative operator op. Scan result overwrites x. The returned value is the reduction result as a byproduct;
• compact(list, src, keep, flag): Collective stream compaction. Each src whose keep is true is compacted and appended to list. flag specifies whether list should be cleared before appending;
• split(list, src, side, flag): Collective stream splitting. Every src is split according to side into two pieces, which are then appended to list, with the false piece preceding the true one. flag has similar semantic as in compact;
• sort_idx(key): Collective sorting and index returning. Let K_i be key in thread with rank i and r_i be sort_idx's return value. Then r_i satisfies K_{r_i} ≤ K_{r_j} for i < j;
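As a reading aid for the list above, here are sequential Python reference versions of these primitives. They only illustrate the semantics as I read them from the descriptions: position i of each input list stands for the value held by the thread with rank i, the function signatures are my own, and the identity argument of scan is an assumption added for the sketch (the BSGP form is just scan(op, x)).

```python
import operator
from functools import reduce as fold


def reduce(op, xs):
    """Combine every thread's x with an associative operator."""
    return fold(op, xs)


def scan(op, xs, identity):
    """Forward exclusive scan; returns (scanned values, total reduction)."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)        # thread i receives the combination of x[0..i-1]
        acc = op(acc, x)
    return out, acc


def compact(lst, src, keep, clear):
    """Append every src whose keep is true to lst, optionally clearing lst first."""
    if clear:
        lst.clear()
    lst.extend(s for s, k in zip(src, keep) if k)
    return lst


def split(lst, src, side, clear):
    """Append the false-side items first, then the true-side items."""
    if clear:
        lst.clear()
    lst.extend(s for s, sd in zip(src, side) if not sd)
    lst.extend(s for s, sd in zip(src, side) if sd)
    return lst


def sort_idx(keys):
    """Return r such that keys[r[i]] <= keys[r[j]] whenever i < j."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


print(scan(operator.add, [3, 1, 4, 1], 0))                             # ([0, 3, 4, 8], 9)
print(compact([], [10, 20, 30, 40], [True, False, True, True], True))  # [10, 30, 40]
print(sort_idx([5, 2, 9, 2]))                                          # [1, 3, 0, 2]
```

On the GPU each of these is a multi-kernel algorithm with barriers inside, which is exactly what makes them collective operations.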

Supported rank adjusting primitives include:
• thread.split(side): Split threads. The rank is reassigned such that a thread with a false side has a smaller rank than a thread with a true side. Relative rank order is preserved among threads of the same side;
• thread.sortby(key): Collective rank reassignment sorting. Let K_i be key in thread with rank i. Thread ranks are adjusted such that after thread.sortby returns, K_i ≤ K_j for all i < j. Relative rank order is preserved among threads with the same key, i.e., the sort is stable;

Supported thread manipulation primitives include:
• thread.kill(flag): Kill the calling thread if flag is true;
• thread.fork(n): Fork n child threads. All child threads inherit the parent’s local variables. A unique ID between 0 and n-1 is returned to each child thread. The parent thread no longer exists after fork;

Note that both fork and kill reassign resulting threads’ ranks to numbers in the range of 0...thread.size-1 while preserving parent threads’ relative rank order.
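The rank bookkeeping of split, kill, and fork is easy to emulate sequentially. In the Python sketch below, a thread is a dictionary of its local variables and its rank is simply its position in the list. The helper names and the child_id key are hypothetical, and passing a per-thread child count to thread_fork is my own assumption for the example.

```python
def thread_split(threads, side):
    """Stable partition: threads with a false side get the smaller ranks."""
    return ([t for t, s in zip(threads, side) if not s] +
            [t for t, s in zip(threads, side) if s])


def thread_kill(threads, flag):
    """Drop threads whose flag is true; survivors keep their relative order."""
    return [t for t, f in zip(threads, flag) if not f]


def thread_fork(threads, counts):
    """Replace each parent by counts[i] children that inherit its locals.

    Children of the same parent get IDs 0..n-1, and parents' rank order is preserved.
    """
    return [dict(t, child_id=c) for t, k in zip(threads, counts) for c in range(k)]


threads = [{"x": 10}, {"x": 20}, {"x": 30}]

print(thread_split(threads, [True, False, False]))
# [{'x': 20}, {'x': 30}, {'x': 10}]

print(thread_kill(threads, [False, True, False]))
# [{'x': 10}, {'x': 30}]

print(thread_fork(threads, [2, 1, 1]))
# [{'x': 10, 'child_id': 0}, {'x': 10, 'child_id': 1}, {'x': 20, 'child_id': 0}, {'x': 30, 'child_id': 0}]
```

In every case the new rank of a thread is its index in the returned list, so ranks always cover 0...size-1 without gaps.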