Halide Study Notes: Reading the Halide Tutorial Source, Lesson 18
Source: Internet. Editor: 程序博客网. Date: 2024/06/08 19:19
Halide Introductory Tutorial 18
// Halide tutorial lesson 18: Factoring an associative reduction using rfactor

// This lesson demonstrates how to parallelize or vectorize an associative
// reduction using the scheduling directive 'rfactor'.

// On linux, you can compile and run it like so:
// g++ lesson_18*.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o lesson_18 -std=c++11
// LD_LIBRARY_PATH=../bin ./lesson_18

#include "Halide.h"
#include <stdio.h>
#include <stdlib.h>

using namespace Halide;

int main(int argc, char **argv) {
    // Declare some Vars to use below.
    Var x("x"), y("y"), i("i"), u("u"), v("v");

    // Create an input with random values.
    Buffer<uint8_t> input(8, 8, "input");
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            input(x, y) = (rand() % 256);
        }
    }

    {
        // As mentioned previously in lesson 9, parallelizing variables that
        // are part of a reduction domain is tricky, since there may be data
        // dependencies across those variables.

        // Consider the histogram example in lesson 9:
        Func histogram("hist_serial");
        histogram(i) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        histogram.vectorize(i, 8);
        histogram.realize(8);

        // See figures/lesson_18_hist_serial.mp4 for a visualization of
        // what this does.

        // We can vectorize the initialization of the histogram
        // buckets, but since there are data dependencies across r.x
        // and r.y in the update definition (i.e. the update refers to
        // a value computed in the previous iteration), we can't
        // parallelize or vectorize r.x or r.y without introducing a
        // race condition. The following code would produce an error:
        // histogram.update().parallel(r.y);
    }

    {
        // Note, however, that the histogram operation (which is a
        // kind of sum reduction) is associative. A common trick to
        // speed up associative reductions is to slice up the
        // reduction domain into smaller slices, compute a partial
        // result over each slice, and then merge the results. Since
        // the computation of each slice is independent, we can
        // parallelize over slices.

        // Going back to the histogram example, we slice the reduction
        // domain into rows by defining an intermediate function that
        // computes the histogram of each row independently:
        // Step 1 (slice): each row's histogram is independent of the others.
        Func intermediate("intm_par_manual");
        intermediate(i, y) = 0;
        RDom rx(0, input.width());
        intermediate(input(rx, y) / 32, y) += 1;

        // We then define a second stage which sums those partial
        // results:
        // Step 2 (merge): sum the per-row results.
        Func histogram("merge_par_manual");
        histogram(i) = 0;
        RDom ry(0, input.height());
        histogram(i) += intermediate(i, ry);

        // Since the intermediate no longer has data dependencies
        // across the y dimension, we can parallelize it over y:
        intermediate.compute_root().update().parallel(y);

        // We can also vectorize the initializations.
        intermediate.vectorize(i, 8);
        histogram.vectorize(i, 8);

        histogram.realize(8);

        // See figures/lesson_18_hist_manual_par.mp4 for a visualization of
        // what this does.
    }

    {
        // This manual factorization of an associative reduction can
        // be tedious and bug-prone. Although it's fairly easy to do
        // manually for the histogram, it can get complex pretty fast,
        // especially if the RDom has a predicate (RDom::where),
        // or when the function reduces onto a multi-dimensional
        // tuple.

        // Halide provides a way to do this type of factorization
        // through the scheduling directive 'rfactor'. rfactor splits
        // an associative update definition into an intermediate which
        // computes the partial results over slices of a reduction
        // domain and replaces the current update definition with a
        // new definition which merges those partial results.

        // Using rfactor, we don't need to change the algorithm at all:
        Func histogram("hist_rfactor_par");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // The task of factoring the associative reduction is moved
        // into the schedule, via rfactor. rfactor takes as input a
        // list of <RVar, Var> pairs, which contains the list of reduction
        // variables (RVars) to be made "parallelizable". In the
        // generated intermediate Func, all references to these
        // reduction variables are replaced with references to "pure"
        // variables (the Vars). Since, by construction, Vars are
        // race-condition free, the intermediate reduction is now
        // parallelizable across those dimensions. All reduction
        // variables not in the list are removed from the original
        // function and "lifted" to the intermediate.
        // In short: each listed RVar becomes a pure Var in the
        // intermediate Func, which adds a dimension that is safe to
        // parallelize.

        // To generate the same code as the manually-factored version,
        // we do the following:
        Func intermediate = histogram.update().rfactor({{r.y, y}});

        // We pass {r.y, y} as the argument to rfactor to make the
        // histogram parallelizable across the y dimension, similar to
        // the manually-factored version.
        intermediate.compute_root().update().parallel(y);

        // In the case where you are only slicing up the domain across
        // a single variable, you can actually drop the braces and
        // write the rfactor the following way.
        // Func intermediate = histogram.update().rfactor(r.y, y);

        // Vectorize the initializations, as we did above.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        // It is important to note that rfactor (or reduction
        // factorization in general) only works for associative
        // reductions. Associative reductions have the nice property
        // that their results are the same no matter how the
        // computation is grouped (i.e. split into chunks). If rfactor
        // can't prove the associativity of a reduction, it will throw
        // an error.

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_par.mp4 for a
        // visualization of what this does.

        // The equivalent C is:
        int c_intm[8][8];
        for (int y = 0; y < input.height(); y++) {
            for (int x = 0; x < 8; x++) {
                c_intm[y][x] = 0;
            }
        }
        /* parallel */ for (int y = 0; y < input.height(); y++) {
            for (int r_x = 0; r_x < input.width(); r_x++) {
                c_intm[y][input(r_x, y) / 32] += 1;
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int r_y = 0; r_y < input.height(); r_y++) {
                c_result[x] += c_intm[r_y][x];
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    {
        // Now that we can factor associative reductions with the
        // scheduling directive 'rfactor', we can explore various
        // factorization strategies using the schedule alone. Given
        // the same serial histogram code:
        Func histogram("hist_rfactor_vec");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // Instead of r.y, we rfactor on r.x this time to slice the
        // domain into columns.
        Func intermediate = histogram.update().rfactor(r.x, u);

        // Now that we're computing an independent histogram
        // per-column, we can vectorize over columns.
        intermediate.compute_root().update().vectorize(u, 8);

        // Note that vectorizing the inner dimension changes the
        // order in which values are added to the final histogram
        // buckets, so this trick only works if the reduction is
        // associative *and* commutative. rfactor will attempt to
        // prove these properties hold and will throw an error if
        // it can't.

        // Vectorize the initializations.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_vec.mp4 for a
        // visualization of what this does.

        // The equivalent C is:
        int c_intm[8][8];
        for (int u = 0; u < input.width(); u++) {
            for (int x = 0; x < 8; x++) {
                c_intm[u][x] = 0;
            }
        }
        for (int r_y = 0; r_y < input.height(); r_y++) {
            for (int u = 0; u < input.width() / 8; u++) {
                /* vectorize */ for (int u_i = 0; u_i < 8; u_i++) {
                    c_intm[u*8 + u_i][input(u*8 + u_i, r_y) / 32] += 1;
                }
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int r_x = 0; r_x < input.width(); r_x++) {
                c_result[x] += c_intm[r_x][x];
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    {
        // We can also slice a reduction domain up over multiple
        // dimensions at once. This time, we'll compute partial
        // histograms over tiles of the domain.
        Func histogram("hist_rfactor_tile");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // We first split both r.x and r.y by a factor of four.
        RVar rx_outer("rx_outer"), rx_inner("rx_inner");
        RVar ry_outer("ry_outer"), ry_inner("ry_inner");
        histogram.update()
            .split(r.x, rx_outer, rx_inner, 4)
            .split(r.y, ry_outer, ry_inner, 4);

        // We now call rfactor to make an intermediate function that
        // independently computes a histogram of each tile.
        Func intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});

        // We can now parallelize the intermediate over tiles.
        intermediate.compute_root().update().parallel(u).parallel(v);

        // We also reorder the tile indices outermost to give the
        // classic tiled traversal.
        intermediate.update().reorder(rx_inner, ry_inner, u, v);

        // Vectorize the initializations.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_tile.mp4 for a visualization of
        // what this does.

        // The equivalent C is (a split factor of 4 on an 8x8 input
        // gives 2x2 tiles of 4x4 pixels each):
        int c_intm[2][2][8];
        for (int v = 0; v < input.height() / 4; v++) {
            for (int u = 0; u < input.width() / 4; u++) {
                for (int x = 0; x < 8; x++) {
                    c_intm[v][u][x] = 0;
                }
            }
        }
        /* parallel */ for (int v = 0; v < input.height() / 4; v++) {
            /* parallel */ for (int u = 0; u < input.width() / 4; u++) {
                for (int ry_inner = 0; ry_inner < 4; ry_inner++) {
                    for (int rx_inner = 0; rx_inner < 4; rx_inner++) {
                        c_intm[v][u][input(u*4 + rx_inner, v*4 + ry_inner) / 32] += 1;
                    }
                }
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int ry_outer = 0; ry_outer < input.height() / 4; ry_outer++) {
                for (int rx_outer = 0; rx_outer < input.width() / 4; rx_outer++) {
                    c_result[x] += c_intm[ry_outer][rx_outer][x];
                }
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    printf("Success!\n");

    return 0;
}
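This lesson mainly explains how to use rfactor to split a reduction domain so that the reduction can be parallelized.
1. Factor along the y direction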
intermediate = histogram.update().rfactor(r.y, y);
intermediate.compute_root().update().parallel(y);
2. Factor along the x direction and vectorize
intermediate = histogram.update().rfactor(r.x, u);
intermediate.compute_root().update().vectorize(u, 8);
3. Factor into tiles
histogram.update().split(r.x, rx_outer, rx_inner, 4).split(r.y, ry_outer, ry_inner, 4);
intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});
intermediate.compute_root().update().parallel(u).parallel(v);
intermediate.update().reorder(rx_inner, ry_inner, u, v);