Halide学习笔记----Halide tutorial源码阅读18

来源：互联网发布：知柏地黄丸哪家好编辑：程序博客网时间：2024/06/08 19:19
Halide入门教程18

// Halide tutorial lesson 18: Factoring an associative reduction using rfactor// Halide教程第18课：用r因子将有关联的约减（约减区域）进行分解// This lesson demonstrates how to parallelize or vectorize an associative// reduction using the scheduling directive 'rfactor'.// 本课展示如何用r因子对有关联的约减（约减区域）进行并行化和向量化// On linux, you can compile and run it like so:// g++ lesson_18*.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o lesson_18 -std=c++11// LD_LIBRARY_PATH=../bin ./lesson_18#include "Halide.h"#include <stdio.h>using namespace Halide;int main(int argc, char **argv) {    // Declare some Vars to use below.    Var x("x"), y("y"), i("i"), u("u"), v("v");    // Create an input with random values.    Buffer<uint8_t> input(8, 8, "input");    for (int y = 0; y < 8; ++y) {        for (int x = 0; x < 8; ++x) {            input(x, y) = (rand() % 256);        }    }    {        // As mentioned previously in lesson 9, parallelizing variables that        // are part of a reduction domain is tricky, since there may be data        // dependencies across those variables.        // 如同第9课所提到的那样，约减区域内的变量进行并行处理需要很高的技巧性，因为数据之间可能存在依赖关系        // Consider the histogram example in lesson 9:        Func histogram("hist_serial");        histogram(i) = 0;        RDom r(0, input.width(), 0, input.height());        histogram(input(r.x, r.y) / 32) += 1;        histogram.vectorize(i, 8);        histogram.realize(8);        // See figures/lesson_18_hist_serial.mp4 for a visualization of        // what this does.        // We can vectorize the initialization of the histogram        // buckets, but since there are data dependencies across r.x        // and r.y in the update definition (i.e. the update refers to        // value computed in the previous iteration), we can't        // parallelize or vectorize r.x or r.y without introducing a        // race condition. The following code would produce an error:        // histogram.update().parallel(r.y);        // 我们可以在初始化阶段将histogram向量化，但是在更新阶段有有数据依赖性，在没有引入竞争条件约束时        // 不能多r.x或者r.y进行向量化或者并行化。    }    {        // Note, however, that the histogram operation (which is a        // kind of sum reduction) is associative. A common trick to        // speed-up associative reductions is to slice up the        // reduction domain into smaller slices, compute a partial        // result over each slice, and then merge the results. Since        // the computation of each slice is independent, we can        // parallelize over slices.        // 既然直方图操作是具有关联的。一个常用的加速小技巧就是将有关联的区域切分成小的碎片，在每一个小切片        // 上做统计，然后汇总所有结果。由于每个小切片是相互独立的，因而我们可以在切片层级进行并行化。        // Going back to the histogram example, we slice the reduction        // domain into rows by defining an intermediate function that        // computes the histogram of each row independently:        // 沿着行方向切片，然后每一行的统计就相互独立了。这是第一步，切片划分        Func intermediate("intm_par_manual");        intermediate(i, y) = 0;        RDom rx(0, input.width());        intermediate(input(rx, y) / 32, y) += 1;        // We then define a second stage which sums those partial        // results:        // 然后定义第二阶段，加上每个切片的结果。这是第二步，汇总结果。        Func histogram("merge_par_manual");        histogram(i) = 0;        RDom ry(0, input.height());        histogram(i) += intermediate(i, ry);        // Since the intermediate no longer has data dependencies        // across the y dimension, we can parallelize it over y:        // 由于中间变量不再有数据的依赖性，因此可以沿着y方向并行化。        intermediate.compute_root().update().parallel(y);        // We can also vectorize the initializations.        // 初始化过程可以向量化        intermediate.vectorize(i, 8);        histogram.vectorize(i, 8);        histogram.realize(8);        // See figures/lesson_18_hist_manual_par.mp4 for a visualization of        // what this does.    }    {        // This manual factorization of an associative reduction can        // be tedious and bug-prone. Although it's fairly easy to do        // manually for the histogram, it can get complex pretty fast,        // especially if the RDom may has a predicate (RDom::where),        // or when the function reduces onto a multi-dimensional        // tuple.        // Halide provides a way to do this type of factorization        // through the scheduling directive 'rfactor'. rfactor splits        // an associative update definition into an intermediate which        // computes the partial results over slices of a reduction        // domain and replaces the current update definition with a        // new definition which merges those partial results.        // Halide提供一种做这种分解方法的调度指令rfactor。rfactor将有依赖更新区域分解成一系列的小切片，        // 对小切片分别进行处理后，将所有结果汇总起来，得到整体的结果。        // Using rfactor, we don't need to change the algorithm at all:        // 使用rfactor时，不需要改变算法描述部分。        Func histogram("hist_rfactor_par");        histogram(x) = 0;        RDom r(0, input.width(), 0, input.height());        histogram(input(r.x, r.y) / 32) += 1;        // The task of factoring of associative reduction is moved        // into the schedule, via rfactor. rfactor takes as input a        // list of <RVar, Var> pairs, which contains list of reduction        // variables (RVars) to be made "parallelizable". In the        // generated intermediate Func, all references to this        // reduction variables are replaced with references to "pure"        // variables (the Vars). Since, by construction, Vars are        // race-condition free, the intermediate reduction is now        // parallelizable across those dimensions. All reduction        // variables not in the list are removed from the original        // function and "lifted" to the intermediate.        // 通过rfactor对有关联的约减任务分解被移动到调度部分了。rfactor将一个<RVar, Var>对作为输入，        // 这里RVar是将要并行化的RDom变量，Var是一个可以并行化的纯Var变量。在中间函数里，所有的约减变量        // 被纯Var变量替代了。因而这样处理的中间函数可以在新的维度上进行并行化。        // To generate the same code as the manually-factored version,        // we do the following:        // 为了生成和手工分解一样的版本的代码，按照如下方式进行划分。        Func intermediate = histogram.update().rfactor({{r.y, y}});        // We pass {r.y, y} as the argument to rfactor to make the        // histogram parallelizable across the y dimension, similar to        // the manually-factored version.        // 沿y方向进行并行。        intermediate.compute_root().update().parallel(y);        // In the case where you are only slicing up the domain across        // a single variable, you can actually drop the braces and        // write the rfactor the following way.        // 在只有一个维度进行切分的情况下，可以缺掉大括号，按照如下方式进行切分。        // Func intermediate = histogram.update().rfactor(r.y, y);        // Vectorize the initializations, as we did above.        // 初始化过程向量化        intermediate.vectorize(x, 8);        histogram.vectorize(x, 8);        // It is important to note that rfactor (or reduction        // factorization in general) only works for associative        // reductions. Associative reductions have the nice property        // that their results are the same no matter how the        // computation is grouped (i.e. split into chunks). If rfactor        // can't prove the associativity of a reduction, it will throw        // an error.        // rfactor只有在有依赖的RDom起作用。有依赖的RDom有很好的性质，不管如何拆分，最后汇总时的结果是一致        // 的，如果rfactor不能约减去有的关联性，它将抛出异常。        Buffer<int> halide_result = histogram.realize(8);        // See figures/lesson_18_hist_rfactor_par.mp4 for a        // visualization of what this does.        // The equivalent C is:        int c_intm[8][8];        for (int y = 0; y < input.height(); y++) {            for (int x = 0; x < 8; x++) {                c_intm[y][x] = 0;            }        }        /* parallel */ for (int y = 0; y < input.height(); y++) {            for (int r_x = 0; r_x < input.width(); r_x++) {                c_intm[y][input(r_x, y) / 32] += 1;            }        }        int c_result[8];        for (int x = 0; x < 8; x++) {            c_result[x] = 0;        }        for (int x = 0; x < 8; x++) {            for (int r_y = 0; r_y < input.height(); r_y++) {                c_result[x] += c_intm[r_y][x];            }        }        // Check the answers agree:        for (int x = 0; x < 8; x++) {            if (c_result[x] != halide_result(x)) {                printf("halide_result(%d) = %d instead of %d\n",                       x, halide_result(x), c_result[x]);                return -1;            }        }    }    {        // Now that we can factor associative reductions with the        // scheduling directive 'rfactor', we can explore various        // factorization strategies using the schedule alone. Given        // the same serial histogram code:        // 探索其他的拆分策略        Func histogram("hist_rfactor_vec");        histogram(x) = 0;        RDom r(0, input.width(), 0, input.height());        histogram(input(r.x, r.y) / 32) += 1;        // Instead of r.y, we rfactor on r.x this time to slice the        // domain into columns.        // 沿x方向进行切片，将图像切成列形式的。        Func intermediate = histogram.update().rfactor(r.x, u);        // Now that we're computing an independent histogram        // per-column, we can vectorize over columns.        // 我们可以在列方向上向量化。        intermediate.compute_root().update().vectorize(u, 8);        // Note that since vectorizing the inner dimension changes the        // order in which values are added to the final histogram        // buckets computations, so this trick only works if the        // associative reduction is associative *and*        // commutative. rfactor will attempt to prove these properties        // hold and will throw an error if it can't.        // 由于内层循环向量化会改变最后直方图汇总的顺序，因此这个小机器只在约减区域是累积求和型的约减起作用。        // 如果不满足这种情况，那么rfactor会抛出错误。        // Vectorize the initializations.        intermediate.vectorize(x, 8);        histogram.vectorize(x, 8);        Buffer<int> halide_result = histogram.realize(8);        // See figures/lesson_18_hist_rfactor_vec.mp4 for a        // visualization of what this does.        // The equivalent C is:        int c_intm[8][8];        for (int u = 0; u < input.width(); u++) {            for (int x = 0; x < 8; x++) {                c_intm[u][x] = 0;            }        }        for (int r_y = 0; r_y < input.height(); r_y++) {            for (int u = 0; u < input.width() / 8; u++) {                /* vectorize */ for (int u_i = 0; u_i < 8; u_i++) {                    c_intm[u*4 + u_i][input(u*8 + u_i, r_y) / 32] += 1;                }            }        }        int c_result[8];        for (int x = 0; x < 8; x++) {            c_result[x] = 0;        }        for (int x = 0; x < 8; x++) {            for (int r_x = 0; r_x < input.width(); r_x++) {                c_result[x] += c_intm[r_x][x];            }        }        // Check the answers agree:        for (int x = 0; x < 8; x++) {            if (c_result[x] != halide_result(x)) {                printf("halide_result(%d) = %d instead of %d\n",                       x, halide_result(x), c_result[x]);                return -1;            }        }    }    {        // We can also slice a reduction domain up over multiple        // dimensions at once. This time, we'll compute partial        // histograms over tiles of the domain.        // 在多个维度上同时切片。以tile形式计算每一个tile的直方图，然后汇总统计        Func histogram("hist_rfactor_tile");        histogram(x) = 0;        RDom r(0, input.width(), 0, input.height());        histogram(input(r.x, r.y) / 32) += 1;        // We first split both r.x and r.y by a factor of four.        RVar rx_outer("rx_outer"), rx_inner("rx_inner");        RVar ry_outer("ry_outer"), ry_inner("ry_inner");        histogram.update()            .split(r.x, rx_outer, rx_inner, 4)            .split(r.y, ry_outer, ry_inner, 4);        // We now call rfactor to make an intermediate function that        // independently computes a histogram of each tile.        // 调用rfactor产生一个中间行数，独立地计算每一个tile的直方图。        Func intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});        // We can now parallelize the intermediate over tiles.        // 我们可以在中间tile的中间函数上进行并行化。        intermediate.compute_root().update().parallel(u).parallel(v);        // We also reorder the tile indices outermost to give the        // classic tiled traversal.        // reorder每个tile的外层循环，按照经典的tile模式进行遍历。        intermediate.update().reorder(rx_inner, ry_inner, u, v);        // Vectorize the initializations.        // 初始化过程向量化。        intermediate.vectorize(x, 8);        histogram.vectorize(x, 8);        Buffer<int> halide_result = histogram.realize(8);        // See figures/lesson_18_hist_rfactor_tile.mp4 for a visualization of        // what this does.        // The equivalent C is:        int c_intm[4][4][8];        for (int v = 0; v < input.height() / 2; v++) {            for (int u = 0; u < input.width() / 2; u++) {                for (int x = 0; x < 8; x++) {                    c_intm[v][u][x] = 0;                }            }        }        /* parallel */ for (int v = 0; v < input.height() / 2; v++) {            /* parallel */ for (int u = 0; u < input.width() / 2; u++) {                for (int ry_inner = 0; ry_inner < 2; ry_inner++) {                    for (int rx_inner = 0; rx_inner < 2; rx_inner++) {                        c_intm[v][u][input(u*2 + rx_inner, v*2 + ry_inner) / 32] += 1;                    }                }            }        }        int c_result[8];        for (int x = 0; x < 8; x++) {            c_result[x] = 0;        }        for (int x = 0; x < 8; x++) {            for (int ry_outer = 0; ry_outer < input.height() / 2; ry_outer++) {                for (int rx_outer = 0; rx_outer < input.width() / 2; rx_outer++) {                    c_result[x] += c_intm[ry_outer][rx_outer][x];                }            }        }        // Check the answers agree:        for (int x = 0; x < 8; x++) {            if (c_result[x] != halide_result(x)) {                printf("halide_result(%d) = %d instead of %d\n",                       x, halide_result(x), c_result[x]);                return -1;            }        }    }    printf("Success!\n");    return 0;}
本节主要讲解通过rfactor对约减区域进行拆分并行化。
1.沿y方向拆分
intermediate = histogram.update().rfactor(r.y, y);
intermediate.compute_root().update().parallel(y);
2.沿x方向拆分，并向量化
intermediate = histogram.update().rfactor(r.x, u);
intermediate.compute_root().update().vectorize(u, 8);
3.拆分成tile
histogram.update().split(r.x, rx_outer, rx_inner, 4).split(r.y, ry_outer, ry_inner, 4);
intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});
intermediate.compute_root().update().parallel(u).parallel(v);
intermediate.update().reorder(rx_inner, ry_inner, u, v);
阅读全文
'); })();