Halide Study Notes: Reading the Halide Tutorial Source, Part 18


Halide Tutorial Lesson 18


// Halide tutorial lesson 18: Factoring an associative reduction using rfactor

// This lesson demonstrates how to parallelize or vectorize an associative
// reduction using the scheduling directive 'rfactor'.

// On linux, you can compile and run it like so:
// g++ lesson_18*.cpp -g -I ../include -L ../bin -lHalide -lpthread -ldl -o lesson_18 -std=c++11
// LD_LIBRARY_PATH=../bin ./lesson_18

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

int main(int argc, char **argv) {
    // Declare some Vars to use below.
    Var x("x"), y("y"), i("i"), u("u"), v("v");

    // Create an input with random values.
    Buffer<uint8_t> input(8, 8, "input");
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            input(x, y) = (rand() % 256);
        }
    }

    {
        // As mentioned previously in lesson 9, parallelizing variables that
        // are part of a reduction domain is tricky, since there may be data
        // dependencies across those variables.

        // Consider the histogram example in lesson 9:
        Func histogram("hist_serial");
        histogram(i) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        histogram.vectorize(i, 8);
        histogram.realize(8);

        // See figures/lesson_18_hist_serial.mp4 for a visualization of
        // what this does.

        // We can vectorize the initialization of the histogram
        // buckets, but since there are data dependencies across r.x
        // and r.y in the update definition (i.e. the update refers to
        // values computed in previous iterations), we can't
        // parallelize or vectorize r.x or r.y without introducing a
        // race condition. The following code would produce an error:
        // histogram.update().parallel(r.y);
    }

    {
        // Note, however, that the histogram operation (which is a
        // kind of sum reduction) is associative. A common trick to
        // speed up associative reductions is to slice up the
        // reduction domain into smaller slices, compute a partial
        // result over each slice, and then merge the results. Since
        // the computation of each slice is independent, we can
        // parallelize over slices.

        // Going back to the histogram example, we slice the reduction
        // domain into rows by defining an intermediate function that
        // computes the histogram of each row independently:
        Func intermediate("intm_par_manual");
        intermediate(i, y) = 0;
        RDom rx(0, input.width());
        intermediate(input(rx, y) / 32, y) += 1;

        // We then define a second stage which sums those partial
        // results:
        Func histogram("merge_par_manual");
        histogram(i) = 0;
        RDom ry(0, input.height());
        histogram(i) += intermediate(i, ry);

        // Since the intermediate no longer has data dependencies
        // across the y dimension, we can parallelize it over y:
        intermediate.compute_root().update().parallel(y);

        // We can also vectorize the initializations.
        intermediate.vectorize(i, 8);
        histogram.vectorize(i, 8);

        histogram.realize(8);

        // See figures/lesson_18_hist_manual_par.mp4 for a visualization of
        // what this does.
    }

    {
        // This manual factorization of an associative reduction can
        // be tedious and bug-prone. Although it's fairly easy to do
        // manually for the histogram, it can get complex pretty fast,
        // especially if the RDom has a predicate (RDom::where), or
        // when the function reduces onto a multi-dimensional tuple.

        // Halide provides a way to do this type of factorization
        // through the scheduling directive 'rfactor'. rfactor splits
        // an associative update definition into an intermediate which
        // computes the partial results over slices of the reduction
        // domain, and replaces the current update definition with a
        // new definition which merges those partial results.

        // Using rfactor, we don't need to change the algorithm at all:
        Func histogram("hist_rfactor_par");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // The task of factoring the associative reduction is moved
        // into the schedule, via rfactor. rfactor takes as input a
        // list of <RVar, Var> pairs, which names the reduction
        // variables (RVars) to be made "parallelizable". In the
        // generated intermediate Func, all references to these
        // reduction variables are replaced with references to "pure"
        // variables (the Vars). Since, by construction, Vars are
        // race-condition free, the intermediate reduction is now
        // parallelizable across those dimensions. All reduction
        // variables not in the list are removed from the original
        // function and "lifted" to the intermediate.

        // To generate the same code as the manually-factored version,
        // we do the following:
        Func intermediate = histogram.update().rfactor({{r.y, y}});

        // We pass {r.y, y} as the argument to rfactor to make the
        // histogram parallelizable across the y dimension, similar to
        // the manually-factored version.
        intermediate.compute_root().update().parallel(y);

        // In the case where you are only slicing up the domain across
        // a single variable, you can actually drop the braces and
        // write the rfactor call the following way:
        // Func intermediate = histogram.update().rfactor(r.y, y);

        // Vectorize the initializations, as we did above.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        // It is important to note that rfactor (or reduction
        // factorization in general) only works for associative
        // reductions. Associative reductions have the nice property
        // that their results are the same no matter how the
        // computation is grouped (i.e. split into chunks). If rfactor
        // can't prove the associativity of a reduction, it will throw
        // an error.
        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_par.mp4 for a
        // visualization of what this does.

        // The equivalent C is:
        int c_intm[8][8];
        for (int y = 0; y < input.height(); y++) {
            for (int x = 0; x < 8; x++) {
                c_intm[y][x] = 0;
            }
        }
        /* parallel */ for (int y = 0; y < input.height(); y++) {
            for (int r_x = 0; r_x < input.width(); r_x++) {
                c_intm[y][input(r_x, y) / 32] += 1;
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int r_y = 0; r_y < input.height(); r_y++) {
                c_result[x] += c_intm[r_y][x];
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    {
        // Now that we can factor associative reductions with the
        // scheduling directive 'rfactor', we can explore various
        // factorization strategies using the schedule alone. Given
        // the same serial histogram code:
        Func histogram("hist_rfactor_vec");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // Instead of r.y, we rfactor on r.x this time to slice the
        // domain into columns.
        Func intermediate = histogram.update().rfactor(r.x, u);

        // Now that we're computing an independent histogram
        // per column, we can vectorize over columns.
        intermediate.compute_root().update().vectorize(u, 8);

        // Note that vectorizing the inner dimension changes the order
        // in which values are added to the final histogram buckets,
        // so this trick only works if the reduction is associative
        // *and* commutative. rfactor will attempt to prove these
        // properties hold and will throw an error if it can't.

        // Vectorize the initializations.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_vec.mp4 for a
        // visualization of what this does.

        // The equivalent C is:
        int c_intm[8][8];
        for (int u = 0; u < input.width(); u++) {
            for (int x = 0; x < 8; x++) {
                c_intm[u][x] = 0;
            }
        }
        for (int r_y = 0; r_y < input.height(); r_y++) {
            for (int u = 0; u < input.width() / 8; u++) {
                /* vectorize */ for (int u_i = 0; u_i < 8; u_i++) {
                    c_intm[u*8 + u_i][input(u*8 + u_i, r_y) / 32] += 1;
                }
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int r_x = 0; r_x < input.width(); r_x++) {
                c_result[x] += c_intm[r_x][x];
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    {
        // We can also slice a reduction domain up over multiple
        // dimensions at once. This time, we'll compute partial
        // histograms over tiles of the domain.
        Func histogram("hist_rfactor_tile");
        histogram(x) = 0;
        RDom r(0, input.width(), 0, input.height());
        histogram(input(r.x, r.y) / 32) += 1;

        // We first split both r.x and r.y by a factor of four.
        RVar rx_outer("rx_outer"), rx_inner("rx_inner");
        RVar ry_outer("ry_outer"), ry_inner("ry_inner");
        histogram.update()
            .split(r.x, rx_outer, rx_inner, 4)
            .split(r.y, ry_outer, ry_inner, 4);

        // We now call rfactor to make an intermediate function that
        // independently computes a histogram of each tile.
        Func intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});

        // We can now parallelize the intermediate over tiles.
        intermediate.compute_root().update().parallel(u).parallel(v);

        // We also reorder the tile indices outermost to give the
        // classic tiled traversal.
        intermediate.update().reorder(rx_inner, ry_inner, u, v);

        // Vectorize the initializations.
        intermediate.vectorize(x, 8);
        histogram.vectorize(x, 8);

        Buffer<int> halide_result = histogram.realize(8);

        // See figures/lesson_18_hist_rfactor_tile.mp4 for a visualization of
        // what this does.

        // The equivalent C is (with the 4x4 tiles implied by the splits above):
        int c_intm[2][2][8];
        for (int v = 0; v < input.height() / 4; v++) {
            for (int u = 0; u < input.width() / 4; u++) {
                for (int x = 0; x < 8; x++) {
                    c_intm[v][u][x] = 0;
                }
            }
        }
        /* parallel */ for (int v = 0; v < input.height() / 4; v++) {
            /* parallel */ for (int u = 0; u < input.width() / 4; u++) {
                for (int ry_inner = 0; ry_inner < 4; ry_inner++) {
                    for (int rx_inner = 0; rx_inner < 4; rx_inner++) {
                        c_intm[v][u][input(u*4 + rx_inner, v*4 + ry_inner) / 32] += 1;
                    }
                }
            }
        }

        int c_result[8];
        for (int x = 0; x < 8; x++) {
            c_result[x] = 0;
        }
        for (int x = 0; x < 8; x++) {
            for (int ry_outer = 0; ry_outer < input.height() / 4; ry_outer++) {
                for (int rx_outer = 0; rx_outer < input.width() / 4; rx_outer++) {
                    c_result[x] += c_intm[ry_outer][rx_outer][x];
                }
            }
        }

        // Check the answers agree:
        for (int x = 0; x < 8; x++) {
            if (c_result[x] != halide_result(x)) {
                printf("halide_result(%d) = %d instead of %d\n",
                       x, halide_result(x), c_result[x]);
                return -1;
            }
        }
    }

    printf("Success!\n");

    return 0;
}
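The same single-variable form of rfactor applies to associative reductions other than histograms. Below is a small standalone sketch of my own (not part of the tutorial source; the 8x8 buffer and its fill values are made up for illustration) that factors a whole-image sum along r.y so the per-row partial sums can run in parallel:

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

int main() {
    Var yo("yo");

    // Hypothetical example data: a small 8x8 input to sum over.
    Buffer<uint8_t> in(8, 8);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            in(x, y) = (uint8_t)((x + 3 * y) % 256);
        }
    }

    // A zero-dimensional sum reduction over the whole image.
    Func total("total");
    total() = cast<uint32_t>(0);
    RDom r(0, in.width(), 0, in.height());
    total() += cast<uint32_t>(in(r.x, r.y));

    // Factor the reduction: the intermediate holds one partial sum per
    // row (the new pure var yo stands in for r.y), so those partial
    // sums can be computed in parallel before being merged.
    Func partial = total.update().rfactor(r.y, yo);
    partial.compute_root().update().parallel(yo);

    Buffer<uint32_t> result = total.realize();
    printf("sum = %u\n", result());
    return 0;
}

As in the histogram examples, rfactor rewrites the update definition of total into a merge over the intermediate's partial results, so the algorithm itself stays untouched.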

This lesson shows how to use rfactor to factor an associative reduction domain so it can be parallelized or vectorized.

1. Slice along y and parallelize:
Func intermediate = histogram.update().rfactor(r.y, y);
intermediate.compute_root().update().parallel(y);

2. Slice along x and vectorize:
Func intermediate = histogram.update().rfactor(r.x, u);
intermediate.compute_root().update().vectorize(u, 8);

3. Slice into tiles:
histogram.update().split(r.x, rx_outer, rx_inner, 4).split(r.y, ry_outer, ry_inner, 4);
Func intermediate = histogram.update().rfactor({{rx_outer, u}, {ry_outer, v}});
intermediate.compute_root().update().parallel(u).parallel(v);
intermediate.update().reorder(rx_inner, ry_inner, u, v);
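To see what these schedules actually produce, it helps to print the loop nest. The following is a minimal sketch of my own (not from the tutorial; it assumes the same Halide setup as the compile command at the top of the lesson and uses a made-up 8x8 input) that applies strategy 2 and calls print_loop_nest() before realizing:

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

int main() {
    Var x("x"), u("u");

    // Hypothetical 8x8 input, filled deterministically for the example.
    Buffer<uint8_t> input(8, 8);
    for (int yy = 0; yy < 8; yy++) {
        for (int xx = 0; xx < 8; xx++) {
            input(xx, yy) = (uint8_t)((7 * xx + 13 * yy) % 256);
        }
    }

    // Same histogram algorithm as in the lesson.
    Func histogram("histogram");
    histogram(x) = 0;
    RDom r(0, input.width(), 0, input.height());
    histogram(input(r.x, r.y) / 32) += 1;

    // Strategy 2: slice the domain into columns and vectorize across them.
    Func intermediate = histogram.update().rfactor(r.x, u);
    intermediate.compute_root().update().vectorize(u, 8);
    intermediate.vectorize(x, 8);
    histogram.vectorize(x, 8);

    // Print the loop nest to confirm the factored intermediate is
    // vectorized over the new pure variable u.
    histogram.print_loop_nest();

    Buffer<int> result = histogram.realize(8);
    for (int b = 0; b < 8; b++) {
        printf("bucket %d: %d\n", b, result(b));
    }
    return 0;
}

The printed loop nest should show the intermediate computed at the root level with a vectorized u loop inside its update, followed by the merge stage that sums the per-column partial histograms.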
