bounding box回归的原理学习——yoloV1

来源：互联网发布：雷锋的故事知乎编辑：程序博客网时间：2024/06/06 03:59

参考：
https://zhuanlan.zhihu.com/p/25236464
http://blog.csdn.net/williamyi96/article/details/77530948
http://blog.csdn.net/zijin0802034/article/details/77685438
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html

一、逻辑理解

从逻辑上说明对bbox回归的原理的理解。
之前觉得bbox的回归是一个很难理解的地方：这些回归的坐标数字，依据在哪里？
其实回归的输入并不是这些预测的坐标数字，而是预测的坐标对应的feature map中的内容，这个内容与相对于ground truth进行对比，计算，是回归的根本依据。
通过不断的训练，得到了回归的参数，在预测时，网络产生了图像的feature map，对于任意一个预测框，背后对应了一个feature 区域，将学习到的参数与此feature区域进行运算，就会得到调整预测框的参数了。因为同类物体的特征应当相似。

二、yoloＶ1论文学习

(1) resizes the input image to 448 × 448,
(2) runs a single convolutional network on the image, and
(3) thresholds the resulting detections by the model’s confidence.

A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

对于输入的图片，resize到网络的输入所需要的大小，经过多重卷积处理，得到了7*7大小的feature map。里面的7x7的每个格子就是grid cell。
每个grid cell会负责预测B个bounding box（问题：怎么预测出来的？），以及对于每个类的条件概率。每个bounding box的预测内容包括有：box的信息：x,y,w,h，以及本bounding box包含有某类物体的置信度。含义如下：
—-x,y是相对于grid cell的左上角来说的box的中心位置，w,h是box的宽高，相对于整张图片的款高。
—-confidence的含义：如果这个box包含了任何一种物体，则定义为该box与ground truth中的任何一个box的iou的大小，如果该box不包含任何一个物体，则confidence为0。
—-每个grid cell还会预测C个不同类别的条件概率。即如果这个grid cell是包含有物体的，那么对于可能包含的不同物体的概率情况。
假设每个grid cell会预测2个box，需要检测的物体有20类，那么最后生成的向量size是7x7x(5*2+20)=1470。

loss function如下：
这里写图片描述

训练过程中的回归是如何进行的呢？
基本的逻辑就是对坐标+iou误差、分类误差进行修正。
在训练的时候，每个样本的ground truth的box是知道的，对应于相当于下采样后的的feature map，也能找到在feature map中对应的格子的位置。如果一个grid cell是该物体的中心位置，则这个grid cell负责预测这个物体;每个grid cell都预测了若干个box，其中会有一个box具有与ground truth的某个box有最大的iou，则由这个box负责调整box的位置（另外的预测的box就忽略了）。
可以想象，一开始的时候，权重是随机的，面对抽象出来的图像的feature信息，其预测的box的信息、confidence的信息、物体类别的条件概率都是随机的，经过了loss函数进行惩罚与优化，使得预测行为越来越好。取极端情况为例子加深理解：总是用同一张图片做训练，每次得到的图像的feature map都是一样的。一开始的时候，权重随机，权重与某个grid cell的运算的结果是随机，即每个grid cell对应的5*b+C的数据是随机的。网络对此情况进行优化，指导权重向某个方向优化。第二次看到还是一样的feature map，新的权重与某个grid cell对应的数据进行计算，就会得到更好的box位置和confidence，物体类别的信息。如此反复进行，直到权重参数调整到这样的情况，面对同样的grid cell，其计算出来的box的信息完全与ground box的位置符合，confidence=1，正确类别的条件概率为1。

问题：
可以看到修正box的时候，是针对训练的时候，权重与训练图片中的feature map的相应部分进行运算来进行调整的，即：如果在实际应用中，有一个物体，其与训练的图片中的物体很像，那么可以进行好的预测，如果是以一个没有见过的角度去看，那么由于训练时没有对这个角度的feature进行计算与优化，就不会预测好，跟遇到一个类别之外的物体没差别。所以，增加训练的样本的多角度很有必要。
同时一个grid cell虽然预测若干个box，但只会由那个与ground truth的任何box有最大交集的预测box来负责预测这个grid cell的情况，那么对于比较密集挨在一起物体的情况就会不能预测的现象出现：两只鸟靠的很近，它们的两个ground truth的box，都与某个grid cell有最大的iou，与其他的grid cell都没有相交，但因为这里有两只鸟，所以，这个gridcell预测的若干个box中，总会与两只鸟分别对应的两个ground truth的box的其中一个有更大的iou值，就会选到这个预测box，负责预测其中的一只鸟，对于另一只鸟，就会忽略。而由于鸟比较小，与其他的grid cell也没有相交，所以这另外的鸟就给忽略掉了，检测就会检测不出来。

yolov1源码学习

分析detection_layer.c

void forward_detection_layer(const detection_layer l, network net){    int locations = l.side*l.side;//7x7,side 是detection 层的参数    int i,j;    memcpy(l.output, net.input, l.outputs*l.batch*sizeof(float));    //if(l.reorg) reorg(l.output, l.w*l.h, size*l.n, l.batch, 1);    int b;    if (l.softmax){        for(b = 0; b < l.batch; ++b){            int index = b*l.inputs;            for (i = 0; i < locations; ++i) {                int offset = i*l.classes;                softmax(l.output + index + offset, l.classes, 1, 1,                        l.output + index + offset);            }        }    }    //l.output的组织方式是：    //分成3块：第一块统一放置classes的信息，第二块统一放置每个box的confidence信息，    //第三块统一放置box的坐标信息    if(net.train){        float avg_iou = 0;        float avg_cat = 0;        float avg_allcat = 0;        float avg_obj = 0;        float avg_anyobj = 0;        int count = 0;        *(l.cost) = 0;        int size = l.inputs * l.batch;        memset(l.delta, 0, size * sizeof(float));        //一次训练同时处理batch张图片        for (b = 0; b < l.batch; ++b){            //某一张图片的开始,上面的make_detection_layer方法的assert方法            //验证了inputs的情况            int index = b*l.inputs;            //每张图片到最后的feature map大小是locations=S*S,如（7×7个）            for (i = 0; i < locations; ++i) {                //每个grid cell对应的truth的信息：1个confidence，4个坐标信息，相应的类别信息                int truth_index = (b*locations + i)*(1+l.coords+l.classes);                int is_obj = net.truth[truth_index];                for (j = 0; j < l.n; ++j) {                    //预测的每个box中的confidence信息。参考l.output的组织方式                    int p_index = index + locations*l.classes + i*l.n + j;                    //先默认为该grid cell是no obj的                    l.delta[p_index] = l.noobject_scale*(0 - l.output[p_index]);                    *(l.cost) += l.noobject_scale*pow(l.output[p_index], 2);                    avg_anyobj += l.output[p_index];                }                int best_index = -1;                float best_iou = 0;                float best_rmse = 20;                if (!is_obj){                    //该grid cell没有物体，直接返回                    //其cost就是那个noobject相应的计算结果                    continue;                }                //第i个grid cell的classes预测信息的开始位置,classes的信息统一排在前面                int class_index = index + i*l.classes;                //论文中lossfunction的最后一项：                //计算该grid cell预测出来的每个class的概率信息与真实信息（对应的类别是1,其余为0.可以看ground truth信息组织的代码）的差值                for(j = 0; j < l.classes; ++j) {                    l.delta[class_index+j] = l.class_scale * (net.truth[truth_index+1+j] - l.output[class_index+j]);                    *(l.cost) += l.class_scale * pow(net.truth[truth_index+1+j] - l.output[class_index+j], 2);                    if(net.truth[truth_index + 1 + j]) avg_cat += l.output[class_index+j];                    avg_allcat += l.output[class_index+j];                }                //box的信息排在confidence，classes信息后面                box truth = float_to_box(net.truth + truth_index + 1 + l.classes, 1);                truth.x /= l.side;                truth.y /= l.side;                //判断哪个pred box与ground truth中的任何一个box                //的iou最大，则由该预测box负责对本grid cell                //进行预测                for(j = 0; j < l.n; ++j){                    //参考l.output的组织方式                    //跳过前面的classes和每个box的confidence的数据，到达box的坐标数据位置                    int box_index = index + locations*(l.classes + l.n) + (i*l.n + j) * l.coords;                    box out = float_to_box(l.output + box_index, 1);                    //归一化                    out.x /= l.side;                    out.y /= l.side;                    if (l.sqrt){                        //表示里面的预测值w,h是经过sqrt操作的，                        out.w = out.w*out.w;                        out.h = out.h*out.h;                    }                    float iou  = box_iou(out, truth);                    //iou = 0;                    float rmse = box_rmse(out, truth);                    if(best_iou > 0 || iou > 0){                        if(iou > best_iou){                            best_iou = iou;                            best_index = j;                        }                    }else{                        if(rmse < best_rmse){                            best_rmse = rmse;                            best_index = j;                        }                    }                }                if(l.forced){                    if(truth.w*truth.h < .1){                        best_index = 1;                    }else{                        best_index = 0;                    }                }                if(l.random && *(net.seen) < 64000){                    best_index = rand()%l.n;                }                //得到最佳box                int box_index = index + locations*(l.classes + l.n) + (i*l.n + best_index) * l.coords;                int tbox_index = truth_index + 1 + l.classes;                box out = float_to_box(l.output + box_index, 1);                out.x /= l.side;                out.y /= l.side;                if (l.sqrt) {                    out.w = out.w*out.w;                    out.h = out.h*out.h;                }                float iou  = box_iou(out, truth);                //printf("%d,", best_index);                int p_index = index + locations*l.classes + i*l.n + best_index;                //上面加了默认的无物体的情况                *(l.cost) -= l.noobject_scale * pow(l.output[p_index], 2);                *(l.cost) += l.object_scale * pow(1-l.output[p_index], 2);                //这个avg_obj没看到减去之前默认加的值                avg_obj += l.output[p_index];                //预测的每个box的confidence的值代表的是                //有多肯定这个box包含了一个物体,按论文的定义是                //期望的等于预测的box与ground truth box的iou.                //那么走l.rescore的分支是比较合理的。                l.delta[p_index] = l.object_scale * (1.-l.output[p_index]);                if(l.rescore){                    l.delta[p_index] = l.object_scale * (iou - l.output[p_index]);                }                l.delta[box_index+0] = l.coord_scale*(net.truth[tbox_index + 0] - l.output[box_index + 0]);                l.delta[box_index+1] = l.coord_scale*(net.truth[tbox_index + 1] - l.output[box_index + 1]);                l.delta[box_index+2] = l.coord_scale*(net.truth[tbox_index + 2] - l.output[box_index + 2]);                l.delta[box_index+3] = l.coord_scale*(net.truth[tbox_index + 3] - l.output[box_index + 3]);                if(l.sqrt){                    l.delta[box_index+2] = l.coord_scale*(sqrt(net.truth[tbox_index + 2]) - l.output[box_index + 2]);                    l.delta[box_index+3] = l.coord_scale*(sqrt(net.truth[tbox_index + 3]) - l.output[box_index + 3]);                }                *(l.cost) += pow(1-iou, 2);                avg_iou += iou;                ++count;            }        }        if(0){            float *costs = calloc(l.batch*locations*l.n, sizeof(float));            for (b = 0; b < l.batch; ++b) {                int index = b*l.inputs;                for (i = 0; i < locations; ++i) {                    for (j = 0; j < l.n; ++j) {                        int p_index = index + locations*l.classes + i*l.n + j;                        costs[b*locations*l.n + i*l.n + j] = l.delta[p_index]*l.delta[p_index];                    }                }            }            int indexes[100];            top_k(costs, l.batch*locations*l.n, 100, indexes);            float cutoff = costs[indexes[99]];            for (b = 0; b < l.batch; ++b) {                int index = b*l.inputs;                for (i = 0; i < locations; ++i) {                    for (j = 0; j < l.n; ++j) {                        int p_index = index + locations*l.classes + i*l.n + j;                        if (l.delta[p_index]*l.delta[p_index] < cutoff) l.delta[p_index] = 0;                    }                }            }            free(costs);        }        *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2);        printf("Detection Avg IOU: %f, Pos Cat: %f, All Cat: %f, Pos Obj: %f, Any Obj: %f, count: %d\n", avg_iou/count, avg_cat/count, avg_allcat/(count*l.classes), avg_obj/count, avg_anyobj/(l.batch*locations*l.n), count);        //if(l.reorg) reorg(l.delta, l.w*l.h, size*l.n, l.batch, 0);    }}

这里需要对l.output的数据组织方式进行了解，否则阅读其取各种值的代码会无法理解。l.output是包含一批图片的输出结果，每一张的输出紧跟着前一张的输出的后面，对于一张图片的输出数据其组织方式如下：
1、第一部分是每个grid cell预测的指定class的条件概率：
一个grid cell如果预测20个class，则排列方式就是第一个grid cell预测的class_0,class_1..class_19的条件概率值，紧接着是第二个grid cell的预测值，直到所有的grid cell的对于各类别的条件概率值都排列完毕；
2、第二部分是每个grid cell预测的每个box的confidence的值：
同样，如果一个grid cell预测2个box，则就是confidence_box_0,confidence_box_1….
3、第三部分就是预测的各个box的坐标信息了：
按照x,y,w,h的方式排列。

整体的检测流程步骤很清晰：
1、循环处理一批中的每一张图片，对于每一张图片，进行以下处理：
2、先默认当做该grid cell 不包含物体，计算loss情况：lossfunction中的第4项；
3、取出对应的grid cell的truth box的数据
4、计算条件概率的loss情况，该数据只与grid cell有关，与预测的box无关，就是lossfunction中的最后一项
5、找出该grid cell预测的最好的那个box，叫做best_box
6、利用best_box计算box坐标的loss情况：loss function中的前两项，以及confidence的loss情况：loss function中的第3项；

如果某个grid cell不包含任何物体，其loss部分只有一个noobj的；否则会包含其他部分。

对着源码与论文，基本理解了其计算思路，但是对于里面如何进行优化的过程，还不是很清楚，主要是对于反向传播的代码级别的理解不够。

阅读全文

1 0