Learning how bounding box regression works: YOLOv2

Source: Internet. Editor: 程序博客网. Time: 2024/06/18 08:47

What YOLOv2 changes about boxes relative to YOLOv1

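Compared with YOLOv1, YOLOv2 highlights the following improvements:
1. Batch normalization (BN) is added to the convolutional layers. According to the paper this improves mAP by more than 2%. Note that BN normalizes layer activations inside the network; it is not an image preprocessing step. Stanford's cs231n course also recommends using BN.
2. The training procedure is also improved.
YOLOv1 trains the classification network on ImageNet at a low resolution of 224x224, and then raises the network input to 448x448 for detection.
YOLOv2 first fine-tunes the classification network on ImageNet at 448x448 (the paper does this for 10 epochs), so that the feature extraction layers adapt to higher-resolution images before the network is fine-tuned for detection. The paper reports that this gives almost 4% mAP.
3. The box regression is changed, and the fully connected layers are removed for this purpose. In YOLOv1 the network learns the box coordinates directly, from scratch, and keeps optimizing until the objective is met. YOLOv2 borrows the idea from Faster R-CNN and learns relative to prior box sizes: a set of box shapes is chosen first, every grid cell predicts boxes based on those prior shapes, and the prediction then adjusts them. If the prior box sizes are chosen well, the network quickly learns parameters that fit the targets, i.e. it converges faster.
In the paper's words: Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.
4. The prediction output changes:
In YOLOv1, each grid cell predicts one set of conditional class probabilities, while each box has its own coords and confidence values.
In YOLOv2, every box has its own coords, confidence, and conditional class probabilities.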
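To make item 3 concrete, the offset parameterization from the YOLOv2 paper can be written out as below. Here $t_x, t_y, t_w, t_h$ are the raw network outputs, $(c_x, c_y)$ is the top-left corner of the grid cell, $(p_w, p_h)$ is the prior (anchor) size, and $\sigma$ is the logistic function. In the darknet code quoted later, `i` and `j` play the role of $c_x$ and $c_y$, and `biases[2*n]`, `biases[2*n+1]` play the role of $p_w$, $p_h$; the code additionally divides by the feature map width and height to normalize the result.

```latex
\begin{align}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{align}
```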

What anchor boxes mean

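Anchor boxes are prior knowledge obtained from statistics over the training samples: the most representative box shapes are taken as anchor boxes. (The paper picks them by running k-means clustering on the training box dimensions with an IoU-based distance, rather than by simply counting the most frequent shapes.) Because the priors come from the training samples, having every grid cell predict with them covers most of the cases that are likely to occur, so recall is relatively high. The paper mentions a small drop in accuracy (69.5 mAP with 81% recall without anchor boxes, versus 69.2 mAP with 88% recall with them). A possible reason is that the box shapes are fairly fixed: for an object whose shape is not among these top n priors, the prediction starts from a mismatched shape and the resulting IoU is somewhat worse, which is understandable.
If the objects in the real application have roughly the same size distribution as in training, then using these anchor boxes as priors gives a network that is genuinely useful in practice.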
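The notion of "how well a prior shape fits an object" can be sketched in C. This is a minimal illustration, not darknet code: `shape_iou` and `nearest_anchor` are hypothetical names, and the anchor array layout (w0, h0, w1, h1, ...) is assumed to match the `l.biases` layout used in the source further down. The distance `1 - IoU` over width and height only is the one the paper uses for its k-means clustering.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* IoU of two boxes that share the same center, so only width and height matter. */
float shape_iou(float w1, float h1, float w2, float h2)
{
    float iw = w1 < w2 ? w1 : w2;          /* overlap width  */
    float ih = h1 < h2 ? h1 : h2;          /* overlap height */
    float inter = iw * ih;
    float uni = w1 * h1 + w2 * h2 - inter;
    return uni > 0 ? inter / uni : 0;
}

/* Index of the anchor with the smallest distance d = 1 - IoU to a (w, h) box.
   anchors is laid out as w0, h0, w1, h1, ... (assumed layout). */
int nearest_anchor(const float *anchors, size_t n, float w, float h)
{
    assert(anchors != NULL && n > 0);
    int best = 0;
    float best_d = 2.0f;                   /* larger than any possible distance */
    for (size_t k = 0; k < n; ++k) {
        float d = 1.0f - shape_iou(w, h, anchors[2 * k], anchors[2 * k + 1]);
        if (d < best_d) {
            best_d = d;
            best = (int)k;
        }
    }
    return best;
}
```

An object whose shape is far from every anchor gets a large distance for all of them, which is the situation described above where the resulting IoU tends to be worse.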

Source code analysis

Basic idea of YOLOv2 detection

Besides the box coordinates, YOLO also predicts the class and a confidence value for each box.
In YOLOv2 the exact meaning is:
When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and
instead predict class and objectness for every anchor box. Following YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box and the class predictions predict the conditional probability of that class given that there is an object.
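One way to read the quoted passage: since the class outputs are conditional on an object being present, a per-class detection score is the objectness multiplied by the conditional class probability. The sketch below assumes the class probabilities of one box have already been gathered into a contiguous array; `best_class_score` is a hypothetical helper, not a darknet function.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Score of the most likely class for one box: objectness * P(class | object).
   class_probs holds n_classes conditional probabilities for this box.
   best_class may be NULL; otherwise it receives the winning class index. */
float best_class_score(float objectness, const float *class_probs,
                       int n_classes, int *best_class)
{
    assert(class_probs != NULL && n_classes > 0);
    int best = 0;
    for (int c = 1; c < n_classes; ++c) {
        if (class_probs[c] > class_probs[best]) best = c;
    }
    if (best_class) *best_class = best;
    return objectness * class_probs[best];
}
```

For example, an objectness of 0.5 with class probabilities {0.1, 0.7, 0.2} selects class 1 with a score of about 0.35.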

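Analysis of region_layer.c

1. Data structure

As before, it helps to understand how l.output is organized before reading the code (I worked this out by analyzing the code myself).
The data is split into segments, one per box index predicted by the grid cells:
the first segment holds box 0 of every grid cell;
the second segment holds box 1 of every grid cell;
and so on.
Within each segment the data is laid out as follows:
first the x values of all boxes, then y, then w, h, and confidence, followed by the class scores when l.classes > 0.
For example, assuming the output feature map is 2x2, each cell predicts 2 boxes, and the class channels are left out for brevity:
xxxxyyyywwwwhhhhccccxxxxyyyywwwwhhhhcccc

You can picture the final feature map as a cuboid with many layers: its width and height are l.w and l.h, and its depth is:
depth=l.n*(l.coords+l.classes+1)
l.n is the number of boxes each grid cell predicts;
l.coords is the number of coordinates, which is 4 (the source code also has if-branches for values other than 4, which I did not look into);
l.classes is the number of object classes to detect;
the 1 is the confidence value.

In effect, the cuboid of depth depth is cut into depth slices, each slice is flattened into a strip, and the strips placed one after another form the data in l.output.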
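The layout described above can be checked with a small offset calculation. This is a sketch under the stated layout, mirroring what darknet's `entry_index` computes; `region_shape`, `region_outputs`, and `region_offset` are hypothetical names introduced only for illustration.

```c
#include <assert.h>

/* Hypothetical mirror of the few layer fields the layout depends on. */
typedef struct {
    int w, h;      /* feature map width and height */
    int n;         /* boxes predicted per grid cell */
    int coords;    /* number of coordinates, 4: x, y, w, h */
    int classes;   /* number of object classes */
} region_shape;

/* Number of floats in the output for one image. */
int region_outputs(region_shape s)
{
    return s.w * s.h * s.n * (s.coords + s.classes + 1);
}

/* Offset of channel `entry` (0=x, 1=y, 2=w, 3=h, 4=confidence, 5+=class)
   for box n of the cell at (col, row) in image `batch`. */
int region_offset(region_shape s, int batch, int n, int row, int col, int entry)
{
    assert(n >= 0 && n < s.n);
    assert(row >= 0 && row < s.h && col >= 0 && col < s.w);
    assert(entry >= 0 && entry < s.coords + s.classes + 1);
    int plane = s.w * s.h;                                /* one flattened slice */
    int per_box = plane * (s.coords + s.classes + 1);     /* one box segment */
    return batch * region_outputs(s) + n * per_box + entry * plane + row * s.w + col;
}
```

With a 2x2 map, 2 boxes per cell, and no class channels, `region_outputs` gives 40, which matches the 40-character layout string above, and the confidence of box 1 at cell (0, 0) sits at offset 36, the first `c` of the second segment.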

2. Source code

void forward_region_layer(const layer l, network net)
{
    int i,j,b,t,n;
    memcpy(l.output, net.input, l.outputs*l.batch*sizeof(float));

#ifndef GPU
    for (b = 0; b < l.batch; ++b){
        for(n = 0; n < l.n; ++n){
            int index = entry_index(l, b, n*l.w*l.h, 0);
            activate_array(l.output + index, 2*l.w*l.h, LOGISTIC);
            index = entry_index(l, b, n*l.w*l.h, l.coords);
            if(!l.background) activate_array(l.output + index,   l.w*l.h, LOGISTIC);
        }
    }
    if (l.softmax_tree){
        int i;
        int count = l.coords + 1;
        for (i = 0; i < l.softmax_tree->groups; ++i) {
            int group_size = l.softmax_tree->group_size[i];
            softmax_cpu(net.input + count, group_size, l.batch, l.inputs, l.n*l.w*l.h, 1, l.n*l.w*l.h, l.temperature, l.output + count);
            count += group_size;
        }
    } else if (l.softmax){
        int index = entry_index(l, 0, 0, l.coords + !l.background);
        softmax_cpu(net.input + index, l.classes + l.background, l.batch*l.n, l.inputs/l.n, l.w*l.h, 1, l.w*l.h, 1, l.output + index);
    }
#endif

    memset(l.delta, 0, l.outputs * l.batch * sizeof(float));
    if(!net.train) return;
    float avg_iou = 0;
    float recall = 0;
    float avg_cat = 0;
    float avg_obj = 0;
    float avg_anyobj = 0;
    int count = 0;
    int class_count = 0;
    *(l.cost) = 0;
    // l.output is organized as follows:
    // the data is split into segments, one per box index predicted by the grid cells:
    // the first segment holds box 0 of every grid cell;
    // the second segment holds box 1 of every grid cell;
    // and so on.
    // Within each segment the data is laid out as follows:
    // first the x values of all boxes, then y, then w, h, and confidence.
    // For example, assuming the output feature map is 2x2 and each cell predicts 2 boxes:
    // xxxxyyyywwwwhhhhccccxxxxyyyywwwwhhhhcccc
    for (b = 0; b < l.batch; ++b) {
        if(l.softmax_tree){
            int onlyclass = 0;
            for(t = 0; t < 30; ++t){
                box truth = float_to_box(net.truth + t*(l.coords + 1) + b*l.truths, 1);
                if(!truth.x) break;
                int class = net.truth[t*(l.coords + 1) + b*l.truths + l.coords];
                float maxp = 0;
                int maxi = 0;
                if(truth.x > 100000 && truth.y > 100000){
                    for(n = 0; n < l.n*l.w*l.h; ++n){
                        int class_index = entry_index(l, b, n, l.coords + 1);
                        int obj_index = entry_index(l, b, n, l.coords);
                        float scale =  l.output[obj_index];
                        l.delta[obj_index] = l.noobject_scale * (0 - l.output[obj_index]);
                        float p = scale*get_hierarchy_probability(l.output + class_index, l.softmax_tree, class, l.w*l.h);
                        if(p > maxp){
                            maxp = p;
                            maxi = n;
                        }
                    }
                    int class_index = entry_index(l, b, maxi, l.coords + 1);
                    int obj_index = entry_index(l, b, maxi, l.coords);
                    delta_region_class(l.output, l.delta, class_index, class, l.classes, l.softmax_tree, l.class_scale, l.w*l.h, &avg_cat);
                    if(l.output[obj_index] < .3) l.delta[obj_index] = l.object_scale * (.3 - l.output[obj_index]);
                    else  l.delta[obj_index] = 0;
                    l.delta[obj_index] = 0;
                    ++class_count;
                    onlyclass = 1;
                    break;
                }
            }
            if(onlyclass) continue;
        }
        for (j = 0; j < l.h; ++j) {
            for (i = 0; i < l.w; ++i) {
                for (n = 0; n < l.n; ++n) {
                    // Start position of the data for box n predicted by this grid cell
                    int box_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 0);
                    box pred = get_region_box(l.output, l.biases, n, box_index, i, j, l.w, l.h, l.w*l.h);
                    float best_iou = 0;
                    for(t = 0; t < 30; ++t){
                        box truth = float_to_box(net.truth + t*(l.coords + 1) + b*l.truths, 1);
                        if(!truth.x) break;
                        float iou = box_iou(pred, truth);
                        if (iou > best_iou) {
                            best_iou = iou;
                        }
                    }
                    int obj_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, l.coords);
                    avg_anyobj += l.output[obj_index];
                    l.delta[obj_index] = l.noobject_scale * (0 - l.output[obj_index]);
                    if(l.background) l.delta[obj_index] = l.noobject_scale * (1 - l.output[obj_index]);
                    if (best_iou > l.thresh) {
                        l.delta[obj_index] = 0;
                    }

                    if(*(net.seen) < 12800){
                        box truth = {0};
                        truth.x = (i + .5)/l.w;
                        truth.y = (j + .5)/l.h;
                        truth.w = l.biases[2*n]/l.w;
                        truth.h = l.biases[2*n+1]/l.h;
                        delta_region_box(truth, l.output, l.biases, n, box_index, i, j, l.w, l.h, l.delta, .01, l.w*l.h);
                    }
                }
            }
        }
        for(t = 0; t < 30; ++t){
            box truth = float_to_box(net.truth + t*(l.coords + 1) + b*l.truths, 1);

            if(!truth.x) break;
            float best_iou = 0;
            int best_n = 0;
            i = (truth.x * l.w);
            j = (truth.y * l.h);
            //printf("%d %f %d %f\n", i, truth.x*l.w, j, truth.y*l.h);
            box truth_shift = truth;
            truth_shift.x = 0;
            truth_shift.y = 0;
            //printf("index %d %d\n",i, j);
            for(n = 0; n < l.n; ++n){
                int box_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 0);
                box pred = get_region_box(l.output, l.biases, n, box_index, i, j, l.w, l.h, l.w*l.h);
                if(l.bias_match){
                    pred.w = l.biases[2*n]/l.w;
                    pred.h = l.biases[2*n+1]/l.h;
                }
                //printf("pred: (%f, %f) %f x %f\n", pred.x, pred.y, pred.w, pred.h);
                pred.x = 0;
                pred.y = 0;
                // Does forcing x and y to 0 make the IoU inaccurate? No: truth_shift has
                // also been moved to the origin, so this IoU compares width and height only,
                // which is what is needed to pick the best-matching box shape.
                float iou = box_iou(pred, truth_shift);
                if (iou > best_iou){
                    best_iou = iou;
                    best_n = n;
                }
            }
            //printf("%d %f (%f, %f) %f x %f\n", best_n, best_iou, truth.x, truth.y, truth.w, truth.h);

            int box_index = entry_index(l, b, best_n*l.w*l.h + j*l.w + i, 0);
            float iou = delta_region_box(truth, l.output, l.biases, best_n, box_index, i, j, l.w, l.h, l.delta, l.coord_scale *  (2 - truth.w*truth.h), l.w*l.h);
            if(l.coords > 4){
                int mask_index = entry_index(l, b, best_n*l.w*l.h + j*l.w + i, 4);
                delta_region_mask(net.truth + t*(l.coords + 1) + b*l.truths + 5, l.output, l.coords - 4, mask_index, l.delta, l.w*l.h, l.mask_scale);
            }
            // If the IoU is greater than 0.5, the object counts as successfully recalled
            if(iou > .5) recall += 1;
            avg_iou += iou;

            //l.delta[best_index + 4] = iou - l.output[best_index + 4];
            int obj_index = entry_index(l, b, best_n*l.w*l.h + j*l.w + i, l.coords);
            avg_obj += l.output[obj_index];
            l.delta[obj_index] = l.object_scale * (1 - l.output[obj_index]);
            if (l.rescore) {
                l.delta[obj_index] = l.object_scale * (iou - l.output[obj_index]);
            }
            if(l.background){
                l.delta[obj_index] = l.object_scale * (0 - l.output[obj_index]);
            }

            int class = net.truth[t*(l.coords + 1) + b*l.truths + l.coords];
            if (l.map) class = l.map[class];
            int class_index = entry_index(l, b, best_n*l.w*l.h + j*l.w + i, l.coords + 1);
            delta_region_class(l.output, l.delta, class_index, class, l.classes, l.softmax_tree, l.class_scale, l.w*l.h, &avg_cat);
            ++count;
            ++class_count;
        }
    }
    //printf("\n");
    *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2);
    printf("Region Avg IOU: %f, Class: %f, Obj: %f, No Obj: %f, Avg Recall: %f,  count: %d\n", avg_iou/count, avg_cat/class_count, avg_obj/count, avg_anyobj/(l.w*l.h*l.n*l.batch), recall/count, count);
}

box get_region_box(float *x, float *biases, int n, int index, int i, int j, int w, int h, int stride)
{
    box b;
    // i is Cx in the paper.
    // The predicted output x[index + 0*stride] is the horizontal offset from the top-left corner of the grid cell.
    // The result is normalized at the end. Everything below follows the formulas in the paper.
    b.x = (i + x[index + 0*stride]) / w;
    b.y = (j + x[index + 1*stride]) / h;
    b.w = exp(x[index + 2*stride]) * biases[2*n]   / w;
    b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;
    return b;
}

int entry_index(layer l, int batch, int location, int entry)
{
    int n =   location / (l.w*l.h);
    int loc = location % (l.w*l.h);
    return batch*l.outputs + n*l.w*l.h*(l.coords+l.classes+1) + entry*l.w*l.h + loc;
}

float delta_region_box(box truth, float *x, float *biases, int n, int index, int i, int j, int w, int h, float *delta, float scale, int stride)
{
    box pred = get_region_box(x, biases, n, index, i, j, w, h, stride);
    float iou = box_iou(pred, truth);

    float tx = (truth.x*w - i); // x of the truth box relative to the top-left corner of the grid cell
    float ty = (truth.y*h - j); // y of the truth box relative to the top-left corner of the grid cell
    // Compare with how get_region_box computes b.w and b.h: this brings both to the same representation
    float tw = log(truth.w*w / biases[2*n]);
    float th = log(truth.h*h / biases[2*n + 1]);

    delta[index + 0*stride] = scale * (tx - x[index + 0*stride]);
    delta[index + 1*stride] = scale * (ty - x[index + 1*stride]);
    delta[index + 2*stride] = scale * (tw - x[index + 2*stride]);
    delta[index + 3*stride] = scale * (th - x[index + 3*stride]);
    return iou;
}

The computation flow is similar to YOLOv1: everything revolves around the loss function and produces the corresponding delta values.
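The recurring pattern in the source is `delta = scale * (target - output)`, and the reported cost is the squared magnitude of the whole delta array, i.e. the sum of squared deltas. A minimal sketch of that pattern follows; `scaled_delta` and `sum_squared` are hypothetical helpers, and the scale values in the usage note are made-up numbers rather than darknet defaults.

```c
#include <assert.h>
#include <stddef.h>

/* delta = scale * (target - output): the form used for box, objectness and class terms. */
float scaled_delta(float scale, float target, float output)
{
    return scale * (target - output);
}

/* Cost as the sum of squared deltas (the squared magnitude of the delta array). */
float sum_squared(const float *delta, size_t n)
{
    assert(delta != NULL || n == 0);
    float sum = 0;
    for (size_t k = 0; k < n; ++k) {
        sum += delta[k] * delta[k];
    }
    return sum;
}
```

For instance, with an objectness output of 0.5, a no-object scale of 1 gives a delta of -0.5 (target 0), while an object scale of 5 gives a delta of 2.5 (target 1); together they contribute 0.25 + 6.25 = 6.5 to the cost.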

II. Faster R-CNN details
1. Paper study

2. Source code study
