Region Proposal Network


Training of RPN

Network structure


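The bottom layer of rpn_conv/3x3 is conv5_3. With stride = 1, pad = 1 and num_output = 512, the 3×3 convolution produces a feature map with exactly the same spatial size as conv5_3, and a ReLU is then applied in place: rpn/output = relu(rpn/output). At every position of rpn/output, 9 proposals are predicted. Concretely, the position is mapped back to the original image, and 3×3 = 9 rectangular regions of different sizes and aspect ratios, centered on that point, are taken; these are the anchors. The bbox is then regressed relative to each anchor.
Two heads sit on top of rpn/output:
- rpn_cls_score does the scoring, num_output = 18, corresponding to the object/background scores of the 9 proposals.
- rpn_bbox_pred regresses the bbox, num_output = 36, corresponding to the bbox position and size of the 9 proposals. In other words, each size + aspect ratio combination of anchor has its own 4 kernels for predicting the bbox.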
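
To make the 9 anchors per position concrete, here is a minimal sketch of how they could be enumerated. The base size of 16, the scales (8, 16, 32) and the aspect ratios (0.5, 1, 2) are the commonly used Faster RCNN defaults and are assumptions here, since the text does not list them. `make_anchors` and `shift_anchors` are hypothetical helper names, and the rounding used by the reference implementation is not reproduced.

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return a (9, 4) array of (x1, y1, x2, y2) anchors centered at the origin."""
    anchors = []
    for ratio in ratios:                      # ratio = height / width
        for scale in scales:
            area = float(base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, feat_stride=16):
    """Place the anchors at every feature-map position, in image coordinates."""
    xs = np.arange(feat_w) * feat_stride
    ys = np.arange(feat_h) * feat_stride
    sx, sy = np.meshgrid(xs, ys)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    # (K, 1, 4) + (1, A, 4) -> (K, A, 4) -> (K * A, 4)
    return (shifts[:, None, :] + anchors[None, :, :]).reshape(-1, 4)

base = make_anchors()
all_anchors = shift_anchors(base, feat_h=3, feat_w=4)
print(base.shape, all_anchors.shape)  # (9, 4) (108, 4)
```

A 3×4 feature map therefore carries 3 × 4 × 9 = 108 anchors, which matches the 18 score channels and 36 bbox channels per position described above.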

Training data

data: anchor
label: object/background, groundtruth bbox.

object/background class label

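Two kinds of anchors are labeled as object (positive samples):
1. Anchors whose IoU with any groundtruth bbox is greater than 0.7.
2. If a groundtruth bbox has no anchor satisfying condition 1, the anchor with the highest IoU with it.
One kind of anchor is labeled as background (negative sample): anchors whose IoU with every groundtruth bbox is less than 0.3.
All remaining anchors are ignored when computing the loss, i.e. they do not take part in training.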
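
The labeling rules can be sketched as follows. This is an illustrative NumPy version, not the code of the reference implementation; `iou_matrix` and `label_anchors` are hypothetical names, and boxes are assumed to be (x1, y1, x2, y2) with continuous coordinates.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (M, 4) groundtruth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = object, 0 = background, -1 = ignored."""
    overlaps = iou_matrix(anchors, gt_boxes)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    max_per_anchor = overlaps.max(axis=1)
    labels[max_per_anchor < neg_thresh] = 0           # background rule
    best = overlaps.argmax(axis=0)                    # best anchor for each gt box
    has_overlap = overlaps.max(axis=0) > 0
    labels[best[has_overlap]] = 1                     # rule 2
    labels[max_per_anchor > pos_thresh] = 1           # rule 1
    return labels

anchors = np.array([[0, 0, 10, 10], [20, 20, 30, 30],
                    [4, 0, 14, 10], [50, 50, 60, 60]], dtype=float)
gt_boxes = np.array([[0, 0, 10, 10], [20, 20, 30, 26]], dtype=float)
labels = label_anchors(anchors, gt_boxes)
print(labels.tolist())  # [1, 1, -1, 0]
```

The second anchor only reaches IoU 0.6 with its groundtruth box, but it is still positive because it is the best match for that box (rule 2). The third anchor (IoU about 0.43) falls between the two thresholds and is ignored.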

bbox label

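Only positive anchors have a bbox label, and that label is the groundtruth bbox that made the anchor positive. Note that the values in the bbox label are positions relative to the corresponding anchor: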

    targets_dx = (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = np.log(gt_widths / ex_widths)
    targets_dh = np.log(gt_heights / ex_heights)

(from fast_rcnn.bbox_transform.py)
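
For a self-contained view of these four lines, here is a sketch that also derives the widths, heights and centers they rely on. The function name `bbox_targets` is hypothetical, and the `+ 1.0` (treating coordinates as inclusive pixel indices) is an assumption of this sketch.

```python
import numpy as np

def bbox_targets(ex_rois, gt_rois):
    """Regression targets (dx, dy, dw, dh) of gt boxes relative to anchors.

    Both inputs are (N, 4) arrays of (x1, y1, x2, y2); row i of gt_rois is
    the groundtruth box assigned to anchor i.
    """
    ex_widths = ex_rois[:, 2] - ex_rois[:, 0] + 1.0
    ex_heights = ex_rois[:, 3] - ex_rois[:, 1] + 1.0
    ex_ctr_x = ex_rois[:, 0] + 0.5 * ex_widths
    ex_ctr_y = ex_rois[:, 1] + 0.5 * ex_heights

    gt_widths = gt_rois[:, 2] - gt_rois[:, 0] + 1.0
    gt_heights = gt_rois[:, 3] - gt_rois[:, 1] + 1.0
    gt_ctr_x = gt_rois[:, 0] + 0.5 * gt_widths
    gt_ctr_y = gt_rois[:, 1] + 0.5 * gt_heights

    targets_dx = (gt_ctr_x - ex_ctr_x) / ex_widths
    targets_dy = (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = np.log(gt_widths / ex_widths)
    targets_dh = np.log(gt_heights / ex_heights)
    return np.stack([targets_dx, targets_dy, targets_dw, targets_dh], axis=1)

anchor = np.array([[0.0, 0.0, 15.0, 15.0]])   # 16 x 16 anchor
gt = np.array([[4.0, 0.0, 19.0, 31.0]])       # 16 x 32 groundtruth box
targets = bbox_targets(anchor, gt)
# dx = 0.25, dy = 0.5, dw = 0, dh = log(2)
print(targets)
```

The center offsets are measured in units of the anchor size and the scale changes in log space, so the targets do not depend on the absolute size of the anchor.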

Loss function

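The loss consists of two parts: the classification loss Lcls and the regression loss Lreg.

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{1}$$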

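A few points worth noting:

- Lreg is the smooth L1 loss (SmoothL1Loss).
- p_i^*: 1 for positive samples, 0 for negative samples. In other words, negative samples do not contribute to Lreg.
- λ: the default value in Faster RCNN is 10; its purpose is to keep the contributions of Lcls and Lreg to the total loss roughly balanced.
- Ncls and Nreg are used for normalization, and in general they are not equal. The Faster RCNN paper also says that this normalization is optional and can be omitted. If it is omitted, the learning rate has to be reduced accordingly.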
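
As a rough numerical illustration of the two-term loss, the sketch below evaluates it for a handful of anchors. `rpn_loss` is a hypothetical function; using the number of anchors as the default for both Ncls and Nreg is an assumption made only to keep the example small, and the classification term is written as binary log loss on the object probability.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=None, n_reg=None):
    """p: (N,) predicted object probability, p_star: (N,) labels in {0, 1},
    t and t_star: (N, 4) predicted and target box offsets."""
    n_cls = len(p) if n_cls is None else n_cls
    n_reg = len(p) if n_reg is None else n_reg
    eps = 1e-12
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    # p_star masks the regression term, so negatives contribute nothing to it.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])
t = np.array([[0.5, 0.0, 0.0, 2.0], [5.0, 5.0, 5.0, 5.0]])
t_star = np.zeros((2, 4))
loss = rpn_loss(p, p_star, t, t_star)
print(loss)
```

Here the second anchor is a negative sample, so its large box error is ignored; only the first anchor adds to the regression term (0.125 + 1.5 = 1.625 before weighting).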

Corresponding code in Faster RCNN

`AnchorTargetLayer`

    layer {
      name: 'rpn-data'
      type: 'Python'
      bottom: 'rpn_cls_score'
      bottom: 'gt_boxes'
      bottom: 'im_info'
      bottom: 'data'
      top: 'rpn_labels'
      top: 'rpn_bbox_targets'
      top: 'rpn_bbox_inside_weights'
      top: 'rpn_bbox_outside_weights'
      python_param {
        module: 'rpn.anchor_target_layer'
        layer: 'AnchorTargetLayer'
        param_str: "'feat_stride': 16"
      }
    }

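This layer is the heart of the RPN: it generates the anchor data and labels from the image information and conv5_3.
It has four inputs (bottom), but the first one, rpn_cls_score, is only used for its shape and takes no part in any other computation.
It produces four outputs:

- rpn_labels: shape = (1, 1, 9 * h_conv5_3, w_conv5_3). Positive samples have label 1, negative samples 0, and all others -1. Anchors that extend beyond the image boundary are also labeled -1 and ignored.
- rpn_bbox_targets: shape = (1, 4 * 9, h_conv5_3, w_conv5_3)
- rpn_bbox_inside_weights: same shape as rpn_bbox_targets, equivalent to p^* in equation (1).
- rpn_bbox_outside_weights: same shape as rpn_bbox_targets, equivalent to 1/Nreg in equation (1) (although its value is one quarter of that, to normalize the sum of the four parts of the regression loss).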

rpn_cls_loss

In the network, Lcls corresponds to the rpn_loss_cls layer, whose top is rpn_cls_loss:

    layer {
      name: "rpn_loss_cls"
      type: "SoftmaxWithLoss"
      bottom: "rpn_cls_score_reshape"
      bottom: "rpn_labels"
      propagate_down: 1
      propagate_down: 0
      top: "rpn_cls_loss"
      loss_weight: 1
      loss_param {
        ignore_label: -1
        normalize: true
      }
    }

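It takes two inputs: rpn_labels and rpn_cls_score_reshape. The latter is obtained by reshaping the rpn_cls_score mentioned earlier. RPN training only supports batch_size = 1 (meaning the number of images, not the number of anchors), so the shape of rpn_cls_score is (1, 18, h_conv5_3, w_conv5_3). The layer that reshapes it is defined as: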

    layer {
      bottom: "rpn_cls_score"
      top: "rpn_cls_score_reshape"
      name: "rpn_cls_score_reshape"
      type: "Reshape"
      reshape_param { shape { dim: 0 dim: 2 dim: -1 dim: 0 } }
    }

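That is, the rpn_cls_score_reshape obtained after the reshape has shape:
(1, 2, 9 * h_conv5_3, w_conv5_3)
while rpn_labels has shape:
(1, 1, 9 * h_conv5_3, w_conv5_3)
The purpose of this reshape is to simplify the normalization in SoftmaxWithLoss: the total loss is multiplied by 1 / (shape[0] * shape[2] * shape[3]), whose denominator is the total number of anchors.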
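
The effect of the reshape can be checked with a small NumPy sketch. It assumes only that the blob is stored in row-major (N, C, H, W) order; the tiny h and w values are made up for the example.

```python
import numpy as np

h, w, num_anchors = 3, 4, 9
rpn_cls_score = np.arange(2 * num_anchors * h * w, dtype=np.float32).reshape(1, 18, h, w)

# Equivalent of reshape_param { shape { dim: 0 dim: 2 dim: -1 dim: 0 } }:
# 0 copies the corresponding input dimension, -1 is inferred from the rest.
reshaped = rpn_cls_score.reshape(rpn_cls_score.shape[0], 2, -1, rpn_cls_score.shape[3])
print(reshaped.shape)  # (1, 2, 27, 4)

# Channel k * 9 + a (k = class index, a = anchor index) lands at [0, k, a * h + y, x].
k, a, y, x = 1, 5, 2, 3
assert reshaped[0, k, a * h + y, x] == rpn_cls_score[0, k * num_anchors + a, y, x]

# A softmax over axis 1 now compares exactly the two scores of each anchor.
e = np.exp(reshaped - reshaped.max(axis=1, keepdims=True))
prob = e / e.sum(axis=1, keepdims=True)
```

After the reshape, axis 1 has length 2, so a softmax along the channel axis becomes a per-anchor two-way classification, and the label blob of shape (1, 1, 9 * h, w) lines up with it element by element.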

rpn_loss_bbox

The layer corresponding to Lreg:

    layer {
      name: "rpn_loss_bbox"
      type: "SmoothL1Loss"
      bottom: "rpn_bbox_pred"
      bottom: "rpn_bbox_targets"
      bottom: 'rpn_bbox_inside_weights'
      bottom: 'rpn_bbox_outside_weights'
      top: "rpn_loss_bbox"
      loss_weight: 1
      smooth_l1_loss_param { sigma: 3.0 }
    }

It also takes four inputs; for the meaning of each parameter, see SmoothL1Loss.
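
As a rough guide to how the four inputs and sigma interact, here is a NumPy sketch of a weighted smooth L1 loss. It reflects my reading of the layer (inside weights applied before the nonlinearity, outside weights after it, transition point at 1 / sigma^2) and should be treated as an assumption rather than a transcription of the Caffe code; `smooth_l1_loss` is a hypothetical name.

```python
import numpy as np

def smooth_l1_loss(pred, target, inside_w, outside_w, sigma=3.0):
    """Weighted smooth L1 summed over all elements (single-image batch assumed)."""
    sigma2 = sigma ** 2
    d = inside_w * (pred - target)          # inside weights mask out non-positive anchors
    ad = np.abs(d)
    per_elem = np.where(ad < 1.0 / sigma2,
                        0.5 * sigma2 * d ** 2,   # quadratic near zero
                        ad - 0.5 / sigma2)       # linear elsewhere
    return float((outside_w * per_elem).sum())   # outside weights do the normalization

pred = np.array([0.05, 1.0])
target = np.zeros(2)
loss_bbox = smooth_l1_loss(pred, target,
                           inside_w=np.ones(2), outside_w=np.full(2, 0.5))
print(loss_bbox)
```

With sigma = 3 the quadratic region shrinks to |x| < 1/9, so the loss behaves like plain L1 for all but very small errors.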

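Training procedure

End-to-End BP
All anchors in a mini-batch come from the same image. 256 anchors are sampled at random from one image, aiming for a 1:1 ratio of positive to negative samples. If there are fewer than 128 positive samples, negative samples fill the remaining slots.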
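
One way to implement this sampling is to start from the full label array and switch surplus anchors to the ignored label -1. The sketch below does that; `sample_minibatch` is a hypothetical name and the use of `numpy.random.Generator` is a choice made for this example.

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, fg_fraction=0.5, rng=None):
    """Keep at most batch_size anchors by setting surplus labels to -1 (ignored)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = labels.copy()

    # Cap the positives at fg_fraction * batch_size (128 by default).
    max_fg = int(fg_fraction * batch_size)
    fg_inds = np.flatnonzero(labels == 1)
    if len(fg_inds) > max_fg:
        drop = rng.choice(fg_inds, size=len(fg_inds) - max_fg, replace=False)
        labels[drop] = -1

    # Negatives fill whatever is left of the batch.
    max_bg = batch_size - int((labels == 1).sum())
    bg_inds = np.flatnonzero(labels == 0)
    if len(bg_inds) > max_bg:
        drop = rng.choice(bg_inds, size=len(bg_inds) - max_bg, replace=False)
        labels[drop] = -1
    return labels

rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(30), np.zeros(1000), -np.ones(500)]).astype(np.int64)
sampled = sample_minibatch(labels, rng=rng)
print((sampled == 1).sum(), (sampled == 0).sum())  # 30 226
```

With only 30 positives available, the batch is completed with 226 negatives, which is the "negatives fill the remaining slots" case described above.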

RPN In Faster-RCNN

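From cls/bbox_pred to proposals

The rpn_cls_score_reshape output by the RPN goes through Softmax and Reshape to become rpn_cls_prob_reshape. Together with rpn_bbox_pred, it then has to pass through ProposalLayer before it yields proposals that Fast RCNN can use. ProposalLayer is another custom Python layer of Faster RCNN (as is the AnchorTargetLayer above).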

    layer {
      name: 'proposal'
      type: 'Python'
      bottom: 'rpn_cls_prob_reshape'
      bottom: 'rpn_bbox_pred'
      bottom: 'im_info'
      top: 'rpn_rois'
    #  top: 'rpn_scores'
      python_param {
        module: 'rpn.proposal_layer'
        layer: 'ProposalLayer'
        param_str: "'feat_stride': 16"
      }
    }

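So what happens inside this layer?
First, the format of its inputs and outputs.

Inputs
- rpn_cls_prob_reshape: shape = (1, 18, h, w). Here (h, w) is the earlier (h_conv5_3, w_conv5_3), abbreviated from now on.
- rpn_bbox_pred: shape = (1, 36, h, w)
- im_info: image information, e.g. height and width.
Outputs
- rpn_rois: shape = (n_proposals, 5). The first column is the index, within the batch, of the image the proposal belongs to, so it is all zeros.

The output of the previous layers means that each of the 9 anchors at every position of conv5_3 has predicted its object/background score and bbox. Many of these bboxes are very likely to overlap or contain one another, so a filtering step is needed. The main work done by ProposalLayer is:
1. As mentioned earlier, rpn_bbox_pred holds the position of each proposal relative to its anchor, so the position relative to the image has to be recovered from the anchor: see fast_rcnn.bbox_transform.bbox_transform_inv.
2. Use NMS (non-maximum suppression) to keep the highest-scoring bbox among overlapping ones.
3. Other filtering.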
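
Steps 1 and 2 can be sketched in NumPy as follows. This is an illustrative version, not the reference code: `decode_boxes` and `nms` are hypothetical names, the `+ 1.0` pixel convention is an assumption, and the NMS threshold of 0.7 is assumed as a typical value since the text does not state one.

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Invert the (dx, dy, dw, dh) encoding: anchors + deltas -> (x1, y1, x2, y2)."""
    widths = anchors[:, 2] - anchors[:, 0] + 1.0
    heights = anchors[:, 3] - anchors[:, 1] + 1.0
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights
    pred_ctr_x = deltas[:, 0] * widths + ctr_x
    pred_ctr_y = deltas[:, 1] * heights + ctr_y
    pred_w = np.exp(deltas[:, 2]) * widths
    pred_h = np.exp(deltas[:, 3]) * heights
    return np.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], axis=1)

def nms(boxes, scores, thresh=0.7):
    """Greedy NMS; returns indices of the kept boxes, highest score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1.0) * (y2 - y1 + 1.0)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]) + 1.0)
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]) + 1.0)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= thresh]          # drop boxes that overlap box i too much
    return keep

boxes = np.array([[0, 0, 10, 10], [0, 1, 10, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with IoU of about 0.83 and has a lower score, so it is suppressed; the third box does not overlap anything and survives.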

0 0