Faster R-CNN: How It Works

Source: Internet. Editor: 程序博客网. Time: 2024/06/08 12:35

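This note is only partly written and is not yet complete; I will finish it as soon as I can.
Everything below is my own understanding and summary; corrections are welcome.
(1) The role of anchors
Q1: Why not regress the bbox coordinates directly? Why use anchors as an aid?
A: Objects in natural images mostly fall within a few common size ranges, and the 9 anchors in Faster R-CNN roughly cover those ranges. Finding the anchor closest to an object's gt (the ground-truth box) and fine-tuning it toward the gt is therefore faster and more accurate. Regressing the bbox coordinates directly would be slow.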

Q2: The loss function of the RPN:
For one image (the unit of this loss is the anchor, because it is defined per image, and one image keeps _C.TRAIN.RPN_BATCHSIZE = 256 anchors):

$$L = \frac{1}{N_{cls}} \sum_{i=0}^{n} L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_{i=0}^{n} p_i^* L_{reg}(t_i, t_i^*)$$  Formula (1)

Here i is the index of an anchor within a mini-batch (the mini-batch size is $N_{cls} = 256$, meaning 256 anchors, not 256 images).

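The first $p_i^*$: corresponds to rpn_labels in the code, one per anchor. $p_i^* = 1$ means the anchor is a positive sample; $p_i^* = 0$ means it is a negative sample.
The second $p_i^*$: in the actual code this corresponds to rpn_bbox_inside_weight, with the same meaning as the first $p_i^*$.
$t_i^*$: corresponds to rpn_bbox_targets in the code.
$t_i$: the direct output of the RPN regression layer, i.e. the 4 offsets of an anchor: $t_x, t_y, t_w, t_h$.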

The classification loss $L_{cls}$ is log loss over two classes (object vs. not object).
The regression loss is $L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*)$.
The factor in $p_i^* L_{reg}(t_i, t_i^*)$ means the regression loss is activated only when the anchor is positive ($p_i^* = 1$).
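
As a rough illustration of formula (1), here is a minimal NumPy sketch. The names `smooth_l1` and `rpn_loss` are made up for this example and are not taken from Faster-RCNN_TF. The sketch assumes the standard smooth L1 with its breakpoint at 1, and it leaves the choice of $N_{reg}$ and λ to the caller (by default it simply reuses $N_{cls}$, which is an assumption).

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=1.0, n_reg=None):
    """p: [N] predicted object probabilities, p_star: [N] labels in {0, 1},
    t, t_star: [N, 4] predicted and target offsets."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)             # avoid log(0)
    n_cls = len(p)
    n_reg = n_cls if n_reg is None else n_reg      # assumption: default normalizer
    l_cls = -(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))
    l_reg = smooth_l1(t - t_star).sum(axis=1)      # one value per anchor
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```

Because of the factor `p_star`, the offsets of negative anchors contribute nothing to the result, however far off they are.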

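$$t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a}$$  Formula (2)
$$t_x^* = \frac{x^* - x_a}{w_a},\quad t_y^* = \frac{y^* - y_a}{h_a},\quad t_w^* = \log\frac{w^*}{w_a},\quad t_h^* = \log\frac{h^*}{h_a}$$  Formula (3)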

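Here $x^*$ is the center x-coordinate of the ground-truth box (likewise $y^*, w^*, h^*$);
$x_a$ is the center x-coordinate of the anchor box (likewise $y_a, w_a, h_a$);
$x$ is the center x-coordinate of the predicted box (likewise $y, w, h$). Note: this is not the direct output of the network!
The direct output of the network is $t_x, t_y, t_w, t_h$.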
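
To make formulas (2) and (3) concrete, the sketch below encodes one box relative to one anchor. `encode_box` is a hypothetical helper written for this note; boxes are given as (center x, center y, width, height), and the numbers are made up.

```python
import numpy as np

def encode_box(box, anchor):
    """box, anchor: (x_center, y_center, w, h). Returns (t_x, t_y, t_w, t_h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

anchor = (100.0, 100.0, 128.0, 128.0)   # made-up anchor
gt = (116.0, 92.0, 256.0, 64.0)         # made-up ground-truth box
t_star = encode_box(gt, anchor)         # formula (3): regression target
```

Passing the predicted box instead of the gt gives formula (2). The center shifts are measured in units of the anchor size and the size changes are log ratios, so the targets do not depend on the absolute scale of the anchor.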

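Q1: First, an image of arbitrary size P×Q is rescaled: the short side is fixed to 600 and the long side is scaled by the same factor.

Why the short side is scaled to 600: in the paper, the largest square anchor has a side length of 512. For that anchor to roughly cover the image, the image size should be close to the side length of the largest anchor, hence 600.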
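
A small sketch of the rescaling rule just described. `short_side_scale` is a name invented for this example, and the sketch ignores any upper limit on the long side, which this note does not discuss.

```python
def short_side_scale(height, width, target=600):
    """Return (scale, new_height, new_width) so the short side becomes `target`."""
    scale = float(target) / min(height, width)
    return scale, int(round(height * scale)), int(round(width * scale))

scale, new_h, new_w = short_side_scale(375, 500)   # a made-up 500x375 image
```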

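Q2: The base network can be VGG16, ZF, etc.

ZF: output feature map size 40×60×256;
VGG16: 13 conv layers, 13 ReLU layers, 4 pooling layers, output feature map size 40×60×512 (VGG16 is used as the example from here on)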
[Figure: VGG16]

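Note: every conv layer uses kernel_size=3, stride=1, pad=1 (conv layers do not change the spatial size).
Formula for the width after a conv layer: $P_{out} = \lfloor (P + 2 \cdot pad - kernel) / stride \rfloor + 1$;
every pooling layer uses kernel_size=2, stride=2 (only the pooling layers change the spatial size). After 4 pooling layers, the feature map output by conv5_3 is 1/16 the size of the input image.
1. How is the zero padding done? (Is a ring of zeros added around the border of the input, or a ring of values equal to the border pixels?)
2. After 4 pooling layers the feature map is 1/16 the size of the input. Is that the receptive field? (No, it is only a comparison of sizes.) Does it mean each position on the feature map corresponds to a 16*16 region of the input? (No; that notion is the receptive field.) If it is not the receptive field, then what is the receptive field, and how is it computed?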
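
The output-size formula above can be checked with a few lines of Python. `conv_out` and `vgg16_conv5_3_size` are hypothetical helpers; the sketch uses floor division exactly as in the formula, so a framework that rounds pooling sizes up may differ by one.

```python
def conv_out(size, kernel, stride, pad):
    """Output size of a conv or pooling layer: floor((P + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

def vgg16_conv5_3_size(size):
    """Apply the 4 pooling stages that precede conv5_3."""
    for _ in range(4):
        size = conv_out(size, 3, 1, 1)   # 3x3 conv, pad 1: size unchanged
        size = conv_out(size, 2, 2, 0)   # 2x2 pooling, stride 2: size halved
    return size
```

For example, `vgg16_conv5_3_size(600)` and `vgg16_conv5_3_size(1000)` give the feature-map height and width for a 1000×600 input under this floor-rounding assumption.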

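(3) The RPN

1. After conv5 comes a 'rpn_conv/3x3' layer (3x3 convolution, pad=1, stride=1, num_output=512; it does not change the spatial size). Each point thereby also absorbs the 3x3 spatial context around it (my guess is that this may be more robust; I have not tested it), while the depth stays 512-d. The feature map at this point is 51×39×512.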

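2. Generating anchors
2.1 Anchor sizes and how they are generated

For each position in the 51×39×512 feature map (1×1×512: each position is a vector of depth 512), k=9 anchors are generated. Their sizes are defined as follows:

1. anchor_width:anchor_height = {1:2, 1:1, 2:1}
2. anchor_area = {128*128, 256*256, 512*512}

Note: these anchor sizes are measured on the input image, not on the 51×39×512 feature map. This can be seen in Faster-RCNN_TF/lib/rpn_msr/generate_anchors.py (the source is in Faster-RCNN_TF):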

    def generate_anchors(base_size=16, ratios=[0.5, 1, 2],
                         scales=2**np.arange(3, 6)):
        """
        Generate anchor (reference) windows by enumerating aspect ratios X
        scales wrt a reference (0, 0, 15, 15) window.
        """
        base_anchor = np.array([1, 1, base_size, base_size]) - 1
        ratio_anchors = _ratio_enum(base_anchor, ratios)
        anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales)
                             for i in xrange(ratio_anchors.shape[0])])
        # anchors has shape [9, 4]: 9 anchors, the 4 channels being each
        # anchor's top-left (x1, y1) and bottom-right (x2, y2) coordinates
        return anchors

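This code generates the 9 anchors for one point of the feature map:
base_size=16: in spatial size, the 51×39×512 feature map is 1/16 of the input image;
that is, the feature-map width is about the image width / 16 and its height about the image height / 16 (a 1000*600 image gives about 62×37; 51×39 corresponds to an input of roughly 800*600). This is what base_size stands for: it ties the anchors back to the scale of the input image.
ratios=[0.5, 1, 2]: this is anchor_width:anchor_height = {1:2, 1:1, 2:1};
scales=2**np.arange(3,6): some articles mention these scales. np.arange(3,6) = (3,4,5), so scales = {2^3, 2^4, 2^5} = {8, 16, 32}, which are the anchor side lengths measured on the feature map;
so the anchor side lengths on the input image = {base_size×8, base_size×16, base_size×32} = {16×8, 16×16, 16×32} = {128, 256, 512}.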
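
To see how the 3 ratios and 3 scales combine into 9 boxes, here is a simplified, self-contained sketch. `make_anchors` is not the repository function: it skips the rounding that generate_anchors performs, so its numbers differ slightly, and treating each ratio as height / width is an assumption of this sketch.

```python
import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Return [9, 4] anchors (x1, y1, x2, y2) centered on the base cell."""
    ctr = (base_size - 1) / 2.0              # center of the (0, 0, 15, 15) window
    anchors = []
    for r in ratios:                         # r is treated as height / width
        for s in scales:
            side = base_size * s             # side of the square with the same area
            w = side / np.sqrt(r)
            h = side * np.sqrt(r)
            anchors.append([ctr - (w - 1) / 2, ctr - (h - 1) / 2,
                            ctr + (w - 1) / 2, ctr + (h - 1) / 2])
    return np.array(anchors)

anchors = make_anchors()
```

Each group of three rows shares one aspect ratio, and within a group the areas are 128², 256² and 512², all in input-image pixels.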

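2.2 What is done with each anchor

In the 51×39×512 feature map, every point (51×39 points in total, each 512-d) produces k=9 anchors. Two things are now done for each anchor:

1. Classification: classify each anchor as foreground or background. What is its foreground score, and what is its background score?
Each anchor is classified as foreground or background, so each anchor has 2 scores;
each point of the 51×39×512 feature map has k=9 anchors, so each point yields (2×k) scores;
in other words, at each point the 512-d feature is turned into cls = 2k scores.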

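In the TensorFlow source Faster-RCNN_TF, the file Faster-RCNN_TF/lib/networks/VGGnet_train.py contains the following:
[Image: code screenshot, not preserved]
This segment is how the RPN's anchor classification is implemented in practice: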

self.feed('conv5_3'): the input to the following conv layer is 'conv5_3', i.e. the output of VGG16's conv5_3 layer: the 51×39×512 feature map, which in TensorFlow has size [batchsize,h,w,channel]=[batchsize,39,51,512];

.conv(3,3,512,1,1,name='rpn_conv/3x3'): a conv layer placed after VGG16's conv5_3. 3x3 kernel, num_output=512, pad=1, stride=1; it does not change the input size. Its output is called 'rpn_conv/3x3' and has size [batchsize,h,w,channel]=[batchsize,39,51,512];

.conv(1,1,len(anchor_scales)*3*2, 1, 1, padding='VALID', relu=False, name='rpn_cls_score'): this is the layer that produces the anchor scores. It is also a conv layer,
with a 1*1 kernel, num_output=len(anchor_scales)*3*2=3*3*2=2k=18, stride=1, and no padding (padding='VALID'). Here anchor_scales=[8, 16, 32], as set in the paper, so len(anchor_scales)=3; the meaning of these scales is covered in section 2.1. The output of this layer is called 'rpn_cls_score' and has size [batchsize,h,w,channel]=[batchsize,39,51,18].

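2. Regression: regress each anchor so that it gets as close to the gt as possible.
Each anchor has 4 offset coordinates [x,y,w,h];
each point of the 51×39×512 feature map has k=9 anchors, so each point yields (4×k) offsets;
in other words, at each point the 512-d feature is turned into reg = 4k coordinates.
For the whole feature map, this means going from 51×39×512 to 51×39×4k, which in practice is done with a 1×1 convolution with 4k output channels.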

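These two things are done by two fully connected layers (implemented in practice as 1×1 conv layers), one per task, placed side by side. Locally, each of the two layers is a fully connected network; globally, since the network uses the same parameters at every position (51*39 positions in total), it is implemented as a convolution with a 1×1 kernel.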
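
The equivalence between "one fully connected layer shared by all positions" and "a 1×1 convolution" can be seen in a few lines of NumPy. This is only a shape-level sketch with random weights; the variable names are invented, and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 9
feat = rng.standard_normal((39, 51, 512))     # [H, W, C], stands in for rpn_conv/3x3
w_cls = rng.standard_normal((512, 2 * k))     # shared weights for the 2k scores
w_reg = rng.standard_normal((512, 4 * k))     # shared weights for the 4k offsets

# A 1x1 convolution is the same matrix product applied at every (h, w) position.
cls_score = feat @ w_cls                      # [39, 51, 18]
bbox_pred = feat @ w_reg                      # [39, 51, 36]
```

Picking any single position, for example `feat[3, 7] @ w_cls`, gives exactly the 18 scores stored in `cls_score[3, 7]`, which is the "locally fully connected" view.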

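The 51×39×512 feature map produces 51×39×k ≈ 20000 anchors (note: the sizes of these anchors are measured on the input image). Classifying and regressing all 20000 anchors would be too much work, so from the suitable anchors we randomly pick only 128 positive anchors + 128 negative anchors.

So which anchors are suitable?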

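2.3 Selecting suitable anchors

1. Discard anchors that cross the image boundary
Many of the nearly 20000 anchors extend beyond the boundary of the input image; these are discarded. This step drops about 2/3 of the anchors, leaving about 20000*1/3 ≈ 6000.
2. Label the anchors that remain
The roughly 6000 anchors left from the previous step are split into positive and negative samples: a positive anchor gets label=1 and a negative anchor gets label=0. This is the labeling of those 6000-odd anchors.
Rules:
a) For each ground-truth box gt, compute the overlap (IoU) with every anchor; the anchor with the largest IoU is marked as a foreground sample (fg).
b) Among the anchors left after a), an anchor whose IoU with some gt is greater than 0.7 is marked fg; an anchor whose IoU with every gt is less than 0.3 is marked as a background sample (bg).
c) The anchors left after a) and b) (IoU with the gt in [0.3, 0.7]) are not used; their label is -1.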
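
Rules a) to c) can be sketched in NumPy as follows. `iou_matrix` and `label_anchors` are illustrative names rather than functions from the repository; the +1 in the width and height follows the pixel-inclusive box convention and is an assumption, and ties in rule a) are resolved by simply taking the first maximum.

```python
import numpy as np

def box_area(b):
    return (b[:, 2] - b[:, 0] + 1) * (b[:, 3] - b[:, 1] + 1)

def iou_matrix(anchors, gts):
    """anchors: [N, 4], gts: [M, 4], boxes as (x1, y1, x2, y2). Returns [N, M]."""
    ix1 = np.maximum(anchors[:, None, 0], gts[None, :, 0])
    iy1 = np.maximum(anchors[:, None, 1], gts[None, :, 1])
    ix2 = np.minimum(anchors[:, None, 2], gts[None, :, 2])
    iy2 = np.minimum(anchors[:, None, 3], gts[None, :, 3])
    inter = np.clip(ix2 - ix1 + 1, 0, None) * np.clip(iy2 - iy1 + 1, 0, None)
    return inter / (box_area(anchors)[:, None] + box_area(gts)[None, :] - inter)

def label_anchors(anchors, gts, hi=0.7, lo=0.3):
    iou = iou_matrix(anchors, gts)
    labels = np.full(len(anchors), -1)       # c) default: not used
    max_iou = iou.max(axis=1)                # best overlap of each anchor
    labels[max_iou < lo] = 0                 # b) background
    labels[max_iou > hi] = 1                 # b) foreground
    labels[iou.argmax(axis=0)] = 1           # a) best anchor of each gt
    return labels
```

Rule a) is applied last so that every gt keeps at least one positive anchor, even when its best IoU is below 0.3.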

After these steps, the anchors have shape [6000,4] (6000 anchors; the 4 channels are the coordinates of the top-left and bottom-right corners, [x1,y1,x2,y2]), and the corresponding labels have shape [6000,1].

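3. Randomly pick 256 of the positive and negative anchors
We now have positive and negative anchors. Next, 128 positive anchors and 128 negative anchors are picked at random from them. If there are fewer than 128 positive anchors, negative anchors make up the difference.

So for one image, only these 256 anchors are actually classified and regressed in the end.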
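
A minimal sketch of this sampling step, assuming labels of 1, 0 and -1 as described above. `sample_anchors` and its parameters are hypothetical names; anchors that are not picked are switched to -1 so that they are ignored in the loss.

```python
import numpy as np

def sample_anchors(labels, batch=256, fg_fraction=0.5, seed=0):
    """Keep at most `batch` labelled anchors; set the others to -1."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    fg = np.flatnonzero(labels == 1)
    max_fg = int(batch * fg_fraction)             # 128 positives at most
    if len(fg) > max_fg:
        labels[rng.choice(fg, len(fg) - max_fg, replace=False)] = -1
    max_bg = batch - int((labels == 1).sum())     # negatives fill the remainder
    bg = np.flatnonzero(labels == 0)
    if len(bg) > max_bg:
        labels[rng.choice(bg, len(bg) - max_bg, replace=False)] = -1
    return labels
```

With only 30 positive anchors available, for instance, the function keeps all 30 and 226 negatives, so the total is still 256.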

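2.4 Classifying and regressing anchors: implementation
Faster-RCNN_TF/lib/networks/VGGnet_train.py defines the overall structure of the network. Below is a code segment from it that defines the RPN part:
[Image: code screenshot, not preserved]
conv5_3: the output of VGG16's conv5_3 layer, size [batchsize, H, W, 512]; batchsize here is the number of images;
rpn_conv/3×3: after conv5 comes a 'rpn_conv/3x3' layer (3x3 convolution, pad=1, stride=1, num_output=512; it does not change the spatial size). Each point thereby also absorbs the 3x3 spatial context around it (my guess is that this may be more robust; I have not tested it), while the depth stays 512-d. The output feature map still has size [batchsize, H, W, 512].
rpn_cls_score: the direct output of the RPN classification layer, i.e. the scores of every anchor. Size [batchsize, H, W, 2*9]. Each point of the previous layer's feature map has 9 anchors (H×W×9 ≈ 20000 anchors in total), and each anchor has 2 scores (the score for being a positive sample and the score for being a negative sample), so each point has 2×9=18 scores. A conv layer is used to produce them.
rpn_data: this output has 4 parts:
rpn_data[0]: rpn_labels, size [20000,1] (H×W×9 ≈ 20000). This is the first $p_i^*$ in rpn_loss. It corresponds one-to-one to the anchors and holds their labels: 1 means the anchor is a positive sample, 0 means it is a negative sample, and -1 means the anchor is to be discarded. (Note: although these anchors are to be discarded, they still exist in the original anchors matrix and have not been set to 0. They have only been labeled as discarded; at this point they have not actually been removed.) Of these roughly 20000 labels, 128 are 1 (fg, positive anchors), 128 are 0 (bg, negative anchors), and the rest are -1.
rpn_data[1]: rpn_bbox_targets, the $t_i^*$ in rpn_loss.
rpn_data[2]: rpn_bbox_inside_weight, size [20000,4]. This is the second $p_i^*$ in rpn_loss, and it also corresponds to the anchors and to rpn_labels. Where rpn_labels=1 it equals cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHT (the default is 1; it is adjustable and need not be 1), and it is 0 everywhere else (rpn_labels contains -1 values; this array does not).
rpn_data[3]: rpn_bbox_outside_weight; this is not written out explicitly in rpn_loss.
rpn_bbox_pred: the direct output of the RPN regression layer, i.e. the 4 offsets of the anchors: $t_x, t_y, t_w, t_h$. Size [batchsize, H, W, 4*9]; batchsize here is the number of images.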

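Q: Why does the network not regress the box's x,y,w,h directly, but instead regress the box's offsets $t_x, t_y, t_w, t_h$ relative to an anchor?
A: If the network regressed x,y,w,h directly, the anchors would effectively go unused and there would be no point in defining them. Then why define anchors at all? Without anchors, with the network regressing x,y,w,h directly, each point of the feature map output by VGG16's conv5_3 could regress only one box, and with a fixed-size feature map there would be no multi-scale regression. With anchors, each point regresses 9 anchors of different sizes, and the weights for these 9 anchors are independent of one another, so even with a fixed-size feature map the network can predict boxes of various sizes.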

3. The loss function of the RPN

RPN_loss = classification loss (rpn_cross_entropy) + regression loss (rpn_loss_box)

RPN_loss is defined in Faster-RCNN_TF/lib/fast_rcnn/train.py in the Faster-RCNN_TF source.
For batchsize images (here batchsize really is the image batch size; the code computes the total loss over batchsize images):

http://blog.csdn.net/u013252298/article/details/68961927
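bbox_outside_weights sets the relative weights of positive and negative samples among all samples. Since all of the operations above are carried out only on the anchors that do not cross the image boundary, the results have to be mapped back onto the full set of anchors. The _unmap method is used for this.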
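
The idea behind mapping results back to all anchors can be sketched as follows. `unmap` here is a simplified stand-in written for this note; the repository's `_unmap` may differ in its details.

```python
import numpy as np

def unmap(data, count, inds, fill=0):
    """Scatter `data`, computed only for the anchors at `inds`, back into an
    array covering all `count` anchors; every other entry is set to `fill`."""
    out = np.full((count,) + data.shape[1:], fill, dtype=np.float32)
    out[inds] = data
    return out

# labels for the 3 inside anchors of 6 total; the rest become -1 (ignored)
inside = np.array([0, 2, 5])
all_labels = unmap(np.array([1, 0, 1]), 6, inside, fill=-1)
```

The same call with a 2-D `data` array and `fill=0` would restore [N,4] arrays such as the bbox targets and weights.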

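(4) Generating proposals: proposal_layer_tf.py

This layer uses rpn_cls_score, the output of the RPN classification layer, to filter rpn_bbox_pred, the output of the RPN regression layer. It picks out the fg part and applies a series of post-processing steps (NMS and the like).

1. Building the network structure that generates proposals
First, take rpn_cls_score, the direct output of the RPN classification layer, and compute its softmax. The two reshapes are there to make the softmax computation work.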

    (self.feed('rpn_cls_score')
         .reshape_layer(2, name='rpn_cls_score_reshape')
         .softmax(name='rpn_cls_prob'))

    (self.feed('rpn_cls_prob')
         .reshape_layer(len(anchor_scales)*3*2, name='rpn_cls_prob_reshape'))

We now have rpn_cls_prob_reshape. Combined with rpn_bbox_pred (the direct output of the RPN regression layer) and the image information im_info, it is used to generate rpn_rois.

    # rpn_rois are the regions of interest that are finally extracted
    (self.feed('rpn_cls_prob_reshape', 'rpn_bbox_pred', 'im_info')
         .proposal_layer(_feat_stride, anchor_scales, 'TRAIN', name='rpn_rois'))

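So how exactly does the proposal_layer generate rpn_rois?

2. Generating the final proposals, i.e. rpn_rois

2.1 Generating the initial proposals

The proposal_layer is defined in Faster-RCNN_TF/lib/rpn_msr/proposal_layer_tf.py, mainly through its proposal_layer() function. Let us open proposal_layer_tf.py and go through it in order:

First, the anchors are generated again, in the same way as in the RPN.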

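    # At this point there are only the 9 anchors of a single point, with sizes
    # in input-image pixels. Shape [9, 4]; the 4 channels are the top-left and
    # bottom-right coordinates x1, y1, x2, y2.
    _anchors = generate_anchors(scales=np.array(anchor_scales))
    _num_anchors = _anchors.shape[0]  # 9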

    # 1. Generate proposals from bbox deltas and shifted anchors
    height, width = scores.shape[-2:]

    if DEBUG:
        print 'score map size: {}'.format(scores.shape)

    # Enumerate all shifts
    shift_x = np.arange(0, width) * _feat_stride  # _feat_stride=16
    shift_y = np.arange(0, height) * _feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                        shift_x.ravel(), shift_y.ravel())).transpose()

    # Enumerate all shifted anchors:
    #
    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    A = _num_anchors
    K = shifts.shape[0]
    anchors = _anchors.reshape((1, A, 4)) + \
              shifts.reshape((1, K, 4)).transpose((1, 0, 2))
    # [20000, 4]: these are all the anchors of the whole feature map,
    # with sizes measured on the input image
    anchors = anchors.reshape((K * A, 4))

The two code blocks above are what generate the anchors.
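
The broadcasting trick in that code is easier to follow on a tiny example. The sketch below is a standalone Python 3 rewrite of the same idea on a made-up 2×3 feature map with only 2 made-up base anchors; it is not the repository code.

```python
import numpy as np

feat_stride = 16
height, width = 2, 3                                    # tiny feature map
base = np.array([[-8, -8, 23, 23], [0, 0, 15, 15]])     # A = 2 made-up base anchors

# One (x, y, x, y) shift per feature-map cell, in input-image pixels.
shift_x, shift_y = np.meshgrid(np.arange(width) * feat_stride,
                               np.arange(height) * feat_stride)
shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                   shift_x.ravel(), shift_y.ravel()], axis=1)   # [K, 4], K = 6

# Broadcast (1, A, 4) + (K, 1, 4) -> (K, A, 4), then flatten to (K*A, 4).
anchors = (base[None, :, :] + shifts[:, None, :]).reshape(-1, 4)
```

The result lists all anchors of cell 0 first, then all anchors of cell 1, and so on, row by row across the feature map.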

Then take rpn_cls_prob_reshape and rpn_bbox_pred; of the scores, only the fg part is used.

    # the first set of _num_anchors=9 channels are bg probs
    # the second set are the fg probs, which we want
    scores = rpn_cls_prob_reshape[:, _num_anchors:, :, :]
    bbox_deltas = rpn_bbox_pred

Next, the generated anchors ([20000,4]) and bbox_deltas are used to produce the initial proposals.

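    # Convert anchors into proposals via bbox transformations
    # Use the anchors ([20000,4]) and bbox_deltas ([:,4]), the direct output
    # of the RPN regression layer, to generate proposals.
    # These proposals use all of the initial anchors ([20000,4]), i.e. the
    # proposals have not been processed in any way yet.
    # The returned proposals are the boxes predicted by the RPN, [20000,4];
    # the 4 channels are the top-left and bottom-right coordinates x1, y1, x2, y2.
    proposals = bbox_transform_inv(anchors, bbox_deltas)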

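Numerically, bbox_deltas is rpn_bbox_pred, the direct output of the RPN regression layer, but it has been reshaped: its size is now [batchsize,H,W] with 4 channels, which are $t_x, t_y, t_w, t_h$ in the formula below.
$$t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a}$$  Formula (2)
The values in the proposals matrix are the coordinates of the regressed boxes: they are computed from x, y, w, h in the formula above and stored as the corner coordinates x1, y1, x2, y2. This is what we actually need.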
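
Inverting formula (2) gives x = t_x·w_a + x_a, y = t_y·h_a + y_a, w = w_a·exp(t_w), h = h_a·exp(t_h). The sketch below applies this to corner-format anchors. `decode_boxes` is a hypothetical stand-in for bbox_transform_inv, and the +1 used when computing the anchor width and height is an assumed convention.

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """anchors: [N, 4] as (x1, y1, x2, y2); deltas: [N, 4] as (t_x, t_y, t_w, t_h).
    Inverts formula (2) and returns boxes as (x1, y1, x2, y2)."""
    wa = anchors[:, 2] - anchors[:, 0] + 1.0
    ha = anchors[:, 3] - anchors[:, 1] + 1.0
    xa = anchors[:, 0] + 0.5 * wa              # anchor center
    ya = anchors[:, 1] + 0.5 * ha
    x = deltas[:, 0] * wa + xa                 # predicted center
    y = deltas[:, 1] * ha + ya
    w = np.exp(deltas[:, 2]) * wa              # predicted size
    h = np.exp(deltas[:, 3]) * ha
    return np.stack([x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h], axis=1)
```

With all deltas equal to zero the predicted box has the same center and size as the anchor, which is a quick sanity check for any implementation.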

2.2 Clipping proposals to the boundary of the input image

    # 2. clip predicted boxes to image
    # cut off the parts of the proposals that lie outside the input image
    proposals = clip_boxes(proposals, im_info[:2])
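
What clipping does to the coordinates can be sketched like this. `clip_to_image` is an illustrative rewrite, not the repository's clip_boxes, and it assumes pixel coordinates running from 0 to width-1 and from 0 to height-1.

```python
import numpy as np

def clip_to_image(boxes, im_height, im_width):
    """Clamp (x1, y1, x2, y2) boxes so they lie inside the image."""
    boxes = boxes.copy()
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, im_width - 1)    # x1, x2
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, im_height - 1)   # y1, y2
    return boxes
```

Note that boxes are shrunk rather than dropped in this step; removal of boxes only happens in the later filtering steps.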

2.3 Removing proposals whose height or width is < min_size=16

    # 3. remove predicted boxes with either height or width < threshold
    # (NOTE: convert min_size to input image scale stored in im_info[2])
    # both the height and the width of a proposal must be >= min_size=16
    keep = _filter_boxes(proposals, min_size * im_info[2])
    proposals = proposals[keep, :]
    scores = scores[keep]
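
A sketch of the size filter, assuming the same pixel-inclusive width and height as before. `filter_small` is a made-up name standing in for _filter_boxes.

```python
import numpy as np

def filter_small(boxes, min_size):
    """Return the indices of boxes whose width and height are both >= min_size."""
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    return np.flatnonzero((ws >= min_size) & (hs >= min_size))
```

The returned indices are then used to index both the proposals and their scores, so the two arrays stay aligned.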

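2.4 Sorting the scores from high to low and keeping the corresponding top pre_nms_topN=6000 proposals
The scores here come from the RPN classification layer, with only the fg part kept, as described in 2.1.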

    # 4. sort all (proposal, score) pairs by score from highest to lowest
    # 5. take top pre_nms_topN (e.g. 6000)
    order = scores.ravel().argsort()[::-1]
    if pre_nms_topN > 0:
        order = order[:pre_nms_topN]
    proposals = proposals[order, :]  # keep the proposals with the top pre_nms_topN scores
    scores = scores[order]
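
The sort-and-truncate pattern on a toy input, with made-up scores and boxes:

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.5, 0.7])
proposals = np.arange(16, dtype=float).reshape(4, 4)   # 4 made-up boxes
pre_nms_topN = 3

# argsort is ascending, so reverse it and keep the first pre_nms_topN indices
order = scores.argsort()[::-1][:pre_nms_topN]
proposals, scores = proposals[order], scores[order]
```

Indexing both arrays with the same `order` keeps every proposal paired with its own score.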

2.5 Applying NMS
Apply NMS, then, following the order of the fg softmax scores after NMS, keep the corresponding top post_nms_topN=300 proposals.

    # 6. apply nms (e.g. threshold = 0.7)
    # 7. take after_nms_topN (e.g. 300)
    # 8. return the top proposals (-> RoIs top)
    keep = nms(np.hstack((proposals, scores)), nms_thresh)
    if post_nms_topN > 0:
        keep = keep[:post_nms_topN]
    proposals = proposals[keep, :]
    scores = scores[keep]
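
For readers who have not seen NMS written out, here is a plain NumPy version of the usual greedy algorithm. It is a sketch for illustration: the repository calls its own nms implementation, whose interface (a single [N,5] array) differs from the two-argument function below.

```python
import numpy as np

def nms(boxes, scores, thresh):
    """Greedy NMS. boxes: [N, 4] as (x1, y1, x2, y2), scores: [N].
    Returns the kept indices, highest score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]             # best box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        w = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]) + 1)
        h = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]) + 1)
        iou = w * h / (areas[i] + areas[rest] - w * h)
        order = rest[iou <= thresh]            # drop boxes overlapping box i too much
    return keep
```

Since the kept indices come out sorted by score, taking `keep[:post_nms_topN]` afterwards selects the highest-scoring survivors.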

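The proposals obtained at this point are the regions of interest produced by the RPN, i.e. rpn_rois: 300 boxes (their coordinates), not the feature values inside those regions.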
