Paper notes: Feature Pyramid Networks for Object Detection


Paper: https://arxiv.org/pdf/1612.03144.pdf

Motivation

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.

In short: feature pyramids have always been effective, just expensive in time and memory; the authors propose a method that uses them efficiently.

Introduction

Several designs that use feature pyramids (illustrated in Figure 1 of the paper):

  1. Build a feature pyramid from an image pyramid
  2. Predict only on the topmost feature level
  3. Predict separately at each level of the feature hierarchy
  4. FPN: carry high-level information down to the lower levels, then predict at each level

We rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections.

The benefit the authors point out is the fusion of low-level detail with high-level semantics.

Similar architectures adopting top-down and skip connections are popular in recent research. Their goals are to produce a single high-level feature map of a fine resolution, on which the predictions are to be made.

Many others have thought of such fusion, but they did not make predictions at each level.

Feature Pyramid Networks

Goal

Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion.

Implementation

Bottom-up pathway


Specifically, for ResNets we use the feature activations output by each stage’s last residual block. We denote the output of these last residual blocks as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.

The bottom-up pathway is simply the network's ordinary forward pass. For the later steps, features are extracted stage by stage; where several layers produce feature maps of the same size, only the topmost one (the last layer of each stage) is taken. In the experiments, the ResNet convolutional stages are divided into {C1, C2, C3, C4, C5}, and the feature maps of C2, C3, C4, C5 are used. C1 is skipped because its feature map is too large (memory footprint).
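The stride relationship above can be sketched numerically. A minimal illustration (the function name and shapes are my own, not from the paper): each stage output Ck has stride 2^k, so {C2..C5} shrink the input by {4, 8, 16, 32}.

```python
def stage_shapes(h, w, strides=(4, 8, 16, 32)):
    """Return the (height, width) of C2..C5 for an h x w input,
    assuming each stage halves the resolution (strides 4..32)."""
    return [(h // s, w // s) for s in strides]

# A 224x224 input yields 56x56, 28x28, 14x14, and 7x7 stage outputs.
print(stage_shapes(224, 224))
```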

Top-down pathway and lateral connections


With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition.

The upper-level feature map is upsampled to the size of the one below it; the lower-level map passes through a 1×1 convolution so its channel count matches; the two are then added element-wise.

To start the iteration, we simply attach a 1×1 convolutional layer on C 5 to produce the coarsest resolution map.

To start the iteration, the topmost (coarsest) feature map is produced by attaching a 1×1 convolution to C5.

Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.

To reduce the aliasing effect of upsampling, each merged map also passes through a 3×3 convolution to produce the final feature maps, denoted {P2, P3, P4, P5}.
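One top-down merge step can be sketched in numpy. This is a toy illustration under my own assumptions (zero/one-valued maps, a hand-picked lateral weight matrix, and the final 3×3 smoothing convolution omitted for brevity), not the paper's trained model:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lateral_1x1(x, w):
    """1x1 convolution = a per-pixel linear map over channels.
    x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def merge(top, bottom, w):
    """One top-down step: upsample the coarser map, project the
    bottom-up map to the same channel count with a 1x1 conv, and
    add element-wise. (The 3x3 smoothing conv that yields the
    final P-level is omitted in this sketch.)"""
    return upsample2x(top) + lateral_1x1(bottom, w)

# Toy shapes: a C5-like (256, 7, 7) map merged with a C4-like (1024, 14, 14) map.
top = np.zeros((256, 7, 7))
bottom = np.ones((1024, 14, 14))
w = np.zeros((256, 1024))
w[:, 0] = 1.0  # hypothetical lateral weights: copy channel 0
p4_pre = merge(top, bottom, w)
print(p4_pre.shape)  # the merged map keeps the finer 14x14 resolution
```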

Applications

RPN

We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level on our feature pyramid.

The original small head network is attached to every pyramid level, with its parameters shared across levels.

Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively. As in [28] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.

The anchor aspect ratios stay the same, since each level already covers a different scale. This gives 15 anchors in total, more than the previous 9.
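The 15-anchor count can be reproduced directly: one area per level {P2..P6}, three aspect ratios each. A small sketch (the function and the height/width derivation are mine; for area a and ratio r = h/w, w = sqrt(a/r) and h = w·r):

```python
import math

def anchor_sizes(areas=(32**2, 64**2, 128**2, 256**2, 512**2),
                 ratios=(0.5, 1.0, 2.0)):
    """One anchor scale per pyramid level, three aspect ratios per scale.
    Returns (height, width) pairs whose product equals the level's area."""
    anchors = []
    for a in areas:
        for r in ratios:  # r = height / width
            w = math.sqrt(a / r)
            anchors.append((round(w * r, 1), round(w, 1)))
    return anchors

print(len(anchor_sizes()))  # 15 anchors over the pyramid
```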

Fast R-CNN

Select a feature map of the appropriate scale. The pyramid level to pool from is computed as:

k = ⌊k0 + log2(√(wh) / 224)⌋

k0 is the base level set in advance: the level that would be used if this scheme were not applied, i.e. the level a 224×224 RoI maps to (the paper uses k0 = 4).
