Going Deeper with Convolutions: Reading Notes

Original paper: https://arxiv.org/pdf/1409.4842.pdf

Abstract

Propose a deep convolutional neural network architecture codenamed Inception.

The main hallmark of this architecture is the improved utilization of the computing resources inside the network.

Increase the depth and width of the network while keeping the computational budget constant.

Architectural decisions are based on the Hebbian principle and the intuition of multi-scale processing.

Note:
Hebbian theory:
A theory in neuroscience that proposes an explanation for the adaptation of neurons in the brain during the learning process, describing a basic mechanism for synaptic plasticity, where an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell (from Wikipedia).
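
As a rough illustration of the Hebbian principle ("neurons that fire together wire together"), its simplest formal statement is the weight update below. The symbols are my own notation, not anything from the paper (\eta: learning rate, x_i: presynaptic activity, y_j: postsynaptic activity):

    \Delta w_{ij} = \eta \, x_i \, y_j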

Introduction

Uses 12 times fewer parameters than AlexNet.

This architecture can run efficiently on mobile and embedded devices thanks to its modest power and memory footprint.

The standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling), followed by one or more fully-connected layers.

Max-pooling layers result in loss of accurate spatial information.

Inspired by a neuroscience model of the primate visual cortex, earlier work used a series of fixed Gabor filters of different sizes to handle multiple scales. In contrast, all filters in the Inception architecture are learned, and Inception layers are repeated many times (leading to the 22-layer GoogLeNet).

Network-in-Network is an approach to increase the representational power of neural networks. Additional 1x1 convolutional layers are added into the network to increase its depth. The 1x1 convolutions serve a dual purpose (see the sketch after this list):

  • Dimension reduction modules to remove computational bottlenecks.
  • Increasing the depth and width of the network without a significant performance penalty.
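
A minimal PyTorch sketch (my own illustration, not code from the paper) of how a 1x1 convolution reduces the channel dimension before an expensive 3x3 convolution; the channel counts are arbitrary:

```python
import torch
import torch.nn as nn

# 1x1 convolution used as a dimension-reduction module: it shrinks the number of
# channels (256 -> 64 here, arbitrary values) so that the following 3x3 convolution
# operates on a much cheaper input, while depth grows by one extra ReLU nonlinearity.
reduce_then_conv = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),             # 1x1 "reduce" layer
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # expensive 3x3 convolution on fewer channels
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 28, 28)   # dummy feature map: batch=1, 256 channels, 28x28
print(reduce_then_conv(x).shape)  # torch.Size([1, 128, 28, 28])
```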

R-CNN decomposes the overall detection problem into two subproblems:

  • Utilize low-level cues such as color and texture to generate object location proposals in a category-agnostic fashion.
  • Use CNN classifiers to identify object categories at those locations.

The Inception architecture improves both of the above stages:

  • multi-box prediction for higher object bounding box recall.
  • ensemble approaches for better categorization of bounding box proposals.

Note:
1.Local Contrast Normalization

Enforcing a sort of local competition between adjacent features in a feature map and between features at the same spatial location in different feature maps.

  • local subtractive normalization
    v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \, x_{i,j+p,k+q}

    w_{pq}: Gaussian weighting window, normalized so that \sum_{pq} w_{pq} = 1
  • local divisive normalization
    y_{ijk} = v_{ijk} / \max(c, \sigma_{jk})

    \sigma_{jk} = \left( \sum_{ipq} w_{pq} \, v_{i,j+p,k+q}^2 \right)^{1/2}

    c = \mathrm{mean}(\sigma_{jk})

reference: What is the Best Multi-Stage Architecture for Object Recognition?
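
A minimal NumPy/SciPy sketch of the two steps above (my own illustration; the Gaussian width and the channel-wise averaging, which stands in for the sum over feature maps i, are assumptions rather than values from the referenced paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma=2.0, eps=1e-8):
    """x: feature maps of shape (C, H, W). The spatial Gaussian plays the role of w_pq."""
    # Subtractive step: v = x - weighted local mean over space and channels.
    local_mean = gaussian_filter(x, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    v = x - local_mean
    # Divisive step: sigma_jk = sqrt(weighted local mean of v^2), c = mean of sigma_jk.
    local_std = np.sqrt(gaussian_filter(v ** 2, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True))
    c = local_std.mean()
    return v / np.maximum(np.maximum(c, local_std), eps)

# Usage on a dummy 3-channel feature map:
print(local_contrast_normalize(np.random.rand(3, 64, 64)).shape)  # (3, 64, 64)
```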

2.Gabor filter
In image processing, a Gabor filter is a linear filter used for edge detection.

impulse response:
Complex:
g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \exp\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)
Real:
g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi\frac{x'}{\lambda} + \psi\right)
Imaginary:
g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \sin\left(2\pi\frac{x'}{\lambda} + \psi\right)
where:
x' = x\cos\theta + y\sin\theta
y' = -x\sin\theta + y\cos\theta
with \lambda the wavelength of the sinusoidal factor, \theta the orientation, \psi the phase offset, \sigma the standard deviation of the Gaussian envelope, and \gamma the spatial aspect ratio.
reference: Wikipedia (Gabor filter)
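
A small NumPy sketch (my own, following the real-part formula above) that builds a single Gabor kernel; the parameter values are arbitrary defaults:

```python
import numpy as np

def gabor_kernel(size=21, wavelength=5.0, theta=0.0, psi=0.0, sigma=3.0, gamma=0.5):
    """Real part of a Gabor filter: a Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinate x'
    y_r = -x * np.sin(theta) + y * np.cos(theta)   # rotated coordinate y'
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_r / wavelength + psi)
    return envelope * carrier

print(gabor_kernel().shape)  # (21, 21)
```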

Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is increasing their depth and width. However, this leads to two drawbacks:

  • A network with a large number of parameters is more prone to overfitting, especially if the number of labeled examples in the training dataset is limited.
  • Computational resources are finite.

A fundamental way of solving both issues is to introduce sparsity and replace the fully connected layers with sparse ones, even inside the convolutions.

If the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs.

Today's computing infrastructure is inefficient when it comes to numerical calculation on non-uniform sparse data, especially because numerical libraries are tuned for extremely fast dense matrix multiplication. Most current utilization of sparsity comes from convolutions themselves; however, convolutions are implemented as collections of dense connections to the patches in the earlier layers. Traditional ConvNets used random and sparse connection tables to break symmetry, but the trend has moved back to full connections to better optimize parallel computation. Current state-of-the-art architectures for computer vision have uniform structure, where large numbers of filters and greater batch sizes allow for efficient dense computation.

The Inception architecture aims to exploit the advantages of both sparse structures and dense matrix multiplication.

Clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication.

Note:
1.Dense and sparse matrix
If most elements of a matrix are zero, the matrix is sparse.
If most elements of a matrix are nonzero, the matrix is dense.
reference: wikipedia

Architectural Details

main idea: finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.

Translation invariance means the whole network can be built from convolutional building blocks: find the optimal local construction and repeat it spatially.

We should analyze the correlation statistics of the preceding layer and cluster its units into groups with high correlation.

In the lower layers, correlated units are concentrated in local regions, so they can be covered with 1x1 convolutions. However, there can also be more spatially spread-out clusters, which should be covered by convolutions over larger patches, and there will be a decreasing number of such patches over larger and larger regions.

The current Inception architecture is restricted to filter sizes 1x1, 3x3 and 5x5.

Combine all those layers, with their output filter banks concatenated into a single output vector that forms the input of the next stage.

[Figure: Inception module]
The ratio of 3x3 and 5x5 convolutions should increase as we move to higher layers because correlated units’ concentration is expected to decrease.

However, the computational budget blows up as the number of Inception modules increases. This leads to the second idea: judiciously applying dimension reductions and projections wherever the computational requirements would otherwise increase too much.

Ways of dimension reduction:

  • 1x1 convolutions
  • max-pooling

Benefit:

  • It allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity.
  • Visual information is processed at various scales and then aggregated, so the next stage can abstract features from different scales simultaneously.

With careful manual design, networks using Inception modules can be 2-3x faster than similarly performing architectures without them.

It is suggested to start using Inception modules only at higher layers.
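
A hedged PyTorch sketch, assembled from the description above, of an Inception module with dimension reductions. The channel counts follow the inception(3a) row of the GoogLeNet table, but the code itself is my own illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose output filter banks are concatenated along the channel axis."""
    def __init__(self, in_ch, n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(                        # plain 1x1 convolution branch
            nn.Conv2d(in_ch, n1x1, kernel_size=1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(                        # 1x1 reduce, then 3x3 convolution
            nn.Conv2d(in_ch, n3x3_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(n3x3_reduce, n3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(                        # 1x1 reduce, then 5x5 convolution
            nn.Conv2d(in_ch, n5x5_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(n5x5_reduce, n5x5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.branch4 = nn.Sequential(                        # 3x3 max-pool, then 1x1 projection
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

# inception(3a): 64 + 128 + 32 + 32 = 256 output channels on a 28x28 grid.
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(module(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```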

GoogLeNet

[Table: GoogLeNet incarnation of the Inception architecture]

All the convolutions, including those inside the Inception modules, use rectified linear activation.

The size of the receptive field in our network is 224×224 taking RGB color channels with mean subtraction.

“#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions.

Average pooling and an extra linear layer are used before the classifier; the extra layer makes it easy to adapt and fine-tune the network for other label sets, and it does not affect performance significantly.

Auxiliary classifiers connected to intermediate layers are added to encourage discrimination in the lower stages, increase the gradient signal that gets propagated back, and provide additional regularization. During training, their losses are added to the total loss with a weight of 0.3; at inference time they are discarded. (A sketch of such an auxiliary head follows the list below.)

Extra network:

  • An average pooling layer with 5×5 filter size and stride 3, resulting in an 4×4×512 output
    for the (4a), and 4×4×528 for the (4d) stage.
  • A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
  • A fully connected layer with 1024 units and rectified linear activation.
  • A dropout layer with 70% ratio of dropped outputs.
  • A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the
    main classifier, but removed at inference time).
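
A hedged PyTorch sketch of the auxiliary head listed above, attached to the (4a) output (14×14×512 feature maps); the 0.3 loss weight from the preceding paragraph appears only as a comment, and the code is my own illustration rather than the paper's implementation:

```python
import torch.nn as nn

# Auxiliary classifier head for the (4a) stage: avg-pool -> 1x1 conv -> FC -> dropout -> FC.
# During training its cross-entropy loss would be added as: total_loss += 0.3 * aux_loss.
aux_head_4a = nn.Sequential(
    nn.AvgPool2d(kernel_size=5, stride=3),   # 14x14x512 -> 4x4x512
    nn.Conv2d(512, 128, kernel_size=1),      # 1x1 convolution for dimension reduction
    nn.ReLU(inplace=True),
    nn.Flatten(),                            # 4*4*128 = 2048 features
    nn.Linear(128 * 4 * 4, 1024),            # fully connected layer with 1024 units
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.7),                       # 70% ratio of dropped outputs
    nn.Linear(1024, 1000),                   # 1000-way classifier (softmax applied in the loss)
)
```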

Training Methodology

Trained with the DistBelief distributed machine learning system, CPU based.

ILSVRC 2014 Classification Challenge Setup and Results

[Table: ILSVRC 2014 classification results]

  • Independently train 7 versions of the same GoogLeNet model (including one wider version). They differ only in sampling methodologies and the random order in which they see input images.
  • A more aggressive cropping approach than AlexNet: resize the image to 4 scales (shorter dimension: 256, 288, 320, 352) and take the left, center, and right squares of each (for portrait images: top, center, bottom). For each square, take the 4 corners and the center 224x224 crop, as well as the square resized to 224x224, and their mirrored versions. This gives 4x3x6x2 = 144 crops per image (see the sketch after this list).
  • Softmax probabilities are averaged over the multiple crops and over all the individual classifiers to obtain the final prediction.
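
A small sketch of the crop arithmetic in the second bullet (my own enumeration; the labels are illustrative, not the paper's code):

```python
from itertools import product

# 4 scales x 3 squares x 6 crops per square x 2 mirrorings = 144 crops per image.
scales = [256, 288, 320, 352]                  # shorter image dimension after resizing
squares = ["left", "center", "right"]          # (top/center/bottom for portrait images)
crops = ["top-left", "top-right", "bottom-left", "bottom-right", "center", "full-resized"]
mirrors = [False, True]

specs = list(product(scales, squares, crops, mirrors))
print(len(specs))  # 144
```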

ILSVRC 2014 Detection Challenge Setup and Results

[Table: ILSVRC 2014 detection results]

Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50 percent (intersection over union).

Extraneous detections count as false positives.
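
A small sketch (my own, not from the paper) of the 50% overlap criterion, using intersection over union on axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as correct when the class matches and iou(pred, gt) >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```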

The approach taken by GoogLeNet for detection is similar to the R-CNN but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search [20] approach with multi-box predictions for higher object bounding box recall.

Superpixel size was increased by 2x to cut down the number of false positives.

Ensemble of 6 ConvNets when classifying.

No bounding box regression.

No use of external data.

Conclusion

I think the most important idea of the Inception architecture is:
Approximating the expected optimal sparse structure by readily available dense building blocks.
