READING NOTE: Rethinking the Inception Architecture for Computer Vision


TITLE: Rethinking the Inception Architecture for Computer Vision

AUTHORS: Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna

ASSOCIATION: Google Inc., University College London

FROM: arXiv:1512.00567

CONTRIBUTIONS

  1. Several general and specific design principles are discussed

Design Choices

General Design Principles

  1. Avoid representational bottlenecks, especially early in the network. One should avoid bottlenecks with extreme compression. In general the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand.
  2. Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.
  3. Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power.
  4. Balance the width and depth of the network.

Specific Design Strategy

  1. Factorizing Convolutions with Large Filter Size includes Factorization into smaller convolutions and Spatial Factorization into Asymmetric Convolutions. Both reduce computational cost without reducing the expressiveness of the learnt function (sketches of both are given after this list).
  2. Utility of Auxiliary Classifiers: the auxiliary classifiers act as regularizers rather than helping the low-level features evolve. Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher plateau (a minimal branch is sketched below).
  3. Efficient Grid Size Reduction reduces the computational cost while removing the representational bottleneck (see the last sketch after this list).
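
A minimal sketch of the two factorizations, written in PyTorch with illustrative module names (Factorized5x5, Asymmetric7x7) that are my own, not from the paper: two stacked 3x3 convolutions cover the same receptive field as one 5x5, and a 1x7 followed by a 7x1 replaces a 7x7.

```python
import torch
import torch.nn as nn

class Factorized5x5(nn.Module):
    """Two stacked 3x3 convolutions replacing a single 5x5 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class Asymmetric7x7(nn.Module):
    """A 7x7 convolution factorized into a 1x7 followed by a 7x1."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, 7), padding=(0, 3)),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=(7, 1), padding=(3, 0)),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

x = torch.randn(1, 64, 35, 35)
print(Factorized5x5(64, 96)(x).shape)  # torch.Size([1, 96, 35, 35])
print(Asymmetric7x7(64, 96)(x).shape)  # torch.Size([1, 96, 35, 35])
```

Roughly, two 3x3 layers cost 2x9 = 18 weights per input-output channel pair versus 25 for a 5x5, and the 1x7/7x1 pair costs 14 versus 49 for a 7x7, which is where the speed-up comes from.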
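
For the auxiliary classifier, here is a minimal sketch of a side branch attached to an intermediate feature map. The layout (pooling, 1x1 convolution, fully connected layer) follows the general pattern described in the paper, but the channel sizes and the loss weight in the comment are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Side-branch classifier attached to an intermediate feature map."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(5)                # shrink the grid to 5x5
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)   # channel reduction (128 is an assumption)
        self.fc = nn.Linear(128 * 5 * 5, num_classes)

    def forward(self, x):
        x = self.conv(self.pool(x))
        return self.fc(torch.flatten(x, 1))

# During training the auxiliary loss is added to the main loss with a small
# weight (0.3 is a common choice, not taken from this note):
# total_loss = main_loss + 0.3 * criterion(aux_logits, targets)
```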
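
Finally, a minimal sketch of the efficient grid size reduction: a stride-2 convolution branch and a stride-2 pooling branch run in parallel and their outputs are concatenated, so the grid is halved and the channel count expanded without first squeezing the representation through a pooled bottleneck. Names and channel counts are illustrative.

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Parallel stride-2 convolution and pooling branches, concatenated."""
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        # Convolution branch: halves the spatial grid while adding capacity.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
        )
        # Pooling branch: halves the grid with no extra parameters.
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x):
        # Concatenate along channels: conv_ch + in_ch output channels.
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

x = torch.randn(1, 320, 35, 35)
print(GridReduction(320, 320)(x).shape)  # torch.Size([1, 640, 17, 17])
```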

Some Other Ideas

A very interesting experiment in this paper is worth noting: with different receptive field (input resolution) sizes, the networks achieve similar results as long as the computational cost is kept constant.

In my own trials with SSD, I found that networks of similar computational cost but different receptive field sizes give very different results on the detection task. For example, Network A has a receptive field of 112x112, while Network B has 170x170. Network B performs slightly better than Network A on the classification task. On the contrary, after the two networks are fine-tuned on 200x200 images for the detection task, Network A is better. Thus, what if we trained a network with a receptive field of, let's say, 56x56 and fine-tuned it on 100x100 images? Would it achieve a comparable result?
