[Deep Learning Paper Notes][Visualizing] Visualizing and Understanding Convolutional Networks


Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” European Conference on Computer Vision. Springer International Publishing, 2014.(Citations: 1207).


Occlusion Experiments

Idea Occlude portions of the input image to reveal which parts of the scene are important for classification.


Method Occlude different portions of the input image with a grey square and monitor the classifier's output probability for the correct class, plotted as a function of the position of the grey square in the original image.


Result See Fig. 4.1. It can be seen that the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. In the third image, if we occlude the person’s head, the probability of the correct class goes up.
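The scan itself is simple to sketch in numpy. The patch size, stride, grey fill value, and the toy two-class "classifier" below are my own illustrative assumptions, not the paper's setup:

```python
import numpy as np

def occlusion_map(image, classify, true_class, patch=8, stride=8, fill=0.5):
    """Slide a grey square over the image and record the classifier's
    probability for the true class at each occluder position."""
    H, W = image.shape[:2]
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch] = fill  # grey square
            heat[i, j] = classify(occluded)[true_class]
    return heat

# Toy stand-in classifier: confidence is just the mean brightness of the
# top-left "object" region, so hiding that region lowers the score.
def toy_classify(img):
    score = img[:16, :16].mean()
    return np.array([1 - score, score])  # two-class probabilities

img = np.zeros((32, 32))
img[:16, :16] = 1.0                      # bright "object" in the top-left
heat = occlusion_map(img, toy_classify, true_class=1)
```

Low values in `heat` mark occluder positions that hurt the correct class, i.e. image regions the classifier relies on.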


Deconv Approach

DeconvNet

For the relu layer, the forward pass is a = max(z, 0).


The backward pass is where the deconvnet differs from standard backprop: backprop gates the signal by the forward activation mask, R_out = R_in · 1[z > 0], whereas the deconvnet applies the relu to the backward signal itself, R_out = max(R_in, 0), keeping the reconstruction non-negative regardless of the forward pattern.
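The two relu backward rules can be contrasted in a small numpy sketch (my own illustration; the function names are not from the paper):

```python
import numpy as np

def relu_forward(z):
    # Forward pass: rectify the pre-activation.
    return np.maximum(z, 0)

def relu_backprop(grad, z):
    # Standard backprop: pass the signal only where the forward input was positive.
    return grad * (z > 0)

def relu_deconv(grad):
    # Deconvnet rule: rectify the backward signal itself, ignoring the forward mask.
    return np.maximum(grad, 0)

z = np.array([-1.0, 2.0, 3.0])   # forward pre-activations
g = np.array([5.0, -4.0, 1.0])   # signal arriving from the layer above
```

Here `relu_backprop` zeroes the first component (forward input was negative), while `relu_deconv` zeroes the second (backward signal was negative).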

Method For each layer, randomly select a subset of feature maps. For each feature map, find the top 9 neurons that have the highest activations. Projecting each separately down to pixel space with the deconvnet reveals the different structures that excite a given feature map.
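Finding the strongest activations of one feature map across a batch reduces to a top-k search; a numpy sketch (shapes and names are illustrative assumptions):

```python
import numpy as np

def top_k_activations(fmap, k=9):
    """Given one feature map's responses over a batch (N, H, W), return the
    (image, y, x) coordinates of the k strongest activations, strongest first."""
    flat = fmap.reshape(-1)
    idx = np.argsort(flat)[::-1][:k]  # indices of the k largest values
    return [np.unravel_index(i, fmap.shape) for i in idx]

rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 8, 8))  # fake responses: 4 images, 8x8 map
coords = top_k_activations(fmap, k=9)
```

Each returned coordinate would then seed a separate deconvnet projection back to pixel space.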



Result See Fig. 4.2, 4.3, and 4.4. Alongside these visualizations, the corresponding image patches are shown.

• The strong grouping within each feature map.
• Hierarchical nature of the features in the network (layer 2: corners and other edge/color conjunctions; layer 3: textures, mesh patterns (r1, c1), and text (r2, c4); layer 4: more class-specific, like dog faces (r1, c1) and bird’s legs (r4, c2); layer 5: entire objects, like keyboards (r1, c11) and dogs (r4)). 

• Greater invariance at higher layers.

• Exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, r1, c1).





Feature Evolution During Training The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged. 


Feature Invariance Small transformations have a dramatic effect in the first layer of the model but a lesser impact at the top feature layer, where the response is quasi-linear for translation and scaling. The output is not, however, invariant to rotation.

