CNN Notes


A ConvNet is made up of layers. Every layer has a simple API: it transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters.


We use three main types of layers to build ConvNet architectures: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer.


In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and others don't. In particular, the CONV/FC layers perform transformations that are a function not only of the activations in the input volume, but also of the parameters (the weights and biases of the neurons).


A ConvNet architecture is a list of layers that transform the image volume into an output volume (e.g. holding the class scores).

There are a few distinct types of layers (e.g. CONV/FC/RELU/POOL are by far the most popular).

Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function.

Each layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't).

Each layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't).


Convolutional Layer

The Conv layer is the core building block of a Convolutional Network, and its output volume can be interpreted as holding neurons arranged in a 3D volume.


(W-F+2P)/S+1 must be an integer; otherwise the strides are set incorrectly and the neurons cannot be tiled neatly and symmetrically across the input.


Real-world example: The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227*227*3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227-11)/4+1=55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55*55*96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11*11*3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11*11*3] region of the input, but of course with different weights.
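As a quick sanity check on this arithmetic, the output size formula can be wrapped in a small Python helper (a sketch; the function name is mine, not from the notes):

  # Spatial output size of a Conv layer; fails if the filters don't tile the input.
  def conv_output_size(W, F, S, P):
      out = (W - F + 2 * P) / S + 1
      assert out == int(out), "hyperparameters do not tile the input"
      return int(out)

  conv_output_size(227, F=11, S=4, P=0)   # -> 55, as in the Krizhevsky et al. example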


Parameter Sharing: 

Parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias. Together, these add up to 290,400 * 364 = 105,705,600 parameters in the first layer of the ConvNet alone.


It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: that if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a depth slice (e.g. a volume of size [55*55*96] has 96 depth slices, each of size [55*55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Equivalently, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.
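Both counts can be verified directly (plain arithmetic in Python, mirroring the numbers above):

  # Without parameter sharing: every neuron has its own 11*11*3 weights and 1 bias.
  neurons = 55 * 55 * 96                             # 290,400 neurons
  params_unshared = neurons * (11 * 11 * 3 + 1)      # 290,400 * 364 = 105,705,600

  # With parameter sharing: one weight set and one bias per depth slice.
  params_shared = 96 * (11 * 11 * 3) + 96            # 34,848 weights + 96 biases = 34,944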


Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can, in each depth slice, be computed as a convolution of the neuron's weights with the input volume (hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a filter (or a kernel), which is convolved with the input. The result of this convolution is an activation map (e.g. of size [55*55]), and the activation maps for the different filters are stacked together along the depth dimension to produce the output volume (e.g. [55*55*96]).



[Figure] Example filters learned by Krizhevsky et al. Each of the 96 filters shown is of size [11*11*3], and each one is shared by the 55*55 neurons in one depth slice.


Numpy examples: Suppose that the input volume is a numpy array X. Then:

  A depth column at position (x,y) would be the activations X[x,y,:].

  A depth slice, or equivalently an activation map, at depth d would be the activations X[:,:,d].
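Concretely (a minimal sketch using the example volume from above):

  import numpy as np

  X = np.random.randn(55, 55, 96)   # an example volume: height x width x depth
  depth_column = X[10, 20, :]       # all 96 activations at spatial position (10, 20)
  activation_map = X[:, :, 3]       # the 55x55 depth slice at depth d = 3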



To summarize, the Conv Layer:

Accepts a volume of size W1*H1*D1

Requires four hyperparameters:

the number of filters K,

their spatial extent F,

the stride S,

the amount of zero padding P.

Produces a volume of size W2*H2*D2 where:

W2 = (W1-F+2P)/S+1

H2 = (H1-F+2P)/S+1

D2 = K


With parameter sharing, it introduces F*F*D1 weights per filter, for a total of (F*F*D1)*K weights and K biases.

In the output volume, the d-th depth slice (of size W2*H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offsetting by the d-th bias.


A common setting of the hyperparameters is F=3, S=1, P=1; there are common conventions and rules of thumb that motivate these choices.


Convolution Demo: The input volume is of size W1=5, H1=5, D1=3, and the CONV layer parameters are K=2, F=3, S=2, P=1. That is, we have two filters of spatial size 3*3 (each extending through the full input depth, so 3*3*3), and they are applied with a stride of 2. By the formulas above, the output volume has spatial size (5-3+2*1)/2+1 = 3.
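The demo's forward pass can be written directly from the definitions (a sketch with randomly initialized filters; a naive loop for clarity, not an efficient implementation):

  import numpy as np

  W1, H1, D1 = 5, 5, 3
  K, F, S, P = 2, 3, 2, 1
  W2 = (W1 - F + 2 * P) // S + 1                     # = 3
  H2 = (H1 - F + 2 * P) // S + 1                     # = 3

  X = np.random.randn(H1, W1, D1)
  Xp = np.pad(X, ((P, P), (P, P), (0, 0)), mode='constant')  # zero-pad height/width only
  Wf = np.random.randn(K, F, F, D1)                  # K filters of size F*F*D1
  b = np.zeros(K)

  out = np.zeros((H2, W2, K))
  for d in range(K):                                 # one activation map per filter
      for i in range(H2):
          for j in range(W2):
              patch = Xp[i*S:i*S+F, j*S:j*S+F, :]
              out[i, j, d] = np.sum(patch * Wf[d]) + b[d]
  # out.shape == (3, 3, 2), matching the formulas above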


Implementation as Matrix Multiplication:

Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows:


The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227*227*3] and it is to be convolved with 11*11*3 filters at stride 4, then we would take [11*11*3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process over the input at a stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363*3025], where every column is a stretched-out receptive field and there are 55*55 = 3025 of them in total. Notice that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.


The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11*11*3], this would give a matrix W_row of size [96*363].


The result of a convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96*3025], giving the dot product of each filter at each location.
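A minimal im2col sketch along these lines (loop-based for clarity; real implementations use strided views):

  import numpy as np

  def im2col(X, F, S):
      # X: input volume of shape (H, W, D) -> matrix of shape (F*F*D, n_locations)
      H, W, D = X.shape
      out_h = (H - F) // S + 1
      out_w = (W - F) // S + 1
      cols = np.zeros((F * F * D, out_h * out_w))
      for i in range(out_h):
          for j in range(out_w):
              patch = X[i*S:i*S+F, j*S:j*S+F, :]    # one receptive field
              cols[:, i * out_w + j] = patch.reshape(-1)
      return cols

  X = np.random.randn(227, 227, 3)
  X_col = im2col(X, F=11, S=4)                  # shape (363, 3025)
  W_row = np.random.randn(96, 11 * 11 * 3)      # 96 filters stretched into rows
  out = np.dot(W_row, X_col)                    # shape (96, 3025)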



Pooling Layer:

It is common to periodically insert a Pooling Layer in between successive Conv Layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation, to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The depth dimension remains unchanged. More generally, the pooling layer:

Accepts a volume of size W1*H1*D1

Requires two hyperparameters:

their spatial extent F,

the stride S,

Produces a volume of size W2*H2*D2 where:

W2=(W1-F)/S+1

H2=(H1-F)/S+1

D2=D1

Introduces zero parameters since it computes a fixed function of the input

Note that it is not common to use zero-padding for Pooling layers
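A direct numpy sketch of max pooling over each depth slice (with the common setting F=2, S=2):

  import numpy as np

  def max_pool(X, F=2, S=2):
      # X: volume of shape (H, W, D); each depth slice is pooled independently.
      H, W, D = X.shape
      out_h = (H - F) // S + 1
      out_w = (W - F) // S + 1
      out = np.zeros((out_h, out_w, D))
      for i in range(out_h):
          for j in range(out_w):
              out[i, j, :] = X[i*S:i*S+F, j*S:j*S+F, :].max(axis=(0, 1))
      return out

  X = np.random.randn(224, 224, 64)
  max_pool(X).shape                              # (112, 112, 64): depth unchanged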


It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers.


For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except at certain blocks (due to local connectivity), where the weights in many of the blocks are equal (due to parameter sharing).
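The equivalence is easy to check numerically in the other direction as well: an FC layer looking at a whole volume computes the same values as a CONV layer whose filters span the entire input (a sketch with illustrative sizes):

  import numpy as np

  X = np.random.randn(7, 7, 512)                 # e.g. an activation volume
  W_fc = np.random.randn(4096, 7 * 7 * 512)      # FC layer with 4096 neurons

  fc_out = np.dot(W_fc, X.reshape(-1))           # the usual FC forward pass

  # The same weights viewed as 4096 CONV filters of size 7*7*512 (F=7, S=1, P=0):
  W_conv = W_fc.reshape(4096, 7, 7, 512)
  conv_out = np.array([np.sum(X * W_conv[k]) for k in range(4096)])

  np.allclose(fc_out, conv_out)                  # True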


It is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet), and then use the ConvNet either as an initialization or as a fixed feature extractor for the task of interest.


ConvNet as fixed feature extractor: Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer's outputs are the 1000 class scores), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset.


Fine-tuning the ConvNet: The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation.

It is possible to fine-tune all the layers of the ConvNet, or it's possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network.

This is motivated by the observation that the earlier layers of a ConvNet contain more generic features that should be useful for many tasks, while later layers become progressively more specific to the details of the classes contained in the original dataset.
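As a concrete illustration of both strategies, a minimal PyTorch sketch (assuming torchvision's pretrained ResNet-18; the model choice and class count are illustrative, not from the notes):

  import torch
  import torch.nn as nn
  from torchvision import models

  model = models.resnet18(pretrained=True)        # ConvNet pretrained on ImageNet

  # Fixed feature extractor: freeze everything, then replace the classifier head.
  for param in model.parameters():
      param.requires_grad = False
  model.fc = nn.Linear(model.fc.in_features, 10)  # new head for a 10-class task

  # Fine-tuning: leave (some of) the pretrained weights trainable and continue
  # backpropagation, typically with a small learning rate.
  optimizer = torch.optim.SGD(
      [p for p in model.parameters() if p.requires_grad], lr=1e-3)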


=====================================================================================


Here we will present the implementation details (tricks or tips) of Deep Neural Networks, especially CNNs for image-related tasks, mainly in eight aspects: 1) data augmentation; 2) pre-processing of images; 3) initialization of networks; 4) some tips during training; 5) selection of activation functions; 6) diverse regularizations; 7) some insights found from figures; and finally 8) methods for ensembling multiple deep networks.


