Convolutional Networks (Key Points)


Contents:

  • Convolutional Networks
    • The Convolution Operation
    • Motivation
      • Sparse interactions
      • Parameter sharing
      • Equivariant representations
      • The components of a typical convolutional neural network layer
    • Pooling
    • Convolution and Pooling as an Infinitely Strong Prior
    • Variants of the Basic Convolution Function
      • Locally connected layers
      • Tiled convolution
      • The two dimensional single example
    • Data Types
    • Efficient Convolution Algorithms
    • Random or Unsupervised Features


Convolutional Networks

Source

@unpublished{Bengio-et-al-2015-Book,
  title  = {Deep Learning},
  author = {Yoshua Bengio and Ian J. Goodfellow and Aaron Courville},
  note   = {Book in preparation for MIT Press},
  url    = {http://www.iro.umontreal.ca/~bengioy/dlbook},
  year   = {2015}
}
  • A specialized kind of neural network for processing data that has a known, grid-like topology.
    • Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels.
  • Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

The Convolution Operation

  • The convolution operation is typically denoted with an asterisk:
    • $s(t) = (x * w)(t) = \int x(a)\, w(t - a)\, da$
    • $s[n] = (x * w)[n] = \sum_{a=-\infty}^{\infty} x[a]\, w[n - a]$
    • w needs to be 0 for all negative arguments, or it will look into the future…
  • In convolutional network terminology, the first argument (the function x) to the convolution is often referred to as the input and the second argument (the function w) as the kernel. The output is sometimes referred to as the feature map.
  • If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:
    • $s[i, j] = (I * K)[i, j] = \sum_m \sum_n I[i - m, j - n]\, K[m, n]$
  • Many machine learning libraries implement cross-correlation but call it convolution:
    • $s[i, j] = (I * K)[i, j] = \sum_m \sum_n I[i + m, j + n]\, K[m, n]$
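
A minimal NumPy sketch of the two operations on a 2-D input ("valid" outputs only, no padding); the helper names and test arrays here are illustrative, not from the text:

```python
import numpy as np

def cross_correlate2d(I, K):
    """'Valid' cross-correlation: slide K over I without flipping it."""
    H, W = I.shape
    h, w = K.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + h, j:j + w] * K)   # sum_m sum_n I[i+m, j+n] K[m, n]
    return out

def convolve2d(I, K):
    """True convolution: cross-correlation with the kernel flipped in both axes."""
    return cross_correlate2d(I, np.flip(K))

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(I, K))   # what most libraries call "convolution"
print(convolve2d(I, K))          # the textbook definition, with the flipped kernel
```

For this particular kernel, flipping it only changes its sign, so the two outputs differ only in sign; for a general kernel they differ more substantially.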

Motivation

  • Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations.

Sparse interactions

  • This is accomplished by making the kernel smaller than the input.
  • If there are m inputs and n outputs, then matrix multiplication requires m×n parameters and the algorithms used in practice have O(m×n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k×n parameters and O(k×n) runtime. (A small worked example follows this list.)
  • In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input.
  • This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.
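
As a quick worked illustration of the m×n versus k×n counts above (all sizes here are made up):

```python
m = n = 320 * 280          # assume the input and the output each have 320*280 units
k = 3 * 3                  # each output connects to a 3x3 neighborhood only
print(m * n)               # fully connected: 8,028,160,000 parameters
print(k * n)               # sparsely connected: 806,400 parameters
```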

Parameter sharing

  • Using the same parameter for more than one function in a model.
  • As a synonym for parameter sharing, one can say that a network has tied weights.
  • In a convolutional neural net, each member of the kernel is used at every position of the input.
  • Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small, local region across the entire input.
  • In some cases, we may not wish to share parameters across the entire image.
    • For example, if we are processing images that are cropped to be centered on an individual’s face, we probably want to extract different features at different locations: the part of the network processing the top of the face needs to look for eyebrows, while the part processing the bottom of the face needs to look for a chin.

Equivariant representations

  • If we apply a translation to x and then apply convolution, the result is the same as if we applied convolution to x and then translated the output.
  • If we move the object in the input, its representation will move the same amount in the output. Convolution is not equivariant to some other transformations, such as changes in the scale or rotation of an image.
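
A quick NumPy check of equivariance to translation in 1-D; the shift amount and smoothing kernel are arbitrary, and the comparison skips the borders, where zero padding and the circular roll disagree:

```python
import numpy as np

x = np.random.randn(50)
w = np.array([0.25, 0.5, 0.25])          # a small smoothing kernel
shift = 3

a = np.convolve(np.roll(x, shift), w, mode='same')    # translate the input, then convolve
b = np.roll(np.convolve(x, w, mode='same'), shift)    # convolve, then translate the output

interior = slice(shift + len(w), -(shift + len(w)))
print(np.allclose(a[interior], b[interior]))          # True: the feature map moved by the same amount
```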

The components of a typical convolutional neural network layer

(Figure: the components of a typical convolutional neural network layer, shown with the two sets of terminology described below.)

  • There are two commonly used sets of terminology for describing these layers.
    • Left) In this terminology, the convolutional net is viewed as a small number of relatively complex layers, with each layer having many “stages.” In this terminology, there is a one-to-one mapping between kernel tensors and network layers. In this book we generally use this terminology.
    • Right) In this terminology, the convolutional net is viewed as a larger number of simple layers; every step of processing is regarded as a layer in its own right. This means that not every “layer” has parameters.

Pooling

  • A typical layer of a convolutional network consists of three stages.
    • In the first stage, the layer performs several convolutions in parallel to produce a set of presynaptic activations.
    • In the second stage, each presynaptic activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage.
    • In the third stage, we use a pooling function to modify the output of the layer further.
  • A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
    • For example, the max pooling operation reports the maximum output within a rectangular neighborhood.
  • Pooling helps to make the representation become invariant to small translations of the input.
    • Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.
  • The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations.
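
A small max-pooling sketch in 1-D (window of 3, stride 1; the detector values are made up), showing that after shifting the detector outputs by one position, several pooled values stay the same:

```python
import numpy as np

def max_pool1d(x, width=3):
    """Replace each position with the max over a neighborhood of `width` (stride 1)."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
shifted  = np.roll(detector, 1)          # every detector output moves right by one

print(max_pool1d(detector))              # [1.  1.  0.2 0.1]
print(max_pool1d(shifted))               # [1.  1.  1.  0.2] -> the first two pooled values are unchanged
```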

Convolution and Pooling as an Infinitely Strong Prior

  • An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values.
  • We can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer.
  • This prior says that the function the convolution layer should learn contains only local interactions and is equivariant to translation.
  • Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.
  • One key insight is that convolution and pooling can cause underfitting.
    • When a task involves incorporating information from very distant locations in the input, then the prior imposed by convolution may be inappropriate.
    • If a task relies on preserving precise spatial information, then using pooling on all features can cause underfitting.

Variants of the Basic Convolution Function

  • When we refer to convolution in the context of neural networks, we usually actually mean an operation that consists of many applications of convolution in parallel.
    • Usually we want each layer of our network to extract many kinds of features, at many locations.
  • The input is usually not just a grid of real values. Rather, it is a grid of vector-valued observations.
    • For example, a color image has a red, green and blue intensity at each pixel. When working with images, we usually think of the input and output of the convolution as being 3-D tensors.
  • Multi-channel operations are only commutative if each operation has the same number of output channels as input channels.

  • If Z is produced by convolving K across V without flipping K, then
    • $Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m,\, k+n}\, K_{i,l,m,n}$
  • If we want to sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that:
    • $Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j \times s + m,\, k \times s + n}\, K_{i,l,m,n} \right]$
    • We refer to s as the stride of this downsampled convolution. (A NumPy sketch of this operation, with stride and zero padding, follows this list.)
  • Zero padding the input allows us to control the kernel width and the size of the output independently.

    • Without zero padding, we are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels, both of which significantly limit the expressive power of the network.
    • See Fig for an example…
  • Three special cases of the zero-padding setting are worth mentioning.

    • One is the extreme case in which no zero-padding is used whatsoever. In MATLAB terminology, this is called valid convolution.
    • Another special case of the zero-padding setting is when just enough zero-padding is added to keep the size of the output equal to the size of the input. MATLAB calls this same convolution.
      • The input pixels near the border influence fewer output pixels than the input pixels near the center.
    • The other extreme case, which MATLAB refers to as full convolution, adds enough zeroes for every input pixel to be visited k times in each direction, resulting in an output image of size $(m + k - 1) \times (m + k - 1)$.
  • Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between “valid” and “same” convolution.
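
A plain NumPy sketch of the multi-channel, strided, zero-padded operation described in this list, following $Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j \times s + m,\, k \times s + n}\, K_{i,l,m,n}$ with 0-based indices; the `pad` argument and the tensor layouts are assumptions for illustration, not the text's notation:

```python
import numpy as np

def conv_forward(V, K, stride=1, pad=0):
    """V: input, shape (in_channels, H, W).
       K: kernel stack, shape (out_channels, in_channels, kh, kw).
       Returns Z with shape (out_channels, out_h, out_w)."""
    C_in, H, W = V.shape
    C_out, _, kh, kw = K.shape
    Vp = np.pad(V, ((0, 0), (pad, pad), (pad, pad)))            # zero padding on the spatial axes only
    out_h = (H + 2 * pad - kh) // stride + 1
    out_w = (W + 2 * pad - kw) // stride + 1
    Z = np.zeros((C_out, out_h, out_w))
    for i in range(C_out):
        for j in range(out_h):
            for k in range(out_w):
                patch = Vp[:, j * stride: j * stride + kh, k * stride: k * stride + kw]
                Z[i, j, k] = np.sum(patch * K[i])               # sum over l, m, n (no kernel flip)
    return Z

V = np.random.randn(3, 8, 8)       # a 3-channel 8x8 "image"
K = np.random.randn(4, 3, 3, 3)    # 4 output channels, 3x3 kernels
print(conv_forward(V, K, stride=2, pad=1).shape)   # (4, 4, 4)
```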

Locally connected layers

  • Every connection has its own weight, specified by a 6-D tensor W. The indices into W are respectively: i, the output channel, j, the output row, k,the output column, l, the input channel, m, the row offset within the input, and n, the column offset within the input. The linear part of a locally connected layer is then given by
    • $Z_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j+m,\, k+n}\, w_{i,j,k,l,m,n} \right]$
  • This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel, but without sharing parameters across locations.
  • Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space.
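
A sketch of the locally connected (unshared) linear operation from the formula above; the 6-D weight tensor's shape follows the index order (i, j, k, l, m, n) described in the text, and the sizes are illustrative:

```python
import numpy as np

def locally_connected(V, W):
    """V: input, shape (in_channels, H, W_in).
       W: unshared weights, shape (out_channels, out_h, out_w, in_channels, kh, kw)."""
    C_out, out_h, out_w, C_in, kh, kw = W.shape
    Z = np.zeros((C_out, out_h, out_w))
    for i in range(C_out):
        for j in range(out_h):
            for k in range(out_w):
                patch = V[:, j:j + kh, k:k + kw]
                Z[i, j, k] = np.sum(patch * W[i, j, k])   # a different weight block at every output location
    return Z

V = np.random.randn(3, 8, 8)
W = np.random.randn(4, 6, 6, 3, 3, 3)   # one 3x3x3 weight block per output unit
print(locally_connected(V, W).shape)    # (4, 6, 6)
```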

Tiled convolution

  • Tiled convolution offers a compromise between a convolutional layer and a locally connected layer.
  • We learn a set of kernels that we rotate through as we move through space. This means that immediately neighboring locations will have different filters, like in a locally connected layer.
  • To define tiled convolution algebraically, let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction.

    • $Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m,\, k+n}\, K_{i,l,m,n,\, j\%t,\, k\%t}$
    • If t is equal to the output width, this is the same as a locally connected layer. (A sketch of this indexing follows this list.)
  • It turns out that these three operations (convolution, backprop from output to weights, and backprop from output to inputs) are sufficient to compute all of the gradients needed to train any depth of feedforward convolutional network.
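
A sketch of tiled convolution following the formula above, where the kernel used at output location (j, k) is selected by (j % t, k % t); the tensor shapes are assumptions chosen to match the index order in the equation:

```python
import numpy as np

def tiled_conv(V, K, t):
    """V: input, shape (in_channels, H, W).
       K: kernel stacks, shape (out_channels, in_channels, kh, kw, t, t);
       the kernel used at output location (j, k) is K[..., j % t, k % t]."""
    C_out, C_in, kh, kw, _, _ = K.shape
    _, H, W = V.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    Z = np.zeros((C_out, out_h, out_w))
    for i in range(C_out):
        for j in range(out_h):
            for k in range(out_w):
                kern = K[i, :, :, :, j % t, k % t]        # cycle through the t x t kernel choices
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * kern)
    return Z

V = np.random.randn(3, 8, 8)
K = np.random.randn(4, 3, 3, 3, 2, 2)   # t = 2: immediately neighboring outputs use different kernels
print(tiled_conv(V, K, t=2).shape)      # (4, 6, 6)
```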

The two dimensional, single example

  • Suppose we want to train a convolutional network that incorporates the strided convolution of a kernel stack K applied to a multi-channel image V with stride s, defined by c(K, V, s) as above.
  • During forward propagation, we will need to use c itself to output Z, which is then propagated through the rest of the network and used to compute J.
    • $Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j \times s + m,\, k \times s + n}\, K_{i,l,m,n} \right]$
  • During backpropagation, we will receive a tensor G such that:
    • $G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K)$.
  • To train the network, we need to compute the derivatives with respect to the weights in the kernel:
    • $g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n}\, V_{j,\, m \times s + k,\, n \times s + l}$.
  • If this layer is not the bottom layer of the network, we’ll need to compute the gradient with respect to V in order to backpropagate the error farther down:

    • $h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K)$
    • $\frac{\partial}{\partial V_{i,j,k}} J(V, K) = \sum_{l,m \,\mid\, s \times l + m = j} \; \sum_{n,p \,\mid\, s \times n + p = k} \; \sum_q K_{q,i,m,p}\, G_{q,l,n}$ (a sketch of both gradient computations follows this list)
  • We generally add some bias term to each output before applying the non-linearity.

    • For convolutional layers, it is typical to have one bias per channel of the output and share it across all locations within each convolution map.
    • If the input is of known, fixed size, it is also possible to learn a separate bias at each location of the output map.
    • For example, when using implicit zero padding, detector units at the edge of the image receive less total input and may need larger biases.
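
A sketch of the two gradient computations, g (with respect to the kernel stack) and h (with respect to the input), for the strided convolution defined above, assuming no zero padding and 0-based indices. Rather than evaluating the constrained sums directly, the loop below accumulates each output position's contribution, which is equivalent; the shapes and helper name are illustrative:

```python
import numpy as np

def conv_backward(V, K, G, s):
    """Gradients of J for Z[i,j,k] = sum_{l,m,n} V[l, j*s+m, k*s+n] * K[i,l,m,n].
       V: (in_channels, H, W), K: (out_channels, in_channels, kh, kw),
       G: dJ/dZ with shape (out_channels, out_h, out_w)."""
    C_out, C_in, kh, kw = K.shape
    _, out_h, out_w = G.shape
    dK = np.zeros_like(K)    # plays the role of g(G, V, s)
    dV = np.zeros_like(V)    # plays the role of h(K, G, s)
    for i in range(C_out):
        for j in range(out_h):
            for k in range(out_w):
                patch = V[:, j * s: j * s + kh, k * s: k * s + kw]
                dK[i] += G[i, j, k] * patch                                       # accumulate over output positions
                dV[:, j * s: j * s + kh, k * s: k * s + kw] += G[i, j, k] * K[i]  # scatter back onto the input
    return dK, dV

V = np.random.randn(3, 8, 8)
K = np.random.randn(4, 3, 3, 3)
G = np.ones((4, 3, 3))       # with stride 2 and no padding, the output is 3 x 3 spatially
dK, dV = conv_backward(V, K, G, s=2)
print(dK.shape, dV.shape)    # (4, 3, 3, 3) (3, 8, 8)
```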

Data Types

  • Examples of different formats of data that can be used with convolutional networks:

    • Single channel:
      • 1-D => Audio waveform: The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.
      • 2-D => Audio data that has been pre-processed with a Fourier transform: We can transform the audio waveform into a 2D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation, but at a different height in the network’s output.
      • 3-D => Volumetric data: A common source of this kind of data is medical imaging technology,such as CT scans.
    • Multi-channel:
      • 1-D => Skeleton animation data: Animations of 3-D computer-rendered characters are generated by altering the pose of a “skeleton” over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character’s skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.
      • 2-D => Color image data: One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions.
      • 3-D => Color video data: One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.
  • One advantage to convolutional networks is that they can also process inputs with varying spatial extents.

    • For example, suppose we want to assign a single class label to the entire image. In this case we must make some additional design choices, such as inserting a pooling layer whose pooling regions scale in size proportionally to the size of the input, in order to maintain a fixed number of pooled outputs (see the sketch after this list).
  • The use of convolution for processing variable sized inputs only makes sense for inputs that have variable size because they contain varying amounts of observation of the same kind of thing: different lengths of recordings over time, different widths of observations over space, and so on.
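
One way to realize the pooling-region scaling mentioned above is to split each spatial axis into a fixed number of regions whose size grows with the input; a sketch (the 4 x 4 output size and the function name are arbitrary choices):

```python
import numpy as np

def adaptive_max_pool(feature_map, out_h=4, out_w=4):
    """feature_map: (channels, H, W) with arbitrary H, W.
       Splits each spatial axis into out_h (out_w) roughly equal regions and takes the
       max of each region, so the output is always (channels, out_h, out_w)."""
    C, H, W = feature_map.shape
    rows = np.array_split(np.arange(H), out_h)
    cols = np.array_split(np.arange(W), out_w)
    out = np.zeros((C, out_h, out_w))
    for a, r in enumerate(rows):
        for b, c in enumerate(cols):
            out[:, a, b] = feature_map[:, r[0]:r[-1] + 1, c[0]:c[-1] + 1].max(axis=(1, 2))
    return out

print(adaptive_max_pool(np.random.randn(8, 21, 17)).shape)   # (8, 4, 4)
print(adaptive_max_pool(np.random.randn(8, 50, 64)).shape)   # (8, 4, 4), regardless of input size
```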

Efficient Convolution Algorithms

  • In many cases it is possible to speed up convolution by selecting an appropriate convolution algorithm.
  • Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform.
  • When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. Convolving with it is equivalent to composing d one-dimensional convolutions, one with each of these vectors.
    • If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires $O(w^d)$ runtime and parameter storage space, while separable convolution requires $O(w \times d)$ runtime and parameter storage space.
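
A quick NumPy check that a separable kernel (the outer product of two vectors) can be applied as two 1-D passes with the same result as a single 2-D pass; the vectors and input size are arbitrary:

```python
import numpy as np

def xcorr2d_valid(I, K):
    """'Valid' 2-D cross-correlation, written naively."""
    h, w = K.shape
    return np.array([[np.sum(I[i:i + h, j:j + w] * K)
                      for j in range(I.shape[1] - w + 1)]
                     for i in range(I.shape[0] - h + 1)])

r = np.array([1.0, 2.0, 1.0])          # row vector
c = np.array([1.0, 0.0, -1.0])         # column vector
K = np.outer(c, r)                     # separable 3x3 kernel: w^d = 9 weights

I = np.random.randn(10, 12)
direct = xcorr2d_valid(I, K)

# Two 1-D passes (w * d = 6 weights): first along each row with r, then along each column with c.
rows = np.array([np.correlate(row, r, mode='valid') for row in I])
separable = np.array([np.correlate(col, c, mode='valid') for col in rows.T]).T

print(np.allclose(direct, separable))  # True
```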

Random or Unsupervised Features

  • There are two basic strategies for obtaining convolution kernels without supervised training.

    • One is to simply initialize them randomly.
    • The other is to learn them with an unsupervised criterion.
  • Random filters often work surprisingly well in convolutional networks.

    • Layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights.
  • An inexpensive way to choose the architecture of a convolutional network:

    • First evaluate the performance of several convolutional network architectures by training only the last layer.
    • Then take the best of these architectures and train the entire architecture using a more expensive approach.
  • As with multilayer perceptrons, we can use greedy layer-wise unsupervised pretraining: train the first layer in isolation, extract all features with that first layer only once, then train the second layer in isolation given those features, and so on.

    • The canonical example of this is the convolutional deep belief network.
  • It is possible to use unsupervised learning to train a convolutional network without ever using convolution during the training process.

    • Train a small but densely-connected unsupervised model of a single image patch, then use the weight matrices from this patch-based model to define the kernels of a convolutional layer (a sketch of this strategy follows this list).
  • Unsupervised pretraining may offer some regularization relative to supervised training, or it may simply allow us to train much larger architectures due to the reduced computational cost of the learning rule.
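
A minimal sketch of the patch-based strategy: cluster randomly sampled image patches and use the centroids as convolution kernels. The choice of k-means (via sklearn's KMeans), the patch size, and the helper name are assumptions for illustration; the text does not prescribe a specific unsupervised model.

```python
import numpy as np
from sklearn.cluster import KMeans

def patch_kernels(images, n_kernels=8, patch=5, n_patches=2000, seed=0):
    """images: (N, H, W) grayscale images. Returns kernels of shape (n_kernels, patch, patch)."""
    rng = np.random.default_rng(seed)
    N, H, W = images.shape
    samples = np.empty((n_patches, patch * patch))
    for t in range(n_patches):
        n = rng.integers(N)
        i = rng.integers(H - patch + 1)
        j = rng.integers(W - patch + 1)
        samples[t] = images[n, i:i + patch, j:j + patch].ravel()
    samples -= samples.mean(axis=1, keepdims=True)            # simple per-patch normalization
    km = KMeans(n_clusters=n_kernels, n_init=10, random_state=seed).fit(samples)
    return km.cluster_centers_.reshape(n_kernels, patch, patch)

kernels = patch_kernels(np.random.rand(100, 28, 28))
print(kernels.shape)   # (8, 5, 5) -- ready to be used as the kernels of a convolutional layer
```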

