[Read Paper] Improving neural networks by preventing co-adaptation of feature detectors


Title: Improving neural networks by preventing co-adaptation of feature detectors

Authors: G. E. Hinton, N. Srivastava, A. Krizhevsky, et al.

Abstract: When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This “overfitting” is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random “dropout” gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

Full text: http://arxiv.org/pdf/1207.0580.pdf


Note:

Main Point: introduce dropout to reduce overfitting.

Why dropout reduces overfitting:
[1] On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present (see the sketch after this list).
[2] Another way to view the dropout procedure is as a very efficient way of performing model averaging with neural networks.
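A minimal NumPy sketch of point [1]; the layer size, ReLU hidden units, and the 0.5 drop probability are assumed here only for illustration. On every presentation a fresh random mask is drawn, so each hidden unit is omitted independently and cannot count on any specific other unit being present.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_with_dropout(x, W, b, p_drop=0.5):
    """Forward pass through one hidden layer with per-presentation dropout.

    A new mask is sampled on every call, so each hidden unit is omitted
    independently with probability p_drop on this presentation.
    """
    h = np.maximum(0.0, W @ x + b)        # hidden activations (ReLU assumed)
    mask = rng.random(h.shape) >= p_drop  # True = unit kept on this presentation
    return h * mask                       # omitted units contribute nothing downstream

# Two presentations of the same training case pass through different sub-networks.
W, b = rng.normal(size=(6, 4)), np.zeros(6)
x = rng.normal(size=4)
print(hidden_with_dropout(x, W, b))
print(hidden_with_dropout(x, W, b))
```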

A good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. The standard way to do this is to train many separate networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing. Random dropout makes it possible to train a huge number of different networks in a reasonable time. There is almost certainly a different network for each presentation of each training case but all of these networks share the same weights for the hidden units that are present.
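A small sketch of this ensemble view (the hidden-layer size and number of presentations are assumed for illustration): each presentation draws its own 0.5 mask, so the sampled thinned networks are almost certainly all different, while every one of them shares the same underlying weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n_hidden, n_presentations = 20, 1000
# One dropout mask per presentation; the weight matrices themselves are
# shared across all of these sampled sub-networks.
masks = rng.random((n_presentations, n_hidden)) >= 0.5
distinct = {tuple(m) for m in masks}
print(f"{len(distinct)} distinct sub-networks sampled in {n_presentations} presentations, "
      f"out of 2**{n_hidden} possible thinned networks")
```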

In networks with a single hidden layer of N units and a “softmax” output layer for computing the probabilities of the class labels, using the mean network (the full network with every hidden unit present but its outgoing weights halved, to compensate for twice as many units being active at test time) is exactly equivalent to taking the geometric mean of the probability distributions over labels predicted by all 2^N possible networks. Assuming the dropout networks do not all make identical predictions, the mean network is guaranteed to assign a higher log probability to the correct answer than the mean of the log probabilities assigned by the individual dropout networks. Similarly, for regression with linear output units, the squared error of the mean network is always lower than the average of the squared errors of the dropout networks.
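For a single hidden layer this equivalence is easy to check numerically. The sketch below (layer sizes, tanh hidden units, and random weights are all assumed for illustration) enumerates the 2^N dropout networks and confirms that the renormalized geometric mean of their softmax outputs matches the mean network with halved outgoing weights.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy single-hidden-layer softmax network (sizes assumed for illustration).
n_in, n_hidden, n_classes = 5, 4, 3
W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_classes, n_hidden)), rng.normal(size=n_classes)
x = rng.normal(size=n_in)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.tanh(W1 @ x + b1)   # hidden activations, identical for every sub-network

# Geometric mean of the label distributions of all 2^N dropout networks,
# renormalized: exp of the average log-probability, then normalized.
log_p = np.zeros(n_classes)
for mask in itertools.product([0.0, 1.0], repeat=n_hidden):
    log_p += np.log(softmax(W2 @ (np.array(mask) * h) + b2))
geo_mean = softmax(log_p / 2 ** n_hidden)

# "Mean network": all hidden units present, outgoing weights halved.
mean_net = softmax((W2 / 2) @ h + b2)

print(np.allclose(geo_mean, mean_net))   # True: exact equivalence for one hidden layer
```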
