COURSE 2: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization


Week1

Train/Dev/Test sets

  • train set: train model (60% or higher)
  • dev set: hold-out cross validation
  • test set: take the best model

Make sure dev set and test set come from same distribution

Not having a test set might be okay

Bias and Variance

  • high bias: underfitting
  • just right
  • high variance: overfitting

(figure: bias and variance)

When judging whether bias or variance is high, we should compare against the optimal (Bayes) error as a baseline.

Basic “recipe” for machine learning

  • high bias -> bigger network
  • high variance -> more data

(figure: basic recipe for machine learning)

Norm Regularization

One of the first things you should try to solve a high variance problem is probably regularization.

$$\min_{w,b} J(w,b) = \min_{w,b} \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2$$

We always omit the term $\frac{\lambda}{2m}\|b\|^2$

because w is usually a pretty high dimensional parameter vector, especially with a high variance problem.

Different Regularization

  • L2 regularization (most often)

    $\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$

  • L1 regularization (more zeros, more sparse)

    $\|w\|_1 = \sum_{j=1}^{n_x} |w_j|$

  • Frobenius norm regularization (the sum of squares of the elements of a matrix)

    $\|W^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l-1]}} \sum_{j=1}^{n^{[l]}} \left(w_{ij}^{[l]}\right)^2$

Derivatives

$$dW^{[l]} = \frac{1}{m}\,dZ^{[l]} A^{[l-1]T} + \frac{\lambda}{m} W^{[l]}$$

Process

$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = W^{[l]} - \alpha\left(\frac{1}{m}\,dZ^{[l]} A^{[l-1]T} + \frac{\lambda}{m} W^{[l]}\right) = \left(1 - \frac{\alpha\lambda}{m}\right) W^{[l]} - \alpha \cdot \frac{1}{m}\,dZ^{[l]} A^{[l-1]T}$$

L2 regularization is sometimes called weight decay because the coefficient of w is going to be a little bit less than 1.
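As a minimal numpy sketch of this update for one layer (the names `dZ`, `A_prev`, and `lambd` are illustrative, not taken from the course code):

```python
import numpy as np

def update_with_l2(W, dZ, A_prev, alpha=0.01, lambd=0.7):
    """One gradient step for a single layer with L2 (weight decay) regularization."""
    m = A_prev.shape[1]                                    # number of examples in the batch
    dW = (1 / m) * dZ @ A_prev.T + (lambd / m) * W         # gradient of cost plus the L2 term
    # Equivalent "weight decay" view: W is first shrunk by (1 - alpha*lambd/m)
    W = (1 - alpha * lambd / m) * W - alpha * ((1 / m) * dZ @ A_prev.T)
    return W, dW

# Toy usage: a layer with 3 units, 4 inputs, and a batch of 5 examples
W = np.random.randn(3, 4)
dZ = np.random.randn(3, 5)
A_prev = np.random.randn(4, 5)
W, dW = update_with_l2(W, dZ, A_prev)
```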

Why Regularization Reduces Overfitting

If the regularization parameter λ becomes very large, the parameters W become very small, so Z will be relatively small, ignoring the effects of b for now; really, Z takes on a small range of values. If the activation function is tanh, say, it will then be operating in its relatively linear regime. So the whole neural network will be computing something not too far from a big linear function, which is a pretty simple function rather than a very complex, highly non-linear function, and is therefore much less able to overfit.

Note that when plotting the cost function to debug gradient descent, you should plot the new definition of J (including the regularization term); otherwise you might not see it decrease monotonically.

Dropout Regularization

With dropout, what we’re going to do is go through each of the layers of the network and set some probability of eliminating a node in neural network.

Train

Suppose

`d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob`

Then

`a3 = np.multiply(a3, d3)` followed by `a3 /= keep_prob`

Because

$z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}$, where roughly $(1 - \text{keep\_prob})$ of the elements of $a^{[3]}$ (e.g. 20%) have been zeroed out at random,

we need to divide by keep_prob so that the expected value of $z^{[4]}$ is not reduced.

This is inverted dropout.
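A minimal numpy sketch of inverted dropout at training time (the layer index 3 and keep_prob = 0.8 follow the example above; the helper name is made up):

```python
import numpy as np

def inverted_dropout_forward(a3, keep_prob=0.8):
    """Apply inverted dropout to the activations of layer 3 at training time."""
    d3 = np.random.rand(*a3.shape) < keep_prob   # mask: keep each unit with probability keep_prob
    a3 = a3 * d3                                 # zero out the dropped units
    a3 = a3 / keep_prob                          # scale up so the expected value of a3 is unchanged
    return a3, d3

a3 = np.random.randn(5, 10)                      # 5 units, 10 examples (toy values)
a3_train, d3 = inverted_dropout_forward(a3, keep_prob=0.8)
# At test time: no dropout and no scaling, just use a3 as-is.
```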

Test

No dropout.

$$z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}, \quad a^{[1]} = g^{[1]}(z^{[1]}), \quad z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = g^{[2]}(z^{[2]}), \ \ldots$$

Why Does Dropout Work

Cannot rely on any one feature, so have to spread out weights (shrink weights)

Data Augmentation

This can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. And by synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat.

(figure: data augmentation)

Early Stopping

The main downside of early stopping is that it couples two tasks: optimizing the cost function J and avoiding overfitting. So you can no longer work on these two problems independently, because by stopping gradient descent early, you're sort of breaking whatever you're doing to optimize the cost function J; you're no longer doing a great job of reducing it.

(figure: early stopping)

Normalizing Inputs

Subtract the mean:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \quad x := x - \mu$$

Normalize variance:
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\right)^2 \ \text{(element-wise, after subtracting the mean)}, \quad x := \frac{x}{\sqrt{\sigma^2}}$$

And use the same $\mu$ and $\sigma^2$ to normalize the test set.
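A minimal numpy sketch of this, assuming X has shape (n_x, m) with examples in columns (the function names are illustrative):

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + 1e-8   # small constant avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    return (X - mu) / sigma

X_train = np.random.randn(3, 100) * 5 + 2
X_test = np.random.randn(3, 20) * 5 + 2
mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # same mu / sigma as the training set
```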

Why Normalize Inputs

If you normalize the features, then your cost function will on average look more symmetric. If you don't, and you run gradient descent on a very elongated cost function, you might have to use a very small learning rate, because gradient descent may need a lot of steps to oscillate back and forth before it finally finds its way to the minimum. Whereas if you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum.

Vanishing / Exploding Gradients

In a very deep network,

$$\text{if } W^{[l]} > I, \text{ the activations (and gradients) grow exponentially with } L; \quad \text{if } W^{[l]} < I, \text{ they decrease exponentially with } L$$

If your activations or gradients increase or decrease exponentially as a function of L, then these values can get really big or really small, and this makes training difficult. Especially if your gradients are exponentially small in L, gradient descent will take tiny little steps, and it will take a long time to learn anything.

Weight Initialization for Deep Networks

$$W^{[l]} = \texttt{np.random.randn(shape)} \cdot \sqrt{\tfrac{2}{n^{[l-1]}}}, \text{ when } g(z^{[l]}) \text{ is the ReLU function (He initialization)}$$

$$W^{[l]} = \texttt{np.random.randn(shape)} \cdot \sqrt{\tfrac{1}{n^{[l-1]}}}, \text{ when } g(z^{[l]}) \text{ is the tanh function (Xavier initialization)}$$

$$W^{[l]} = \texttt{np.random.randn(shape)} \cdot \sqrt{\tfrac{2}{n^{[l-1]} + n^{[l]}}}, \text{ another variant sometimes used with tanh}$$
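A minimal numpy sketch of these initialization schemes (the helper name and the `layer_dims` argument are illustrative):

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    """He init for ReLU, Xavier init for tanh; layer_dims = [n_x, n_1, ..., n_L]."""
    params = {}
    for l in range(1, len(layer_dims)):
        if activation == "relu":
            scale = np.sqrt(2.0 / layer_dims[l - 1])   # He initialization
        else:
            scale = np.sqrt(1.0 / layer_dims[l - 1])   # Xavier initialization
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_weights([4, 5, 3, 1], activation="relu")
```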

Gradient Checking

for each i

$$d\theta_{\text{approx}}[i] = \frac{J(\theta_1, \theta_2, \ldots, \theta_i + \varepsilon, \theta_{i+1}, \ldots) - J(\theta_1, \theta_2, \ldots, \theta_i - \varepsilon, \theta_{i+1}, \ldots)}{2\varepsilon} \approx d\theta[i]$$

check
$$\frac{\|d\theta_{\text{approx}} - d\theta\|_2}{\|d\theta_{\text{approx}}\|_2 + \|d\theta\|_2} \le 10^{-7}, \text{ where } \varepsilon = 10^{-7}$$
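A minimal numpy sketch of the check, treating all parameters as one flattened vector theta (the function names are illustrative):

```python
import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta against a two-sided numerical estimate of J."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    diff = (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    return diff   # should be around 1e-7 or smaller

# Toy check: J(theta) = sum(theta**2), so dJ/dtheta = 2*theta
theta = np.random.randn(5)
print(gradient_check(lambda t: np.sum(t ** 2), theta, 2 * theta))
```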

Notes

  • Don’t use in training - only to debug
  • If algorithm fails grad check, look at components to try to identify bug
  • Remember regularization
  • Doesn’t work with dropout
  • Run at random initialization

Week2

Mini-Batch Gradient Descent

You split up your training set into smaller, little baby training sets and these baby training sets are called mini-batches.
$$X = [x^{(1)}, x^{(2)}, \ldots, x^{(i)}, \ldots, x^{(m)}], \quad Y = [y^{(1)}, y^{(2)}, \ldots, y^{(i)}, \ldots, y^{(m)}]$$

$$\text{mini-batches: } X = [X^{\{1\}}, X^{\{2\}}, \ldots, X^{\{t\}}, \ldots], \quad Y = [Y^{\{1\}}, Y^{\{2\}}, \ldots, Y^{\{t\}}, \ldots], \text{ where } (X^{\{t\}}, Y^{\{t\}}) \text{ is one mini-batch}$$
(figure: mini-batch gradient descent)

Processing all the mini-batches like this is also called doing one epoch of training, and "epoch" is a word that means a single pass through the training set. Whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is one epoch, allows you to take one step per mini-batch (5,000 steps in the example with 5,000 mini-batches).
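A minimal numpy sketch of building the mini-batches and looping over them for one epoch (the helper name and shapes are illustrative):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of (X, Y) and split them into mini-batches X^{t}, Y^{t}."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

X = np.random.randn(4, 1000)
Y = np.random.randint(0, 2, (1, 1000))
for X_t, Y_t in random_mini_batches(X, Y, batch_size=64):   # one epoch
    pass  # forward prop, compute the cost, backprop, and update parameters on (X_t, Y_t)
```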

Understanding Mini-Batch Gradient Descent

(figure: understanding mini-batch gradient descent)

  • If mini-batch size = m, then batch gradient descent.
    • too long per iteration
  • If mini-batch size = 1, then stochastic gradient descent
    • lose speedup from vectorization
    • more noisy

If the training set is small (m ≤ 2000), just use batch gradient descent. Otherwise, typical mini-batch sizes are 64, 128, 256, 512, …. And make sure each mini-batch fits in CPU/GPU memory.

Exponentially Weighted Averages

There are optimization algorithms that work faster than plain gradient descent, and to build them we first need exponentially weighted averages (also called exponentially weighted moving averages).

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t, \text{ where } v_t \text{ approximately averages the last } \tfrac{1}{1-\beta} \text{ values of } \theta$$

Understanding Exponentially Weighted Averages

$$v_t = (1-\beta)\theta_t + (1-\beta)\beta\,\theta_{t-1} + (1-\beta)\beta^2\,\theta_{t-2} + \ldots + (1-\beta)\beta^{t-1}\theta_1$$

Because $(1-\varepsilon)^{1/\varepsilon} \approx \frac{1}{e}$, the most recent $\frac{1}{1-\beta}$ terms account for more than roughly $\frac{2}{3}$ (i.e. $1 - \frac{1}{e}$) of the total weight.

This is a very efficient way to do so both from computation and memory efficiency point of view which is why it’s used in a lot of machine learning.
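A minimal numpy sketch of computing this running average (toy data; the names are illustrative):

```python
import numpy as np

def exponentially_weighted_average(thetas, beta=0.9):
    """Compute v_t = beta * v_{t-1} + (1 - beta) * theta_t for a sequence of values."""
    v, vs = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        vs.append(v)
    return np.array(vs)

temps = 10 + np.random.randn(365)          # toy daily temperatures
v = exponentially_weighted_average(temps)  # roughly averages the last 1/(1-0.9) = 10 days
```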

Bias Correction

When t is small, $v_t$ is biased toward zero because $v_0 = 0$ and the previous values of $v$ are still very small.

But during the initial phase of learning, when you are still warming up your estimates, bias correction can help you obtain a better estimate:

$$v_t := \frac{\beta v_{t-1} + (1-\beta)\theta_t}{1-\beta^t}$$

In machine learning, for most implementations of the exponential weighted average, people don’t often bother to implement bias corrections. Because most people would rather just wait that initial period and have a slightly more biased estimate and go from there. But if you are concerned about the bias during this initial phase, while your exponentially weighted moving average is still warming up. Then bias correction can help you get a better estimate early on.
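A minimal sketch of the same loop with bias correction added (illustrative, continuing the previous snippet's style):

```python
def ewa_bias_corrected(thetas, beta=0.9):
    """Exponentially weighted average with bias correction (t starts at 1)."""
    v, vs = 0.0, []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        vs.append(v / (1 - beta ** t))   # divide by (1 - beta^t) to remove the startup bias
    return vs
```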

Gradient Descent With Momentum

Compute dW, db on current mini-batch

Then compute V

$$V_{dW} = \beta V_{dW} + (1-\beta)\,dW, \quad V_{db} = \beta V_{db} + (1-\beta)\,db$$

Then update parameters
$$W := W - \alpha V_{dW}, \quad b := b - \alpha V_{db}$$

What this does is smooth out steps of gradient descent.
The most common value for β is 0.9

With a few iterations you find that the gradient descent with momentum ends up eventually just taking steps that are much smaller oscillations in the vertical direction, but are more directed to just moving quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in this path to the minimum.
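A minimal numpy sketch of one momentum update for a single layer (the function and variable names are illustrative; vW and vb start as zero arrays with the same shapes as W and b):

```python
def momentum_update(W, b, dW, db, vW, vb, alpha=0.01, beta=0.9):
    """One parameter update of gradient descent with momentum."""
    vW = beta * vW + (1 - beta) * dW   # exponentially weighted average of dW
    vb = beta * vb + (1 - beta) * db   # exponentially weighted average of db
    W = W - alpha * vW
    b = b - alpha * vb
    return W, b, vW, vb
```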

RMSprop (Root Mean Squared prop)

Compute dW, db on current mini-batch

Then compute S

$$S_{dW} = \beta S_{dW} + (1-\beta)\,dW^2, \quad S_{db} = \beta S_{db} + (1-\beta)\,db^2 \quad \text{(the squares are element-wise)}$$

Then update parameters
$$W := W - \alpha \frac{dW}{\sqrt{S_{dW}}}, \quad b := b - \alpha \frac{db}{\sqrt{S_{db}}}$$

The net effect of this is that your updates in the vertical direction are divided by a much larger number, which helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number.

And to make sure that your algorithm doesn't divide by 0, in practice you add a small $\varepsilon$ (e.g. $10^{-8}$) to the denominator: $\sqrt{S_{dW}} + \varepsilon$.

RMSprop, and similar to momentum, has the effects of damping out the oscillations in gradient descent, in mini-batch gradient descent. And allowing you to maybe use a larger learning rate alpha. And certainly speeding up the learning speed of your algorithm.
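A minimal numpy sketch of one RMSprop update (illustrative names; sW and sb start as zeros):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, sW, sb, alpha=0.001, beta=0.999, eps=1e-8):
    """One parameter update of RMSprop (squared gradients are element-wise)."""
    sW = beta * sW + (1 - beta) * dW ** 2
    sb = beta * sb + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(sW) + eps)   # eps keeps the denominator away from zero
    b = b - alpha * db / (np.sqrt(sb) + eps)
    return W, b, sW, sb
```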

Adam Optimization Algorithm

Adam stands for Adaptive Moment Estimation.

Compute dW, db on current mini-batch

Then compute V with momentum

$$V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\,dW, \quad V_{db} = \beta_1 V_{db} + (1-\beta_1)\,db$$

Then compute S with RMSprop
$$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\,dW^2, \quad S_{db} = \beta_2 S_{db} + (1-\beta_2)\,db^2$$

Then do bias correction
$$V_{dW}^{\text{corrected}} = \frac{V_{dW}}{1-\beta_1^t}, \quad V_{db}^{\text{corrected}} = \frac{V_{db}}{1-\beta_1^t}, \quad S_{dW}^{\text{corrected}} = \frac{S_{dW}}{1-\beta_2^t}, \quad S_{db}^{\text{corrected}} = \frac{S_{db}}{1-\beta_2^t}$$

Then update parameters
$$W := W - \alpha \frac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}, \quad b := b - \alpha \frac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}$$

So this algorithm combines the effect of gradient descent with momentum together with gradient descent with RMSprop.

Hyperparameters Choice

  • α: needs to be tuned
  • β₁: 0.9 (for the dW terms)
  • β₂: 0.999 (for the dW² terms)
  • ε: 10⁻⁸
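Putting the pieces together, a minimal numpy sketch of one Adam step for a single parameter tensor (illustrative names; the same defaults as above):

```python
import numpy as np

def adam_update(W, dW, vW, sW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter W (t is the step count, starting at 1)."""
    vW = beta1 * vW + (1 - beta1) * dW          # momentum term
    sW = beta2 * sW + (1 - beta2) * dW ** 2     # RMSprop term (element-wise square)
    v_corr = vW / (1 - beta1 ** t)              # bias correction
    s_corr = sW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, vW, sW

# The same update is applied to b with its own vb, sb accumulators.
```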

Learning Rate Decay

One of the things that might help speed up your learning algorithm, is to slowly reduce your learning rate over time. We call this learning rate decay.

$$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0, \text{ where decay\_rate and } \alpha_0 \text{ are hyperparameters}$$

Other Learning Rate Decay Methods

$$\alpha = 0.95^{\,\text{epoch\_num}}\,\alpha_0 \quad \text{(exponential decay)}$$

$$\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\,\alpha_0$$

or a discrete staircase, where $\alpha$ is dropped to a smaller constant every few epochs (e.g. as a function of $\lfloor \text{epoch\_num}/10 \rfloor$).
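A minimal sketch of these schedules in Python (the staircase variant shown here, halving α every 10 epochs, is one illustrative choice rather than an exact formula from the course):

```python
def decayed_learning_rate(alpha0, epoch_num, decay_rate=1.0, method="inverse"):
    """A few common learning rate decay schedules."""
    if method == "inverse":
        return alpha0 / (1 + decay_rate * epoch_num)
    if method == "exponential":
        return alpha0 * 0.95 ** epoch_num
    if method == "staircase":
        return alpha0 / (2 ** (epoch_num // 10))   # halve alpha every 10 epochs
    raise ValueError(method)

for epoch in range(5):
    print(decayed_learning_rate(0.2, epoch, decay_rate=1.0))
```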

Week3

Tuning Process

Hyperparameters

  • α: learning rate (most important)
  • β₁ ≈ 0.9: β in momentum
  • β₂ ≈ 0.999: β in RMSprop
  • $n^{[l]}$: # of hidden units
  • mini-batch size
  • L: # of layers
  • learning rate decay

Try random values and do not use a grid

Coarse to fine search

Using an Appropriate Scale to Pick hyperparameters

  • $n^{[l]}$: # of hidden units (uniform/linear scale)
  • L: # of layers (uniform/linear scale)
  • α: learning rate (log scale)
  • β: log scale (sample $1-\beta$ on a log scale)
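A minimal numpy sketch of sampling on these scales (the ranges, e.g. α ∈ [10⁻⁴, 1] and β ∈ [0.9, 0.999], are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha on a log scale between 1e-4 and 1
r = -4 * rng.random()          # r in (-4, 0]
alpha = 10 ** r

# beta between 0.9 and 0.999: sample 1 - beta on a log scale between 1e-3 and 1e-1
r = -3 + 2 * rng.random()      # r in (-3, -1)
beta = 1 - 10 ** r

# hidden units / layers: a uniform (linear) scale is fine
n_hidden = rng.integers(50, 101)
n_layers = rng.integers(2, 6)
```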

Hyperparameters Tuning in Practice: Pandas vs. Caviar

Intuitions do get stale. Re-evaluate occasionally.

Babysitting One Model

If you have maybe a huge data set but not a lot of computational resources, not a lot of CPUs and GPUs, so you can basically afford to train only one model or a very small number of models at a time, then you might gradually babysit that model even as it's training: watching its performance and patiently nudging the learning rate up or down. That's usually what happens if you don't have enough computational capacity to train a lot of models at the same time.

Training Many Models in Parallel

You might train many different models in parallel, where the orange lines in the figure are different models, and this way you can try a lot of different hyperparameter settings and then quickly, at the end, pick the one that works best. It looks like, in this example, maybe this curve looks best.

Normalizing Activations in a Network

Normalizing Inputs to Speed Up Learning

$$\mu = \frac{1}{m}\sum_i x^{(i)}, \quad \sigma^2 = \frac{1}{m}\sum_i \left(x^{(i)} - \mu\right)^2, \quad x := \frac{x - \mu}{\sqrt{\sigma^2}}$$

Implementing Batch Norm

Given some intermediate values in NN

$$\mu = \frac{1}{m}\sum_i z^{[l](i)}, \quad \sigma^2 = \frac{1}{m}\sum_i \left(z^{[l](i)} - \mu\right)^2$$

$$z^{[l](i)}_{\text{norm}} = \frac{z^{[l](i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \quad \tilde{z}^{[l](i)} = \gamma\, z^{[l](i)}_{\text{norm}} + \beta, \text{ where } \gamma, \beta \text{ are learnable parameters of the model}$$
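A minimal numpy sketch of the batch norm forward computation for one layer (illustrative names; γ and β would be learned along with W):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z (shape (n_l, m)) over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta          # gamma, beta have shape (n_l, 1) and are learned
    return Z_tilde

Z = np.random.randn(4, 32) * 3 + 1
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde = batch_norm_forward(Z, gamma, beta)
```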

Fitting Batch Norm Into a Neural Network

(figure: adding batch norm to a network)

Working With Mini-Batches

Because Batch Norm zeroes out the mean of these Z values in the layer, there’s no point having this parameter b.

(figure: working with mini-batches)

Implementing Gradient Descent

(figure: implementing gradient descent)

Why Batch Norm Works

One intuition behind why batch norm works is that, just as normalizing the input features speeds up learning, batch norm makes the values in the hidden units take on a similar range of values, not just the inputs.

A second reason why batch norm works is that it makes the weights in later or deeper layers of the network, say the weights in layer 10, more robust to changes in the weights of earlier layers, because batch norm limits the amount by which the distribution of the earlier layers' hidden unit values can shift around (covariate shift).

Batch Norm as Regularization

  • Each mini-batch is scaled by the mean / variance computed on just that mini-batch
  • This adds some noise to the values z within that mini-batch. So similar to dropout, it adds some noise to each hidden layer’s activations
  • This has a slight regularization effect because by adding noise to the hidden units, it forces the downstream hidden units not to rely too much on any one hidden unit.

Softmax Regression

If we have multiple possible classes, there’s a generalization of logistic regression called Softmax regression.

The number of units in the output layer (layer L) is going to equal C, the number of possible classes.

And the output $\hat{y}$ is going to be a C-by-1 dimensional vector, because it now has to output C numbers, giving you these C probabilities.

And the output layer's activation function is

$$a^{[L]}_i = \frac{e^{z^{[L]}_i}}{\sum_{j=1}^{C} e^{z^{[L]}_j}}$$

Understanding Softmax

Softmax regression generalizes logistic regression to C classes

Loss Function

$$\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j, \text{ where } y_j = 1 \text{ if } x \text{ belongs to class } j \text{, else } y_j = 0$$

$$J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \ldots, W^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$

Backward Prop

$$dZ^{[L]} = \hat{y} - y$$
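A minimal numpy sketch tying the softmax activation, the loss, and this gradient together (illustrative names; toy data):

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax of Z with shape (C, m); subtract the max for numerical stability."""
    e = np.exp(Z - Z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy_cost(Y_hat, Y):
    """Y and Y_hat have shape (C, m); Y is one-hot."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat + 1e-12)) / m

Z = np.random.randn(4, 8)                      # C = 4 classes, m = 8 examples
Y = np.eye(4)[:, np.random.randint(0, 4, 8)]   # random one-hot labels
Y_hat = softmax(Z)
cost = cross_entropy_cost(Y_hat, Y)
dZ = Y_hat - Y                                 # gradient of the cost w.r.t. Z at the output layer
```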
