Practical Aspects of Deep Learning: Quiz 2


1. Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th mini-batch?
a[3]{7}(8)

a[8]{3}(7)

a[8]{7}(3)

a[3]{8}(7)
Explanation: in Andrew Ng's notation, a[l] denotes the activations of the l-th layer, a(i) denotes the i-th training example, a_i denotes the i-th unit (feature), and a{k} denotes the k-th mini-batch. The 3rd layer's activations for the 7th example of the 8th mini-batch are therefore written a[3]{8}(7).

2. Which of these statements about mini-batch gradient descent do you agree with?

You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).

One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
Explanation: with batch gradient descent, one epoch yields only a single gradient-descent step; with mini-batch gradient descent, one epoch yields as many gradient-descent steps as there are mini-batches.
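For example, with a hypothetical training set of m = 5,000,000 examples and a mini-batch size of 1,000 (numbers in the spirit of the lecture), one epoch of mini-batch gradient descent performs 5,000 parameter updates, whereas one epoch of batch gradient descent performs only one.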

3. Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

If the mini-batch size is 1, you end up having to process the entire training set before making any progress.

If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.

If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.

If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
Explanation:
a. Batch gradient descent:
- Performs one gradient-descent step over all m training examples, so each iteration takes a long time;
- The cost function decreases on every iteration.
b. Stochastic gradient descent:
- Performs one gradient-descent step per training example, but loses the speed-up from vectorization;
- The cost function trends toward the minimum overall, but never settles at the global minimum and keeps oscillating.
c. Mini-batch gradient descent:
- Choosing a suitable size with 1 < size < m gives fast learning while still benefiting from vectorization; a minimal loop sketch follows this list.
- The behavior of the cost function falls between the two cases above.
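As a concrete illustration, here is a minimal sketch of mini-batch gradient descent on a toy linear-regression problem (the data, sizes, and learning rate are all made up for illustration and are not part of the quiz):

```python
import numpy as np

# Toy data in the course's column-per-example convention: X is (n_x, m), y is (1, m).
rng = np.random.default_rng(0)
n_x, m = 2, 1024
X = rng.normal(size=(n_x, m))
y = np.array([[3.0, -1.0]]) @ X + 0.5 + 0.1 * rng.normal(size=(1, m))

W, b = np.zeros((1, n_x)), 0.0
alpha, batch_size, num_epochs = 0.1, 64, 20   # 1 < batch_size < m

for epoch in range(num_epochs):
    perm = rng.permutation(m)                  # reshuffle each epoch
    X_shuf, y_shuf = X[:, perm], y[:, perm]
    # An explicit loop over mini-batches is unavoidable; vectorization
    # happens *within* each mini-batch, across its examples.
    for k in range(0, m, batch_size):
        X_mb = X_shuf[:, k:k + batch_size]
        y_mb = y_shuf[:, k:k + batch_size]
        mb = X_mb.shape[1]

        err = W @ X_mb + b - y_mb              # vectorized forward pass and error
        dW = err @ X_mb.T / mb                 # gradients from this mini-batch only
        db = np.sum(err) / mb

        W -= alpha * dW                        # one update per mini-batch, so
        b -= alpha * db                        # m / batch_size updates per epoch

print(W, b)   # should end up near [[3, -1]] and 0.5
```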

4. Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
[Figure: cost J plotted against the number of iterations]
Which of the following do you agree with?

Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.

If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.

If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.

Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
Explanation:
The typical cost-function trends for batch gradient descent and mini-batch gradient descent are shown in the figure below:
[Figure: typical cost-function curves for batch gradient descent and mini-batch gradient descent]
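A small self-contained sketch (synthetic one-dimensional data, illustrative only) that reproduces the qualitative picture: the cost measured on each mini-batch is noisy, while the cost on the full training set decreases much more smoothly:

```python
import numpy as np

# Fit y ≈ 2x with mini-batch gradient descent and record two cost curves:
# the cost on the current mini-batch (noisy) and on the full set (smooth).
rng = np.random.default_rng(0)
x = rng.normal(size=4096)
y = 2.0 * x + 0.3 * rng.normal(size=4096)

w, alpha, bs = 0.0, 0.05, 64
mb_costs, full_costs = [], []
for k in range(0, x.size, bs):
    xb, yb = x[k:k + bs], y[k:k + bs]
    err = w * xb - yb
    mb_costs.append(np.mean(err ** 2) / 2)             # oscillates from batch to batch
    w -= alpha * np.mean(err * xb)                      # update from this mini-batch
    full_costs.append(np.mean((w * x - y) ** 2) / 2)    # trends down almost monotonically

print(f"first/last mini-batch cost: {mb_costs[0]:.3f} / {mb_costs[-1]:.3f}, "
      f"final full cost: {full_costs[-1]:.3f}")
```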

5. Suppose the temperatures in Casablanca over the first two days of January are the same:
Jan 1st: θ1 = 10°C
Jan 2nd: θ2 = 10°C
(We used Fahrenheit in lecture, so we will use Celsius here in honor of the metric world.)
Say you use an exponentially weighted average with β = 0.5 to track the temperature: v0 = 0, vt = β·vt−1 + (1 − β)·θt. If v2 is the value computed after day 2 without bias correction, and v2_corrected is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don’t actually need one. Remember what bias correction is doing.)
v2 = 10, v2_corrected = 7.5

v2 = 10, v2_corrected = 10

v2 = 7.5, v2_corrected = 10

v2 = 7.5, v2_corrected = 7.5
Explanation:
v2 = 0.5 · θ2 + 0.5 · 0.5 · θ1 = 0.5 · 10 + 0.25 · 10 = 7.5
v2_corrected = v2 / (1 − 0.5²) = 7.5 / 0.75 = 10
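The same arithmetic as a tiny script (nothing here beyond the quiz's numbers):

```python
beta = 0.5
thetas = [10.0, 10.0]        # θ1 (Jan 1st) and θ2 (Jan 2nd), in °C

v = 0.0                      # v0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)   # bias correction removes the v0 = 0 start-up bias
    print(t, v, v_corrected)
# prints: 1 5.0 10.0
#         2 7.5 10.0
```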

6. Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
α = (1/√t) · α0

α = 1/(1 + 2t) · α0

α = 0.95^t · α0

α = e^t · α0
Explanation: the last scheme is the answer, because e^t grows with t, so α would increase over time instead of decaying.
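A quick numeric sketch (hypothetical α0 and epoch range) showing that the first three schedules shrink α while e^t · α0 explodes:

```python
import math

alpha0 = 0.2
for t in range(1, 6):                        # t is the epoch number
    inv_sqrt   = alpha0 / math.sqrt(t)       # α = (1/√t) · α0       -- decays
    inv_linear = alpha0 / (1 + 2 * t)        # α = 1/(1 + 2t) · α0   -- decays
    exp_decay  = alpha0 * 0.95 ** t          # α = 0.95^t · α0       -- decays
    exp_growth = alpha0 * math.e ** t        # α = e^t · α0          -- grows, not a decay scheme
    print(t, round(inv_sqrt, 3), round(inv_linear, 3),
          round(exp_decay, 3), round(exp_growth, 3))
```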

7. You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: vt = β·vt−1 + (1 − β)·θt. The red line below was computed using β = 0.9. What would happen to your red curve as you vary β? (Check the two that apply)
[Figure: daily temperature readings with their exponentially weighted average (β = 0.9) shown in red]
Decreasing β will shift the red line slightly to the right.

Increasing β will shift the red line slightly to the right.

Decreasing β will create more oscillation within the red line.

Increasing β will create more oscillations within the red line.
Explanation:
With β = 0.9, the exponentially weighted average is the red line in the figure;
with β = 0.98, it becomes the green line (smoother, shifted slightly to the right);
with β = 0.5, it becomes the yellow line (noisier, tracking the data more closely).

[Figure: exponentially weighted averages with β = 0.9 (red), β = 0.98 (green), and β = 0.5 (yellow)]
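A small sketch (the temperature series is synthetic, for illustration only) that makes the trade-off concrete: β controls roughly how many past days are averaged (about 1/(1 − β)), so a larger β gives a smoother but right-shifted (lagging) curve, while a smaller β tracks the data closely and oscillates more:

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(365)
# Synthetic "daily temperature": seasonal trend plus noise (illustration only).
theta = 10 + 10 * np.sin(2 * np.pi * days / 365) + 3 * rng.normal(size=days.size)

def ewa(series, beta):
    v, out = 0.0, []
    for x in series:
        v = beta * v + (1 - beta) * x   # vt = β·vt−1 + (1 − β)·θt
        out.append(v)
    return np.array(out)

for beta in (0.5, 0.9, 0.98):
    v = ewa(theta, beta)
    window = round(1 / (1 - beta))      # ~2, 10, and 50 days respectively
    jitter = float(np.std(np.diff(v)))  # smaller jitter means a smoother curve
    print(beta, window, round(jitter, 2))
```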

8. Consider this figure:
[Figure: three optimization trajectories labeled (1), (2), and (3)]
These plots were generated with gradient descent, with gradient descent with momentum (β = 0.5), and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?

(1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)

(1) is gradient descent with momentum (small β). (2) is gradient descent. (3) is gradient descent with momentum (large β)

(1) is gradient descent. (2) is gradient descent with momentum (large β) . (3) is gradient descent with momentum (small β)

(1) is gradient descent with momentum (small β), (2) is gradient descent with momentum (large β), (3) is gradient descent
Explanation: Momentum takes the previous gradients into account; the larger β is, the more of them are averaged in, which makes the descent path look smoother (less oscillation).
Momentum can be applied to batch gradient descent, mini-batch gradient descent, and stochastic gradient descent. A minimal update sketch follows.
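A minimal sketch of a single momentum update (generic, hypothetical names; dW and db stand for the gradients computed on the current batch or mini-batch):

```python
def momentum_step(W, b, dW, db, vW, vb, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update. A larger beta averages in
    more of the past gradients, which damps oscillations between steps."""
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    return W - alpha * vW, b - alpha * vb, vW, vb

# Toy usage with scalar "parameters", just to show the call pattern.
W, b, vW, vb = 1.0, 0.0, 0.0, 0.0
W, b, vW, vb = momentum_step(W, b, dW=0.5, db=0.1, vW=vW, vb=vb)
print(W, b)
```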

9. Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W[1], b[1], …, W[L], b[L]). Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)

Try better random initialization for the weights

Try using Adam

Try initializing all the weights to zero

Try mini-batch gradient descent

Try tuning the learning rate α

10. Which of the following statements about Adam is False?

We usually use "default" values for the hyperparameters β1, β2 and ε in Adam (β1 = 0.9, β2 = 0.999, ε = 10^−8)

The learning rate hyperparameter α in Adam usually needs to be tuned.

Adam should be used with batch gradient computations, not with mini-batches.

Adam combines the advantages of RMSProp and momentum
Explanation:
The basic idea of Adam is to combine Momentum and RMSprop into a single optimization algorithm that works well across a range of deep-learning architectures.
Hyperparameter choices:
α: needs to be tuned;
β1: common default 0.9 (exponentially weighted average of dW);
β2: recommended 0.999 (exponentially weighted average of dW²);
ε: recommended 10^−8.
A one-step sketch of the update follows.
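A sketch of the Adam update for a single scalar parameter, using the default hyperparameters quoted above (names and the toy gradient are illustrative):

```python
def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: a momentum-style average of dw (v) combined with an
    RMSprop-style average of dw**2 (s), both bias-corrected using the step count t."""
    v = beta1 * v + (1 - beta1) * dw          # first moment (momentum term)
    s = beta2 * s + (1 - beta2) * dw ** 2     # second moment (RMSprop term)
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (s_hat ** 0.5 + eps)
    return w, v, s

# Toy usage: a few steps minimizing J(w) = w**2, whose gradient is dw = 2w.
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 4):
    w, v, s = adam_step(w, dw=2.0 * w, v=v, s=s, t=t)
    print(t, w)
```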
