[深度学习论文笔记][Optimization] Unit Tests for Stochastic Optimization

Schaul, Tom, Ioannis Antonoglou, and David Silver. “Unit tests for stochastic optimization.” arXiv preprint arXiv:1312.6055 (2013). [Citations: 17].


1 Idea

[Motivation] There are many stochastic optimization algorithms, such as SGD, AdaGrad, AdaDelta, and RMSprop. Which one is best?


[Idea] Establish a collection of benchmark problems (unit tests) for evaluating these optimization algorithms; a minimal sketch of one such test is given below.
• Each unit test is a small-scale, isolated, and well-understood difficulty.
• This contrasts with real-world scenarios, where many such difficulties are entangled.
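To make this concrete, here is a minimal sketch of what one such unit test could look like, assuming a simple (loss, gradient) oracle interface. The names quadratic_bowl and run_sgd, the noise level, and the pass/fail threshold are illustrative choices, not taken from the paper.

```python
import numpy as np

def quadratic_bowl(x):
    """A small, well-understood 1D loss: f(x) = 0.5 * x**2, so f'(x) = x."""
    return 0.5 * x ** 2, x

def run_sgd(oracle, x0=1.0, lr=0.1, steps=200, noise_std=0.1, seed=0):
    """Run plain SGD against a noisy gradient oracle and return the final loss."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(steps):
        _, grad = oracle(x)
        x -= lr * (grad + noise_std * rng.standard_normal())  # additive Gaussian noise
    return oracle(x)[0]

# The "unit test": on this easy, isolated prototype the optimizer must reach a small loss.
final_loss = run_sgd(quadratic_bowl)
assert final_loss < 1e-2, f"failed the convex-bowl unit test: loss={final_loss:.4f}"
```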


2 Prototypes

[Shape Prototypes] (a code sketch follows this list)
• Convex bowls (the local shape of the loss around an optimum).
• Long linear slopes.
• Non-convex.

• Non-differentiable (e.g., the L1 / absolute-value loss).
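The four shape prototypes could be written as scalar (loss, gradient) pairs roughly as below; the concrete formulas are my own illustrative choices, not the paper's exact functions.

```python
import numpy as np

# Each prototype maps a scalar parameter x to (loss, gradient).

def convex_bowl(x):
    return 0.5 * x ** 2, x                  # single minimum at x = 0

def linear_slope(x):
    return 2.0 * x, 2.0                     # long constant slope, no curvature

def non_convex(x):
    s = 1.0 / (1.0 + np.exp(-x))            # sigmoid-shaped loss: curved near the
    return s, s * (1.0 - s)                 # origin, nearly flat far away from it

def non_differentiable(x):
    return abs(x), np.sign(x)               # L1 / absolute value, kink at x = 0
```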


[Scale Prototypes] (a code sketch follows this list)
• Multiple orders of magnitude.
• Steep cliffs (e.g., RNNs).
• Plateaus (e.g., ReLU).
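One way to realize these scale prototypes is to rescale, steepen, or flatten a base prototype. Below is a rough sketch under that assumption; the rescale, cliff, and plateau helpers and their constants are illustrative, not from the paper.

```python
def rescale(prototype, scale):
    """Wrap a (loss, gradient) prototype so the same shape appears at a very different
    scale (e.g. scale = 1e-3 or 1e3), covering several orders of magnitude."""
    def scaled(x):
        loss, grad = prototype(x)
        return scale * loss, scale * grad
    return scaled

def cliff(x, steepness=1e4):
    """Flat for x <= 0 but extremely steep for x > 0: a single gradient step taken on
    the cliff face overshoots violently, as with exploding gradients in RNNs."""
    return (steepness * x, steepness) if x > 0 else (0.0, 0.0)

def plateau(x):
    """Exactly zero gradient on the flat side (like an inactive ReLU unit): an optimizer
    stranded there gets no learning signal even though the loss is not at its minimum."""
    return (1.0, 0.0) if x < 0 else (1.0 / (1.0 + x), -1.0 / (1.0 + x) ** 2)
```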


[Noise Prototypes] (a code sketch follows this list)

• Additive Gaussian noise.
• Multiplicative noise (scale-proportional).
• Mask-out noise (dropout).
• Outliers.
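The noise prototypes can be viewed as wrappers that corrupt a clean gradient before the optimizer sees it. A sketch under that assumption; the wrapper names and default noise levels are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_gaussian(grad, std=1.0):
    """Gradient plus zero-mean Gaussian noise of a fixed scale."""
    return grad + std * rng.standard_normal()

def multiplicative(grad, std=0.5):
    """Noise whose magnitude is proportional to the gradient itself."""
    return grad * (1.0 + std * rng.standard_normal())

def mask_out(grad, p=0.5):
    """Dropout-style: with probability p the gradient is zeroed out entirely."""
    return 0.0 if rng.random() < p else grad

def with_outliers(grad, p=0.01, magnitude=100.0):
    """Rarely, replace the gradient with a huge spurious value."""
    return magnitude * rng.standard_normal() if rng.random() < p else grad
```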


3 Combinations

[1D] Shape/scale/noise can be varied independently.


[Higher Dimensions] Any of the 1D prototype combinations can be stacked together to form higher-dimensional test functions; a sketch follows below.
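Here is a sketch of how independent 1D prototypes might be stacked into a higher-dimensional test function by summing one prototype per coordinate; the combine helper and the concrete 3-D example are my own illustration of the idea.

```python
import numpy as np

def combine(prototypes):
    """Stack independent 1D (loss, gradient) prototypes, one per coordinate,
    into a single n-dimensional test function by summing their losses."""
    def loss_and_grad(x):
        losses, grads = zip(*(p(xi) for p, xi in zip(prototypes, x)))
        return sum(losses), np.array(grads)
    return loss_and_grad

# Example: a 3-D test mixing a convex bowl, an L1 kink, and a linear slope.
f = combine([lambda v: (0.5 * v ** 2, v),
             lambda v: (abs(v), np.sign(v)),
             lambda v: (2.0 * v, 2.0)])
loss, grad = f(np.array([1.0, -1.0, 0.5]))   # loss = 2.5, grad = [1., -1., 2.]
```

Correlated or ill-conditioned variants could then be obtained by, e.g., rotating or unevenly rescaling the coordinates before evaluating f, and a saddle point by pairing a bowl with an inverted bowl along different coordinates (not sketched here).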


[Correlations/Conditioning]


[Saddle Points]


4 Results
• It is difficult to substantially beat well-tuned SGD in performance on most unit tests.
• Hyper-parameter tuning matters much less for the adaptive algorithms (AdaGrad, AdaDelta, RPROP, RMSprop) than for the non-adaptive SGD variants (see the sketch after this list).
• Most algorithms saturate under high noise.
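As a reminder of why tuning matters less for the adaptive methods, here is a minimal sketch contrasting a plain SGD step with an AdaGrad step (standard update rules written from memory, not code from the paper): AdaGrad divides by the accumulated gradient magnitude, so its effective step size self-normalizes to the problem's scale.

```python
import numpy as np

def sgd_step(x, grad, lr=0.01):
    """Plain SGD: the step is lr * grad, so lr must be tuned to the problem's scale."""
    return x - lr * grad

def adagrad_step(x, grad, accum, lr=1.0, eps=1e-8):
    """AdaGrad: each coordinate's step is divided by the root of its accumulated squared
    gradients, so a single default lr works across widely different scales."""
    accum = accum + grad ** 2
    return x - lr * grad / (np.sqrt(accum) + eps), accum
```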


5 References

[1] ICLR 2014 talk: https://www.youtube.com/watch?v=9GF9UB6kcxs