Batch size and performance
Source: Internet. Editor: 程序博客网. Time: 2024/06/15 21:55
From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/abs/1609.04836 :
The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually 32--512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions -- and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions.
[…]
The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by large positive eigenvalues in $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by small positive eigenvalues of $\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are almost invariably attracted to regions with sharp minima and that, unlike small batch methods, are unable to escape basins of these minimizers. […]
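To make "sharp" and "flat" concrete, here is a minimal sketch that estimates the Hessian of two toy two-parameter losses at their minimum and compares the eigenvalues. The functions `sharp_loss` and `flat_loss` and the helpers are hypothetical illustrations, not anything from the paper, and real networks have far too many parameters for this brute-force approach.

```python
import math

def hessian_2d(f, x, y, h=1e-4):
    """Central finite-difference Hessian entries (fxx, fxy, fyy) of f at (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b**2)
    return mean - radius, mean + radius

# Hypothetical toy losses, both minimized at (0, 0).
def sharp_loss(x, y):
    return 50.0 * x**2 + 20.0 * y**2   # exact Hessian: diag(100, 40)

def flat_loss(x, y):
    return 0.5 * x**2 + 0.1 * y**2     # exact Hessian: diag(1, 0.2)

sharp_eigs = eigenvalues_sym_2x2(*hessian_2d(sharp_loss, 0.0, 0.0))
flat_eigs = eigenvalues_sym_2x2(*hessian_2d(flat_loss, 0.0, 0.0))
print("sharp minimum eigenvalues:", sharp_eigs)
print("flat minimum eigenvalues: ", flat_eigs)

# The same small shift away from the minimizer costs far more at the sharp one.
shift = 0.1
print("loss after shift (sharp):", sharp_loss(shift, shift))
print("loss after shift (flat): ", flat_loss(shift, shift))
```

The last two lines show the intuition usually given for the generalization gap: if the test loss is a slightly shifted version of the training loss, a sharp minimizer pays much more for that shift than a flat one.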
Also, some good insights from Ian Goodfellow, answering the question "Why not use the whole training set to compute the gradient?" on Quora:
The size of the learning rate is limited mostly by factors like how curved the cost function is. You can think of gradient descent as making a linear approximation to the cost function, then moving downhill along that approximate cost. If the cost function is highly non-linear (highly curved) then the approximation will not be very good for very far, so only small step sizes are safe. You can read more about this in Chapter 4 of the deep learning textbook, on numerical computation: http://www.deeplearningbook.org/contents/numerical.html
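The link between curvature and safe step size can be seen on a one-dimensional quadratic, f(x) = 0.5 * c * x^2, where each gradient step multiplies x by (1 - lr * c). The sketch below is a toy illustration under that assumption; the function name `gradient_descent_quadratic` and the chosen constants are made up for the example.

```python
def gradient_descent_quadratic(curvature, lr, x0=1.0, steps=50):
    """Run plain gradient descent on f(x) = 0.5 * curvature * x**2."""
    x = x0
    for _ in range(steps):
        grad = curvature * x   # f'(x) for the quadratic above
        x -= lr * grad         # equivalent to x *= (1 - lr * curvature)
    return x

# Same learning rate, different curvature.
flat_result = gradient_descent_quadratic(curvature=1.0, lr=0.05)
sharp_result = gradient_descent_quadratic(curvature=100.0, lr=0.05)

# The highly curved function needs a smaller step to converge.
sharp_small_lr = gradient_descent_quadratic(curvature=100.0, lr=0.005)

print("curvature 1,   lr 0.05 :", flat_result)
print("curvature 100, lr 0.05 :", sharp_result)
print("curvature 100, lr 0.005:", sharp_small_lr)
```

On this quadratic the iteration only converges when lr < 2 / c, so a step that is fine for the gently curved function makes the highly curved one diverge.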
When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)). In other words, there are diminishing marginal returns to putting more examples in the minibatch. You can read more about this in Chapter 8 of the deep learning textbook, on optimization algorithms for deep learning: http://www.deeplearningbook.org/contents/optimization.html
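The O(sqrt(m)) claim can be checked empirically with a small simulation. This is a sketch under simplifying assumptions: the per-example "gradients" are synthetic scalars drawn from a Gaussian, minibatches are sampled with replacement, and the helper `minibatch_gradient_std` is a hypothetical name.

```python
import math
import random
import statistics

def minibatch_gradient_std(per_example_grads, m, trials, rng):
    """Empirical std of the minibatch-mean gradient for batch size m."""
    estimates = []
    for _ in range(trials):
        # Sample a minibatch with replacement and average its gradients.
        batch = [rng.choice(per_example_grads) for _ in range(m)]
        estimates.append(sum(batch) / m)
    return statistics.pstdev(estimates)

rng = random.Random(0)
# Synthetic per-example gradients for a single scalar parameter (assumption).
grads = [rng.gauss(1.0, 2.0) for _ in range(10_000)]
population_std = statistics.pstdev(grads)

results = {}
for m in (1, 4, 16, 64, 256):
    results[m] = minibatch_gradient_std(grads, m, trials=2000, rng=rng)
    print(f"m={m:4d}  std of estimate={results[m]:.4f}  "
          f"std*sqrt(m)={results[m] * math.sqrt(m):.4f}")
```

The last column should stay roughly constant and close to the per-example standard deviation: going from m = 1 to m = 256 costs 256 times more computation per step but reduces the noise in the gradient estimate only by a factor of about 16.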
Also, if you think about it, even using the entire training set doesn’t really give you the true gradient. The true gradient would be the expected gradient with the expectation taken over all possible examples, weighted by the data generating distribution. Using the entire training set is just using a very large minibatch size, where the size of your minibatch is limited by the amount you spend on data collection, rather than the amount you spend on computation.
Related: Batch gradient descent versus stochastic gradient descent