gradient descent vs (mini-batch) stochastic gradient descent
Source: Internet  Editor: 程序博客网  Time: 2024/05/17 20:13
To explain the differences between alternative approaches to estimating the parameters of a model, let's take a look at a concrete example: Ordinary Least Squares (OLS) Linear Regression. The illustration below serves as a quick reminder of the different components of a simple linear regression model:
In Ordinary Least Squares (OLS) Linear Regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. Or, in other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samples i in our dataset of size n.
Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:
- Solving the model parameters analytically (closed-form equations)
- Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton's Method, Simplex Method, etc.)
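As a point of comparison for the iterative methods, here is a minimal sketch of the closed-form route using NumPy. The toy data, the variable names, and the choice of `np.linalg.lstsq` (rather than explicitly inverting a matrix) are assumptions made for illustration, not part of the original text.

```python
import numpy as np

# Toy data (assumed for illustration): y = 1 + 2x exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Prepend a column of ones so the first weight acts as the intercept.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution of Xb @ w = y.
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Sum of squared errors and mean squared error of the fitted line.
residuals = y - Xb @ w
sse = float(np.sum(residuals ** 2))
mse = sse / len(y)
```

Because the toy data lie exactly on a line, `w` recovers the intercept and slope and both error measures are numerically zero.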
GRADIENT DESCENT (GD)
Using the Gradient Descent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset).
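The cost function J(⋅), the sum of squared errors (SSE), can be written as:
$$J(\mathbf{w}) = \frac{1}{2} \sum_i \left(\text{target}^{(i)} - \text{output}^{(i)}\right)^2$$
(the factor 1/2 is a convention that cancels when the squared term is differentiated).
The magnitude and direction of the weight update are computed by taking a step in the opposite direction of the cost gradient:
$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j}$$
where η is the learning rate. The weights are then updated after each epoch via the following update rule:
$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}$$
where Δw is a vector that contains the weight updates of each weight coefficient w, which, for a linear model with $\text{output}^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)}$, are computed as follows:
$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = -\eta \sum_i \left(\text{target}^{(i)} - \text{output}^{(i)}\right)\left(-x_j^{(i)}\right) = \eta \sum_i \left(\text{target}^{(i)} - \text{output}^{(i)}\right) x_j^{(i)}$$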
Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), and each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows:
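The batch update described in this section can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the function name `batch_gd`, the toy data, and the hyperparameter values are assumptions, and the cost is taken to be the SSE scaled by 1/2 so that the gradient carries no factor of 2.

```python
import numpy as np

def batch_gd(X, y, eta=0.01, n_epochs=1000):
    """Batch gradient descent for linear regression (hypothetical helper)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    w = np.zeros(Xb.shape[1])
    costs = []
    for _ in range(n_epochs):
        errors = y - Xb @ w                      # target - output, all samples
        costs.append(0.5 * np.sum(errors ** 2))  # SSE cost scaled by 1/2
        w = w + eta * Xb.T @ errors              # one update per epoch
    return w, costs

# Toy data (assumed): y = 1 + 2x exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w, costs = batch_gd(X, y, eta=0.05, n_epochs=2000)
```

Note that the whole training set is touched once per weight update; `costs` should decrease from epoch to epoch as long as the learning rate is small enough for the data at hand.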
STOCHASTIC GRADIENT DESCENT (SGD)
In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. In case of very large datasets, using GD can be quite costly since we are only taking a single step for one pass over the training set -- thus, the larger the training set, the slower our algorithm updates the weights and the longer it may take until it converges to the global cost minimum (note that the SSE cost function is convex).
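In Stochastic Gradient Descent (SGD; sometimes also referred to as iterative or on-line GD), we don't accumulate the weight updates as we've seen above for GD:
- for one or more epochs:
    - for each weight j:
        - $w_j := w_j + \Delta w_j$, where $\Delta w_j = \eta \sum_i \left(\text{target}^{(i)} - \text{output}^{(i)}\right) x_j^{(i)}$
Instead, we update the weights after each training sample:
- for one or more epochs, or until approx. cost minimum is reached:
    - for training sample i:
        - for each weight j:
            - $w_j := w_j + \Delta w_j$, where $\Delta w_j = \eta \left(\text{target}^{(i)} - \text{output}^{(i)}\right) x_j^{(i)}$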
Here, the term "stochastic" comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient. Due to its stochastic nature, the path towards the global cost minimum is not "direct" as in GD, but may go "zig-zag" if we are visualizing the cost surface in a 2D space. However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex (or pseudo-convex)[1]. Furthermore, there are different tricks to improve the GD-based learning, for example:
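- An adaptive learning rate η: choosing a decrease constant d that shrinks the learning rate over time:
  $$\eta(t+1) := \frac{\eta(t)}{1 + t \times d}$$
- Momentum learning: adding a fraction α of the previous weight update to the current weight update for faster updates:
  $$\Delta \mathbf{w}_{t+1} := -\eta \nabla J(\mathbf{w}_t) + \alpha \Delta \mathbf{w}_t$$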
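The per-sample update and the two tricks listed above can be combined in one small sketch. Everything named here is an assumption for illustration: the helper `sgd`, the momentum coefficient `alpha`, the recursive decay `eta / (1 + t * d)`, and the hyperparameter values. With this recursive form the learning rate shrinks quickly, so `d` is kept very small.

```python
import numpy as np

def sgd(X, y, eta0=0.05, d=0.0, alpha=0.0, n_epochs=500, seed=0):
    """Per-sample SGD for linear regression (hypothetical helper).

    d:     decrease constant for the learning rate (0 disables decay)
    alpha: momentum coefficient (0 disables momentum)
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    w = np.zeros(Xb.shape[1])
    delta_w = np.zeros_like(w)
    eta, t = eta0, 0
    for _ in range(n_epochs):
        for i in rng.permutation(Xb.shape[0]):     # new order each epoch
            error = y[i] - Xb[i] @ w               # target - output
            # Single-sample gradient step plus a fraction of the last update.
            delta_w = eta * error * Xb[i] + alpha * delta_w
            w = w + delta_w
            eta = eta / (1.0 + t * d)              # shrink the learning rate
            t += 1
    return w

def sse(X, y, w):
    """Sum of squared errors of the linear model with weights w."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return float(np.sum((y - Xb @ w) ** 2))

# Toy data (assumed): y = 1 + 2x exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w_plain = sgd(X, y)                                # no decay, no momentum
w_tuned = sgd(X, y, eta0=0.02, d=1e-6, alpha=0.5)  # decay and momentum
```

On noisy real data the weights would keep jittering around the minimum rather than settle exactly, which is one motivation for shrinking the learning rate over time.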
A NOTE ABOUT SHUFFLING
There are several different flavors of SGD, all of which can be seen throughout the literature. Let's take a look at the three most common variants:
A)
- randomly shuffle samples in the training set
    - for one or more epochs, or until approx. cost minimum is reached
        - for training sample i
            - compute gradients and perform weight updates
B)
- for one or more epochs, or until approx. cost minimum is reached
    - randomly shuffle samples in the training set
        - for training sample i
            - compute gradients and perform weight updates
C)
- for iterations t, or until approx. cost minimum is reached:
    - draw random sample from the training set
        - compute gradients and perform weight updates
In scenario C, we draw the training samples randomly with replacement from the training set [2]. If the number of iterations t is equal to the number of training samples, we learn the model based on a bootstrap sample of the training set.
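The three sampling schemes differ only in the order in which sample indices are visited, which the following sketch makes explicit: shuffling once before training (A), reshuffling at the start of every epoch (B), and drawing each index with replacement (C). The function names and the sizes used are hypothetical.

```python
import numpy as np

def order_a(n, n_epochs, rng):
    """A) Shuffle once, then reuse the same order in every epoch."""
    perm = rng.permutation(n)
    return [perm.copy() for _ in range(n_epochs)]

def order_b(n, n_epochs, rng):
    """B) Reshuffle at the start of every epoch."""
    return [rng.permutation(n) for _ in range(n_epochs)]

def order_c(n, n_iter, rng):
    """C) Draw each sample index independently, with replacement."""
    return rng.integers(0, n, size=n_iter)

rng = np.random.default_rng(0)
a = order_a(5, 3, rng)   # three epochs, identical order
b = order_b(5, 3, rng)   # three epochs, each a fresh permutation
c = order_c(5, 5, rng)   # five draws; indices may repeat or be missing
```

In A and B every sample is visited exactly once per epoch; in C some samples may be drawn several times and others not at all, which is what makes a run of n draws a bootstrap sample.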
MINI-BATCH GRADIENT DESCENT (MB-GD)
Mini-Batch Gradient Descent (MB-GD) is a compromise between batch GD and SGD. In MB-GD, we update the model based on smaller groups of training samples; instead of computing the gradient from 1 sample (SGD) or all n training samples (GD), we compute the gradient from 1 < k < n training samples (a common mini-batch size is k=50).
MB-GD typically converges in fewer passes over the training set than GD because we update the weights more frequently; at the same time, MB-GD lets us utilize vectorized operations, which typically results in a computational performance gain over SGD.
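A minimal mini-batch sketch follows, under the same assumptions as before (linear model, SSE cost scaled by 1/2). The helper name `minibatch_gd`, the per-epoch reshuffling, and the tiny batch size `k=2` are illustrative choices for the toy data, not recommendations.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.05, k=2, n_epochs=500, seed=0):
    """Mini-batch gradient descent for linear regression (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    n = Xb.shape[0]
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(n)                   # reshuffle every epoch
        for start in range(0, n, k):
            batch = idx[start:start + k]           # indices of k samples
            errors = y[batch] - Xb[batch] @ w      # vectorized over the batch
            w = w + eta * Xb[batch].T @ errors     # one update per mini-batch
    return w

# Toy data (assumed): y = 1 + 2x exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w = minibatch_gd(X, y)
```

Setting `k=1` recovers per-sample SGD and `k=n` recovers batch GD, which is why MB-GD is often described as sitting between the two.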
REFERENCES
- [1] Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6
- [2] Bottou, Léon. "Large-scale machine learning with SGD." Proceedings of COMPSTAT'2010. Physica-Verlag HD, 2010. 177-186.
- [3] Bottou, Léon. "SGD tricks." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 421-436.