Logistic Regression (Part 2)

Cost Function

For the linear regression model, we defined the cost function J(θ) as:

J(θ) = (1/(2m)) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²

Now, if we simply reuse this definition for the logistic regression model, the problem is that hθ(x) = g(z), where g is the S-shaped sigmoid function. The resulting cost function J(θ) would look like the left-hand plot in the figure below; such a function is called a non-convex function.

[Figure: a non-convex J(θ) with many local minima (left) versus a convex, bowl-shaped J(θ) (right)]

Note: the naming convention for convex functions used abroad is the opposite of the one traditionally used in China.

This means that the cost function J(θ) would have many local minima, which interferes with gradient descent. To make J(θ) look like the right-hand (convex) plot in the figure above, we redefine the cost function J(θ) as:

J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾)

where

Cost(hθ(x), y) = −log(hθ(x))        if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))    if y = 0

The relationship between hθ(x) and Cost(hθ(x), y) is shown in the figure below:

[Figure: Cost(hθ(x), y) as a function of hθ(x), one curve for y = 1 and one for y = 0]

When y = 1:

  • if hθ(x) → 1, then Cost(hθ(x), y) = 0;
  • if hθ(x) → 0, then Cost(hθ(x), y) → ∞.

When y = 0:

  • if hθ(x) → 0, then Cost(hθ(x), y) = 0;
  • if hθ(x) → 1, then Cost(hθ(x), y) → ∞.
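
To make the limits in the list above concrete, here is a minimal Octave sketch (not from the original post) that evaluates the piecewise cost for a single training example; the helper name logisticCost is made up for illustration:

% Piecewise cost for a single training example with hypothesis value h and label y.
function c = logisticCost(h, y)
  if y == 1
    c = -log(h);        % 0 when h -> 1, grows without bound as h -> 0
  else
    c = -log(1 - h);    % 0 when h -> 0, grows without bound as h -> 1
  end
end

For example, logisticCost(0.99, 1) is about 0.01, while logisticCost(0.01, 1) is about 4.6 and keeps growing as hθ(x) approaches 0.
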
Supplementary Notes
Cost Function

We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.

Instead, our cost function for logistic regression looks like:

J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾)
Cost(hθ(x), y) = −log(hθ(x))        if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))    if y = 0

When y = 1, we get the following plot for J(θ) vs hθ(x):

[Figure: −log(hθ(x)), equal to 0 at hθ(x) = 1 and increasing without bound as hθ(x) → 0]

Similarly, when y = 0, we get the following plot for J(θ) vs hθ(x):

[Figure: −log(1 − hθ(x)), equal to 0 at hθ(x) = 0 and increasing without bound as hθ(x) → 1]


If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.

If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.

Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.

Gradient Descent

Earlier, in order to obtain a convex function (a concave function under the Chinese naming convention), we redefined the cost function J(θ) as:

J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾)

The function Cost(hθ(x), y) has different expressions for y = 1 and y = 0. Since y only takes the values 0 and 1, the two cases can be merged into a single expression:

Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))

Therefore, the cost function J(θ) simplifies to:

J(θ) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾·log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾)·log(1 − hθ(x⁽ⁱ⁾)) ]

With this cost function J(θ) in hand, we use gradient descent to find the parameters θ that minimize J(θ).
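
As a quick illustration, the following is a minimal Octave sketch of this simplified cost function (the function name logisticCostFunction and the variable names are my own, not from the original post):

% Logistic regression cost.
% theta: n-by-1 parameters, X: m-by-n design matrix, y: m-by-1 labels (0 or 1).
function J = logisticCostFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));    % hθ(x) = g(θᵀx) for every training example
  J = -(1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h));
end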

Gradient Descent

To make the derivation of the update rule easier, consider the cost for a single training example:

Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))

Then:

∂Cost/∂θj = ( −y/hθ(x) + (1 − y)/(1 − hθ(x)) ) · ∂hθ(x)/∂θj
          = ( −y/hθ(x) + (1 − y)/(1 − hθ(x)) ) · hθ(x)·(1 − hθ(x))·xj
          = ( hθ(x) − y )·xj

where:

hθ(x) = g(θᵀx),  g(z) = 1/(1 + e^(−z)),  and g'(z) = g(z)·(1 − g(z)),  so ∂hθ(x)/∂θj = hθ(x)·(1 − hθ(x))·xj

Note: the derivation above is taken from two posts by the blogger bitcarmanlee, "logistic回归详解(二):损失函数(cost function)详解" and "logistic回归详解(三):梯度下降训练方法". Thanks!

Therefore, over the full training set the gradient descent algorithm is:

Repeat {
    θj := θj − (α/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )·xj⁽ⁱ⁾    (simultaneously update all θj)
}

Although this expression is identical to the gradient descent update for the linear regression model, the hypothesis hθ(x) is different. For the logistic regression model it is:

hθ(x) = 1 / (1 + e^(−θᵀx))

When using gradient descent, it is still recommended to vectorize the parameters θ and the features x, which gives:

θ := θ − (α/m)·Xᵀ·( g(Xθ) − y )
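
A minimal Octave sketch of this vectorized update might look as follows (the function name gradientDescentLogistic, the learning rate alpha, and the iteration count are illustrative choices, not from the original post):

% Vectorized batch gradient descent for logistic regression.
% X: m-by-n design matrix, y: m-by-1 labels (0 or 1), theta: n-by-1 initial guess.
function theta = gradientDescentLogistic(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = 1 ./ (1 + exp(-X * theta));                  % g(Xθ) for all examples at once
    theta = theta - (alpha / m) * (X' * (h - y));    % simultaneous update of all θj
  end
end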


Supplementary Notes
Simplified Cost Function and Gradient Descent

We can compress our cost function's two conditional cases into one case:

Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))

Notice that when y is equal to 1, then the second term (1 − y)·log(1 − hθ(x)) will be zero and will not affect the result. If y is equal to 0, then the first term −y·log(hθ(x)) will be zero and will not affect the result.

We can fully write out our entire cost function as follows:

J(θ) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾·log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾)·log(1 − hθ(x⁽ⁱ⁾)) ]

A vectorized implementation is:

h = g(Xθ)
J(θ) = (1/m)·( −yᵀ·log(h) − (1 − y)ᵀ·log(1 − h) )

Gradient Descent

Remember that the general form of gradient descent is:

Repeat {
    θj := θj − α·∂J(θ)/∂θj
}

We can work out the derivative part using calculus to get:

Repeat {
    θj := θj − (α/m) Σᵢ₌₁ᵐ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )·xj⁽ⁱ⁾
}

Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.

A vectorized implementation is:

θ := θ − (α/m)·Xᵀ·( g(Xθ) − y )
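
Putting the vectorized cost and gradient together, a sketch like the following (written here for illustration, not taken from the course notes) returns both quantities in the form that an optimizer such as fminunc expects:

% Vectorized logistic regression cost and gradient in one function.
% theta: n-by-1 parameters, X: m-by-n design matrix, y: m-by-1 labels (0 or 1).
function [jVal, gradient] = logisticCostGrad(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % hθ(x) = g(Xθ)
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)); % J(θ)
  gradient = (1 / m) * (X' * (h - y));                     % ∂J(θ)/∂θ
end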


Advanced Optimization

Besides the gradient descent algorithm we studied earlier, there are more advanced algorithms that can be used in its place to minimize the cost function J(θ), for example the conjugate gradient method (Conjugate Gradient), BFGS (Broyden-Fletcher-Goldfarb-Shanno), and its limited-memory variant L-BFGS.

In Octave or MATLAB we can use the fminunc() function, which finds the minimum of an unconstrained nonlinear problem.

For example, suppose the cost function is J(θ) = (θ1 − 5)² + (θ2 − 5)², and we want to find the values of the parameters θ that minimize J(θ).

First, we create a file named costFunction.m and add the following code:

function [jVal, gradient] = costFunction(theta)
  % Value of the cost function J(θ) = (θ1 - 5)^2 + (θ2 - 5)^2
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  % Gradient of J(θ) with respect to θ1 and θ2
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end


Then, we type the following commands in Octave:

octave:3> options = optimset('GradObj', 'on', 'MaxIter', '100');
octave:4> initialTheta = zeros(2, 1);
octave:5> [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
optTheta =

   5.0000
   5.0000

functionVal = 1.5777e-030
exitFlag = 1

Here, the @ in the call to fminunc creates a handle (pointer) to the costFunction function defined in costFunction.m, and exitFlag = 1 indicates that the optimization has converged and found the value of θ that minimizes the cost function J(θ).

Additional notes (thanks to Giacche for the clarification in the comments):

fminunc is Octave's unconstrained minimization function. When calling it, we need to pass in a variable options that holds configuration settings.

options = optimset('GradObj', 'on', 'MaxIter', '100');

In the code above, the option pair 'GradObj', 'on' turns the gradient objective parameter on, telling fminunc that our function also returns the gradient. The pair 'MaxIter', '100' sets the maximum number of iterations to 100.

initialTheta = zeros(2,1);

initialTheta is the initial guess for θ that we supply.

When we call fminunc, it automatically picks one of several advanced optimization algorithms to use (you can also think of it as a gradient descent method that chooses a suitable learning rate α automatically).

In the end we get three return values: optTheta, the value of θ that minimizes the cost function J(θ); functionVal, the value of jVal defined in costFunction; and exitFlag, a status flag indicating whether the optimization converged (1 if it converged, 0 otherwise).

Supplementary Notes
Advanced Optimization

"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.

We first need to provide a function that evaluates the following two functions for a given input value θ:

J(θ)
∂J(θ)/∂θj    (for each j)

We can write a single function that returns both of these:

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Then we can use octave's "fminunc()" optimization algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()".

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
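
Tying the two ideas together, the logistic regression cost-and-gradient sketch from earlier could be minimized with fminunc roughly like this; the anonymous-function wrapper is only an assumed way to pass the extra X and y arguments, not something from the original notes:

% Assume X (m-by-n design matrix) and y (m-by-1 labels of 0s and 1s) are already
% defined, and logisticCostGrad is the illustrative function sketched earlier.
options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) logisticCostGrad(t, X, y), initialTheta, options);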
