Stanford Machine Learning - Week 3 (Classification, Logistic Regression, Overfitting and How to Address It)

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 19:10

Logistic Regression

1. Classification

The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem, in which y can take on only two values, 0 and 1. For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, y∈{0,1}. 0 is also called the negative class, and 1 the positive class.

In short, classification uses a set of feature values to assign data to different categories, so the final output y is a discrete value. Spam filtering is one example.

2. Hypothesis function

The hypothesis function in logistic regression plays the same role and has the same meaning as the one in linear regression; only its form changes.

We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples below where this method performs very poorly.

example 1:

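[Figure omitted: a linear regression line fitted to a binary-labeled data set]

In the figure above, we can turn the hypothesis function into a classifier as follows:
if $h_\theta(x) \geq 0.5$, predict y = 1;
if $h_\theta(x) < 0.5$, predict y = 0. (Why 0.5 is covered in week 6 of the course. In brief, you can set the threshold higher, say 0.9, to make positive predictions more reliable, but that does not mean the resulting model is optimal.)

The problem with this rule is that if we then add one more data point (see the figure below), it no longer works.
[Figure omitted: the same data set with one additional data point]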

example 2:

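In logistic regression, $0 \leq h_\theta(x) \leq 1$, because $h_\theta(x)$ represents the probability that y = 1. In linear regression, $h_\theta(x)$ can be greater than 1 or less than 0, and its value does not represent a probability.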

Logistic Regression Model:

$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

hθ(x) will give us the probability that our output is 1. For example, hθ(x)=0.7 gives us a probability of 70% that our output is 1.

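$h_\theta(x) = P(y=1 \mid x;\theta)$, that is, the probability that y = 1 given x, parameterized by θ.

$P(y=0 \mid x;\theta) + P(y=1 \mid x;\theta) = 1$

Here g(z) is called the sigmoid function or the logistic function.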
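To make the hypothesis concrete, here is a minimal Python sketch of g(z) and of the hypothesis built on it. The function names `sigmoid` and `hypothesis` are illustrative choices, not from the course materials, and the two-branch form of `sigmoid` is only there to avoid floating-point overflow for very negative z.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)            # safe: z < 0, so exp(z) is at most 1
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to include the bias feature x0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# With all parameters at zero, every input gets probability 0.5.
print(hypothesis([0.0, 0.0], [1.0, 5.0]))
```

Because g(z) + g(-z) = 1, the two class probabilities always sum to 1, matching the identity above.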

3. Decision boundary

In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

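$h_\theta(x) \geq 0.5 \rightarrow y = 1$
$h_\theta(x) < 0.5 \rightarrow y = 0$

In other words, once $h_\theta(x)$ reaches 0.5 we take y to be 1, because the probability is at least one half.

From the graph of $h_\theta(x) = g(\theta^T x) = g(z)$, with $z = \theta^T x$, we can also conclude:

when $z \geq 0$, $g(z) \geq 0.5$, so $h_\theta(x) \geq 0.5$ and y = 1;
when $z < 0$, $g(z) < 0.5$, so $h_\theta(x) < 0.5$ and y = 0.

It follows immediately that:

when $\theta^T x \geq 0$, y = 1; when $\theta^T x < 0$, y = 0. In other words, $\theta^T x$ splits the data set into two parts, so we call the line (or curve) $\theta^T x = 0$ the decision boundary. Note that the decision boundary is a property of the hypothesis function and its parameters only, not of the data set.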

The Decision Boundary is a property of the hypothesis including the parameters θ0,θ1,θ2⋯, which is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function. And the data set is only used to fit the parameters theta.

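Consider an example:

Given $\theta_0 = -3$, $\theta_1 = 1$, $\theta_2 = 1$, we have $h_\theta(x) = g(-3 + x_1 + x_2)$, and from the derivation above:

$\theta^T x = -3 + x_1 + x_2 \geq 0 \rightarrow y = 1$
$\theta^T x = -3 + x_1 + x_2 < 0 \rightarrow y = 0$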

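So the decision boundary $\theta^T x = -3 + x_1 + x_2 = 0$ splits the original data set into two parts: the upper-right region, where y = 1, and the lower-left region, where y = 0.

[Figure omitted: the decision boundary $x_1 + x_2 = 3$ separating the two regions]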
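The example can be checked with a few lines of Python. This is only a sketch: `predict` is a hypothetical helper name, and the three test points are made up to land on either side of the boundary and exactly on it.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]               # theta_0, theta_1, theta_2 from the example

# Each input is [x0, x1, x2] with the bias feature x0 = 1.
print(predict(theta, [1.0, 2.0, 2.0]))   # x1 + x2 = 4, above the boundary -> 1
print(predict(theta, [1.0, 1.0, 1.0]))   # x1 + x2 = 2, below the boundary -> 0
print(predict(theta, [1.0, 1.5, 1.5]))   # x1 + x2 = 3, on the boundary    -> 1
```

Note that the sigmoid never has to be evaluated to classify a point; the sign of $\theta^T x$ is enough.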

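4. Cost function

[Figure omitted: the left panel shows a wavy, non-convex cost function]

We cannot use the same cost function that we used for linear regression, because the logistic function makes the resulting cost wavy, like the left-hand plot in the figure above, with many local optima. In other words, it is not a convex function.
Instead, our cost function for logistic regression looks like:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

Note: $0 \leq h_\theta(x) \leq 1$ is the probability that y = 1.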

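Because y only takes the values 0 and 1, the cost can also be written in the following form:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right) \right]$$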

A vectorized implementation is:

$$h = g(X\theta)$$

$$J(\theta) = \frac{1}{m}\left(-y^T \log(h) - (1 - y)^T \log(1 - h)\right)$$

If the correct answer y is 0:
the cost is 0 if the hypothesis also outputs 0;
the cost approaches infinity as the hypothesis approaches 1.

If the correct answer y is 1:
the cost is 0 if the hypothesis outputs 1;
the cost approaches infinity as the hypothesis approaches 0.

Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
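As a rough illustration, the cost can be computed directly from the summation form. The sketch below uses plain Python lists; `cost`, `sigmoid`, and the three training examples are assumptions made for this example, not course data. It does not guard against h being exactly 0 or 1, where the logarithm is undefined.

```python
import math

def sigmoid(z):
    """Logistic function g(z), written to avoid overflow for negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) over all examples."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += -y_i * math.log(h) - (1 - y_i) * math.log(1 - h)
    return total / m

# Made-up data: each row is [x0, x1] with x0 = 1.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 0, 1]

# With theta = 0 every prediction is 0.5, so the cost is log(2).
print(cost([0.0, 0.0], X, y))
```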

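5. Gradient descent

With the cost function in hand, the next step is to minimize J(θ) with gradient descent. In both the linear regression model and the logistic regression model, the basic form of gradient descent is the same; only the form of J(θ) changes.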

Gradient Descent
Remember that the general form of gradient descent is:

Repeat{
　　$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
}

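In logistic regression:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right) \right]$$

We can work out the derivative part using calculus to get:

Repeat{
$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
}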

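Notice that this update has the same form as the one we used in linear regression, and we still have to update all values in θ simultaneously. The difference is the hypothesis: here $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$, whereas in linear regression $h_\theta(x) = \theta^T x$.

A vectorized implementation is:

$$\theta := \theta - \frac{\alpha}{m} X^T \left(g(X\theta) - \vec{y}\right)$$

For the derivation, see the post on vectorizing the gradient descent algorithm.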
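The update rule can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the course's code: the names `gradient_descent` and `dot`, the default learning rate and iteration count, and the four training examples are all made up for the example.

```python
import math

def sigmoid(z):
    """Logistic function g(z), written to avoid overflow for negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for logistic regression, starting from theta = 0."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # h_theta(x^(i)) - y^(i) for every training example
        errors = [sigmoid(dot(theta, x_i)) - y_i for x_i, y_i in zip(X, y)]
        # (1/m) * sum_i error_i * x_j^(i) for every parameter j
        grad = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update: the new theta is built from the old one only.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up data: each row is [x0, x1] with x0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
print(gradient_descent(X, y))
```

Building the whole `grad` list before touching `theta` is what makes the update simultaneous; updating parameters one at a time inside the loop over j would mix old and new values.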

6. Advanced optimization

“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave provides them.

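One way to look at gradient descent is that it needs two things computed: first J(θ), and second $\frac{\partial}{\partial \theta_j} J(\theta)$. Besides gradient descent, the three algorithms above can also take these two quantities and use them to optimize θ. They are usually faster than gradient descent and do not require choosing α by hand, but they are more complex. Because of that complexity, we do not write them ourselves and use an existing library instead. All we need to do is write the cost function and tell Matlab which optimization routine to use.

[Figure omitted]

As the figure shows, we now use the Matlab function fminunc: we supply code that computes J(θ) and $\frac{\partial}{\partial \theta_j} J(\theta)$, and fminunc returns the optimized value of θ.

You first set a few options. `options` is a data structure that stores the options you want. Setting 'GradObj' to 'on' sets the gradient objective parameter to on, which just means you are indeed going to provide a gradient to the algorithm. Here the maximum number of iterations is set to, say, one hundred. We also give it an initial guess for theta, a 2 by 1 vector.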

    optTheta       % stores the final computed parameter values
    functionVal    % stores the value of the cost function at optTheta
    exitFlag       % indicates whether the run converged (1 means converged)
    @costFunction  % a handle to the function costFunction

    function [jVal, gradient] = costFunction(theta)
        % This function has two return values:
        % jVal is the value of the cost function,
        % gradient holds the partial derivatives with respect to the two parameters.
        jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
        gradient = zeros(2, 1);
        gradient(1) = 2 * (theta(1) - 5);
        gradient(2) = 2 * (theta(2) - 5);
    end

    >> options = optimset('GradObj', 'on', 'MaxIter', 100);
    >> initialTheta = zeros(2, 1);
    >> [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)

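So for both logistic regression and linear regression, all we need to do is fill in the part marked by the red rectangle in the figure below.

[Figure omitted: code template with the part to be filled in marked by a red rectangle]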

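7. Multi-class classification: One-vs-all

In short, multi-class means that the output y is no longer limited to 0 and 1. The idea for handling this is to split the training set into two parts each time, one class versus all the others, hence the name One-vs-all.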

One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes $h_\theta^{(i)}(x)$.

Approach:

[Figures omitted: three binary splits of the training set, one per class]

We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

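To solve this problem, we follow the splits in figures one, two, and three to train three classifiers, $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, and $h_\theta^{(3)}(x)$, which output the probabilities that y = class 1, y = class 2, and y = class 3 respectively. Given a new input x, we evaluate all three classifiers and pick the class with the largest value.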
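The prediction step can be sketched as below. The three parameter vectors are made-up values standing in for classifiers that would normally be trained first, and `predict_one_vs_all` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Logistic function g(z), written to avoid overflow for negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_one_vs_all(thetas, x):
    """thetas maps each class label to the parameters of that class's classifier.

    Returns the label whose classifier reports the highest probability for x.
    """
    scores = {
        label: sigmoid(sum(t * v for t, v in zip(theta, x)))
        for label, theta in thetas.items()
    }
    return max(scores, key=scores.get)

# Hypothetical, hand-picked parameters for three already-trained classifiers.
thetas = {
    1: [2.0, -1.0, -1.0],   # favors points near the origin
    2: [-2.0, 1.0, 0.0],    # favors large x1
    3: [-2.0, 0.0, 1.0],    # favors large x2
}

print(predict_one_vs_all(thetas, [1.0, 0.0, 0.0]))   # -> 1
print(predict_one_vs_all(thetas, [1.0, 5.0, 0.0]))   # -> 2
print(predict_one_vs_all(thetas, [1.0, 0.0, 5.0]))   # -> 3
```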

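8. Overfitting

Where there is overfitting, there must also be underfitting. Put simply, overfitting means the hypothesis function is too complex: it fits the training set perfectly but fails to predict new data. This happens in logistic regression as well as in linear regression. In each of the two figures below, the leftmost plot is underfitting, the rightmost is overfitting, and the middle one is just right. One cause of overfitting is having too little training data relative to the number of features.

[Figure omitted: underfitting (left), a good fit (middle), and overfitting (right)]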

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

[Figure omitted: underfitting (left), a good fit (middle), and overfitting (right)]

So how do we solve this problem?

There are two main options to address the issue of overfitting:

1) Reduce the number of features:
　　Manually select which features to keep.
　　Use a model selection algorithm (studied later in the course) to select them automatically.
2) Regularization
　　Keep all the features, but reduce the magnitude of parameters θj.
　　Regularization works well when we have a lot of slightly useful features.

9. Regularization

If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

Say we wanted to make the following function more quadratic:
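$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

We'll want to eliminate the influence of $\theta_3 x^3$ and $\theta_4 x^4$.

In short, we want to make the fourth-degree polynomial above behave approximately like a second-degree one, that is, remove the influence of the cubic and quartic terms without actually deleting them. How? By using the parameters θ to reduce the weight of those terms. If $\theta_3 x^3$ and $\theta_4 x^4$ both tend to 0, their influence on the expression becomes negligible.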

Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

$$\min_\theta \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$$

We've added two extra terms at the end to inflate the cost of $\theta_3$ and $\theta_4$. Now, for the cost function to get close to zero, we have to reduce $\theta_3$ and $\theta_4$ to near zero: their coefficients are so large that the cost cannot approach zero otherwise. This in turn greatly reduces the values of $\theta_3 x^3$ and $\theta_4 x^4$ in our hypothesis function. As a result, the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better thanks to the small extra terms $\theta_3 x^3$ and $\theta_4 x^4$.

[Figure omitted: the new hypothesis drawn as a pink curve]

In this example we knew the goal in advance (make the function resemble a quadratic polynomial), so we knew to penalize $\theta_3$ and $\theta_4$. But what if we do not know in advance which parameters to penalize? The only option is to penalize all of them. In the end, $h_\theta(x)$ keeps the same form but in effect becomes "simpler". This is the idea behind regularization.

We could also regularize all of our theta parameters in a single summation as:

$$\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

The term $\lambda\sum_{j=1}^{n}\theta_j^2$ is the regularization term, and λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.

Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting. Conversely, what would happen if λ = 0 or is too small?

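When λ is too large, $\theta_1, \theta_2, \theta_3, \theta_4$ all tend to 0, so $h_\theta(x) \approx \theta_0$ and the model underfits. If λ is too small, in my view regularization effectively stops working: with λ = 0 the penalty term vanishes and the cost reduces to the unregularized one.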

10. Regularized linear regression

We can apply regularization to both linear regression and logistic regression. We will approach linear regression first.

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
Goal: minimize J(θ)

Gradient Descent
We will modify our gradient descent function to separate out θ0 from the rest of the parameters because we do not want to penalize θ0.

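Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \quad j \in \{1, 2, \dots, n\}$
}

The first factor in the above equation, $1 - \alpha\frac{\lambda}{m}$, will always be less than 1, since α, λ, and m are all positive. Intuitively you can see it as reducing the value of $\theta_j$ by some amount on every update. Notice that the second term is exactly the same as it was before.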
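One update of this rule can be sketched in Python as follows. The function name `regularized_step`, the parameter name `lam` (for λ), and the tiny data set are assumptions for illustration. The data is chosen so that θ = [1, 2] fits it exactly, which isolates the shrinking effect of the factor $1 - \alpha\frac{\lambda}{m}$.

```python
def regularized_step(theta, X, y, alpha, lam):
    """One gradient descent update for regularized linear regression."""
    m = len(y)
    # h_theta(x^(i)) - y^(i), with the linear hypothesis h_theta(x) = theta^T x
    errors = [sum(t * v for t, v in zip(theta, x_i)) - y_i for x_i, y_i in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad_j = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        # theta_0 is not penalized, so it is not shrunk.
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m
        new_theta.append(t * shrink - alpha * grad_j)
    return new_theta

# Made-up data that theta = [1, 2] fits exactly (y = 1 + 2 * x1).
X = [[1.0, 1.0], [1.0, 2.0]]
y = [3.0, 5.0]

print(regularized_step([1.0, 2.0], X, y, alpha=0.1, lam=0.0))    # unchanged
print(regularized_step([1.0, 2.0], X, y, alpha=0.1, lam=10.0))   # theta_1 is halved
```

With a perfect fit the data term of the gradient is zero, so any change in θ comes from the regularization factor alone: here $1 - 0.1 \cdot 10 / 2 = 0.5$.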

Normal Equation
Now let’s approach regularization using the alternate method of the non-iterative normal equation.
To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

$$\theta = \left(X^T X + \lambda \cdot L\right)^{-1} X^T y, \qquad L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}$$

L is a matrix with 0 at the top left, 1's down the rest of the diagonal, and 0's everywhere else. It has dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including $x_0$), multiplied by a single real number λ.

Recall that if m < n, then $X^T X$ is non-invertible. However, when we add the term λ⋅L with λ > 0, $X^T X + \lambda \cdot L$ becomes invertible.
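The invertibility claim can be illustrated with the smallest possible case: one training example and two parameters, so $X^T X$ is a singular 2×2 matrix. The sketch below solves the regularized normal equation by hand for that 2×2 case only; the function name, the parameter name `lam`, and the data are assumptions made for the example, and a real implementation would use a linear algebra library.

```python
def regularized_normal_equation_2x2(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y for a bias term plus one feature."""
    # Entries of the symmetric matrix A = X^T X and of the vector b = X^T y.
    a00 = sum(r[0] * r[0] for r in X)
    a01 = sum(r[0] * r[1] for r in X)
    a11 = sum(r[1] * r[1] for r in X)
    b0 = sum(r[0] * yi for r, yi in zip(X, y))
    b1 = sum(r[1] * yi for r, yi in zip(X, y))
    a11 += lam                      # L = diag(0, 1): only theta_1 is penalized
    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("X^T X + lam * L is singular")
    # Closed-form inverse of a 2x2 symmetric matrix applied to b.
    theta0 = (a11 * b0 - a01 * b1) / det
    theta1 = (a00 * b1 - a01 * b0) / det
    return [theta0, theta1]

# One made-up example with two parameters: X^T X alone is singular.
X = [[1.0, 2.0]]
y = [3.0]

print(regularized_normal_equation_2x2(X, y, lam=1.0))   # -> [3.0, 0.0]
```

Calling the same function with `lam=0.0` raises the `ValueError`, because without the λ⋅L term the determinant is exactly zero.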

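11. Regularized logistic regression

Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \quad j \in \{1, 2, \dots, n\}, \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
}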
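The regularized cost function itself is not written out in the text of this section. A form consistent with the update rule above, obtained by adding an L2 penalty to the logistic regression cost from section 4, is sketched here; treat it as a reconstruction from that update rule rather than a quotation.

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h_\theta(x^{(i)})
          + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right) \right]
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
  + \frac{\lambda}{m}\theta_j \qquad (j \geq 1)
```

Substituting this derivative into $\theta_j := \theta_j - \alpha\,\partial J/\partial\theta_j$ and collecting the $\theta_j$ terms gives the factor $1 - \alpha\frac{\lambda}{m}$. The penalty sum starts at j = 1, so $\theta_0$ is updated without shrinkage.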
