Regularization

The Problem of Overfitting

As shown in the figure above, the first model fits the dataset with a single-variable (straight-line) linear regression, and the fit is poor; we call this underfitting, or high bias. The second fits the dataset with a quadratic polynomial, and the fit is just right. The third fits the dataset with a fourth-order polynomial; although it fits the training data very well, the curve swings up and down and cannot make good predictions on new data, so we call this overfitting, or high variance.

Logistic regression models exhibit the same three behaviours, as shown in the figure below:


By the same analysis as for linear regression, the first model underfits, the second fits just right, and the third overfits.

Now let us look at the definition of overfitting:


That is, if a dataset has many feature variables and we fit it with a high-order polynomial, the model may appear to fit every training example very well, yet it handles new data poorly; in other words it generalizes badly (generalization refers to a hypothesis's ability to apply to new examples). We call this overfitting.

Question:
Consider the medical diagnosis problem of classifying tumors as malignant or benign. If a hypothesis hθ(x) has overfit the training set, it means that:
A. It makes accurate predictions for examples in the training set and generalizes well to make accurate predictions on new, previously unseen examples.
B. It does not make accurate predictions for examples in the training set, but it does generalize well to make accurate predictions on new, previously unseen examples.
C. It makes accurate predictions for examples in the training set, but it does not generalize well to make accurate predictions on new, previously unseen examples.
D. It does not make accurate predictions for examples in the training set and does not generalize well to make accurate predictions on new, previously unseen examples.

From the definition of overfitting, C is clearly the correct answer.

There are the following ways to address the overfitting problem:

  1. Reduce the number of feature variables:
    • Manually select which feature variables to keep
    • Use a model selection algorithm to select feature variables automatically
  2. Regularization: keep all the feature variables, but reduce the magnitude of the parameters θj
Supplementary Notes
The Problem of Overfitting

Consider the problem of predicting y from x ∈ R. The leftmost figure below shows the result of fitting y = θ0 + θ1x to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good.



Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:

  1. Reduce the number of features:
    • Manually select which features to keep.
    • Use a model selection algorithm (studied later in the course).
  2. Regularization
    • Keep all the features, but reduce the magnitude of parameters θj.
    • Regularization works well when we have a lot of slightly useful features.
Cost Function

If the hypothesis function is hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴, it will overfit the dataset shown in the figure below.


Now suppose that every feature variable x is important, so we cannot discard any of them. To solve the problem, we instead use regularization to shrink the values of the parameters θj.

To do this, we modify the cost function J(θ), for example by adding large penalty terms on θ3 and θ4:

  minθ (1/2m) Σi=1..m (hθ(x(i)) − y(i))² + 1000·θ3² + 1000·θ4²

When we then use gradient descent or one of the advanced optimization algorithms to find the parameters θ that minimize this J(θ), θ3 and θ4 end up having far less influence on predictions for new data than before. Why?

This is because, with the penalty terms in place, minimizing J(θ) drives the values of θ3 and θ4 very close to 0. The hypothesis hθ(x) can then almost be rewritten as hθ(x) = θ0 + θ1x + θ2x².


If a dataset has a great many feature variables and every one of them matters, then to avoid overfitting we can modify the cost function J(θ) to:

  J(θ) = (1/2m) [ Σi=1..m (hθ(x(i)) − y(i))² + λ Σj=1..n θj² ]

Here λ is called the regularization parameter, and this method is therefore called regularization.

Note: θ0 is not included in the regularization term.

The regularization parameter λ must also be chosen with care. If it is too large, θ1, θ2, θ3 and θ4 all end up very close to 0, and the hypothesis hθ(x) effectively reduces to hθ(x) = θ0.


The result is the flat red line in the figure, i.e. underfitting.

Supplementary Notes
Cost Function

If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

Say we wanted to make the following function more quadratic:

  hθ(x) = θ0 + θ1x + θ2x² + θ3x³ + θ4x⁴

We'll want to eliminate the influence of θ3x³ and θ4x⁴. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

  minθ (1/2m) Σi=1..m (hθ(x(i)) − y(i))² + 1000·θ3² + 1000·θ4²

We've added two extra terms at the end to inflate the cost of θ3 and θ4. Now, in order for the cost function to get close to zero, we will have to reduce the values of θ3 and θ4 to near zero. This will in turn greatly reduce the values of θ3x³ and θ4x⁴ in our hypothesis function. As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better due to the extra small terms θ3x³ and θ4x⁴.


We could also regularize all of our theta parameters in a single summation as:

  minθ (1/2m) [ Σi=1..m (hθ(x(i)) − y(i))² + λ Σj=1..n θj² ]
The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.

Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
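
To make this concrete, here is a minimal Octave sketch of the regularized linear-regression cost described above; the function name regularizedCost and the variables X, y, theta and lambda are placeholder names chosen for this example, not code from the course:

    function J = regularizedCost(X, y, theta, lambda)
      % X: m x (n+1) design matrix (first column all ones), y: m x 1 targets
      m = length(y);
      h = X * theta;                            % hypothesis h_theta(x) for every example
      reg = lambda * sum(theta(2:end) .^ 2);    % penalty term; theta(1), i.e. theta_0, is not regularized
      J = (1 / (2 * m)) * (sum((h - y) .^ 2) + reg);
    end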

Regularized Linear Regression

The regularized cost function J(θ) is:

  J(θ) = (1/2m) [ Σi=1..m (hθ(x(i)) − y(i))² + λ Σj=1..n θj² ]

We now use the gradient descent algorithm and the normal equation, both covered earlier, to find the value of θ that minimizes this J(θ).

Gradient Descent

Since regularization leaves θ0 untouched, the gradient descent update becomes:

  Repeat {
    θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) x0(i)
    θj := θj − α [ (1/m) Σi=1..m (hθ(x(i)) − y(i)) xj(i) + (λ/m) θj ]    (j = 1, 2, ..., n)
  }

For j = 1, 2, ..., n the update can be rewritten as:

  θj := θj (1 − α·λ/m) − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) xj(i)

Since α, λ and m are all positive, 1 − α·λ/m < 1 always holds, so every iteration first shrinks θj slightly and then applies the usual gradient step.
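
A minimal vectorized Octave sketch of one such update step (X, y, theta, alpha and lambda are placeholder names for this illustration):

    m = length(y);
    grad = (1 / m) * (X' * (X * theta - y));   % gradient of the unregularized cost
    reg = (lambda / m) * theta;
    reg(1) = 0;                                % theta_0 is not penalized
    theta = theta - alpha * (grad + reg);      % one regularized gradient-descent step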

Normal Equation

The regularized normal equation is:

  θ = (XᵀX + λ·L)⁻¹ Xᵀ y,   where L = diag(0, 1, 1, ..., 1)

The matrix L has dimension (n+1)×(n+1).

When the number of training examples m is less than the number of features n, XᵀX is non-invertible (singular). In Octave, pinv() can still compute its pseudo-inverse, but inv() cannot compute an inverse.

Note: when the number of training examples m equals the number of features n, XᵀX may also be non-invertible (singular).

With a regularization parameter λ > 0, however, XᵀX + λ·L is invertible even when m ≤ n and XᵀX itself is singular, so inv() can also be used.
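
A minimal Octave sketch of the regularized normal equation (X, y and lambda are placeholder names for this illustration):

    nPlusOne = size(X, 2);              % X is the m x (n+1) design matrix
    L = eye(nPlusOne);
    L(1, 1) = 0;                        % zero in the top-left corner: theta_0 is not regularized
    theta = pinv(X' * X + lambda * L) * X' * y;   % inv() also works here when lambda > 0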

Supplementary Notes
Regularized Linear Regression

We can apply regularization to both linear regression and logistic regression. We will approach linear regression first.

Gradient Descent

We will modify our gradient descent function to separate out θ0 from the rest of the parameters because we do not want to penalize θ0.



Normal Equation

Now let's approach regularization using the alternate method of the non-iterative normal equation.

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:

  θ = (XᵀX + λ·L)⁻¹ Xᵀ y
L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including x0), multiplied with a single real number λ.

Recall that if m < n, then XᵀX is non-invertible. However, when we add the term λ⋅L, then XᵀX + λ⋅L becomes invertible.

Regularized Logistic Regression

The regularized cost function J(θ) for the logistic regression model is:

  J(θ) = −(1/m) Σi=1..m [ y(i)·log(hθ(x(i))) + (1 − y(i))·log(1 − hθ(x(i))) ] + (λ/2m) Σj=1..n θj²

Gradient Descent

The update rule has the same form as for regularized linear regression:

  θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) x0(i)
  θj := θj − α [ (1/m) Σi=1..m (hθ(x(i)) − y(i)) xj(i) + (λ/m) θj ]    (j = 1, 2, ..., n)

where hθ(x) = g(θᵀx).

Advanced Optimization

First, create a costFunction.m file and write the function code as shown below:
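
The original screenshot is not reproduced here; the following is only a rough Octave sketch of what such a costFunction.m could look like for regularized logistic regression. It assumes a sigmoid.m helper is available on the path (as in the course exercises), and the argument list (theta, X, y, lambda) is a choice made for this example:

    function [jVal, gradient] = costFunction(theta, X, y, lambda)
      % Regularized logistic regression: cost jVal and gradient vector.
      m = length(y);
      h = sigmoid(X * theta);                               % h_theta(x) = g(theta' * x) for every example
      jVal = (-1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
             + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
      gradient = (1 / m) * (X' * (h - y));                  % unregularized gradient
      gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);   % add penalty for j >= 1
    end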


Then, as explained earlier in the post 逻辑回归(二) (Logistic Regression, Part 2), call fminunc() in Octave; see that post for the detailed steps.
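
For reference, the call could look roughly like the sketch below (initial_theta, X, y and lambda are placeholders, and the exact options used in 逻辑回归(二) may differ):

    options = optimset('GradObj', 'on', 'MaxIter', 400);
    initial_theta = zeros(size(X, 2), 1);
    [theta, jVal] = fminunc(@(t) costFunction(t, X, y, lambda), initial_theta, options);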

Supplementary Notes
Regularized Logistic Regression

We can regularize logistic regression in a similar way that we regularize linear regression. As a result, we can avoid overfitting. The following image shows how the regularized function, displayed by the pink line, is less likely to overfit than the non-regularized function represented by the blue line:


Cost Function

Recall that our cost function for logistic regression was:

  J(θ) = −(1/m) Σi=1..m [ y(i)·log(hθ(x(i))) + (1 − y(i))·log(1 − hθ(x(i))) ]

We can regularize this equation by adding a term to the end:

  J(θ) = −(1/m) Σi=1..m [ y(i)·log(hθ(x(i))) + (1 − y(i))·log(1 − hθ(x(i))) ] + (λ/2m) Σj=1..n θj²

The second sum explicitly excludes the bias term θ0: the index j runs from 1 to n, so θ0 is never penalized.

