Advice for Applying Machine Learning (Part 2)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 20:05
Distinguishing Bias from Variance

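High bias and high variance are, in essence, the underfitting and overfitting problems of a learning model.

To tell which of the two a model suffers from, we usually plot a chart like the following: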


High bias (underfitting)

  • Jtrain(Θ) is large
  • JCV(Θ) ≈ Jtrain(Θ)

High variance (overfitting)

  • Jtrain(Θ) is small
  • JCV(Θ) >> Jtrain(Θ)
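
The two patterns above can be turned into a rough rule of thumb. The sketch below is only an illustration: the function name `diagnose` and the thresholds `acceptable_error` and `gap_ratio` are assumptions introduced here, since the text only says "large", "small", "≈" and ">>" without giving numbers.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=2.0):
    """Label a model from its training and cross-validation errors.

    acceptable_error and gap_ratio are hypothetical, user-chosen thresholds;
    they are not part of the original notes.
    """
    train_is_high = j_train > acceptable_error
    # ">>" is read here as: J_cv exceeds gap_ratio * J_train and is itself too high.
    gap_is_large = j_cv > gap_ratio * j_train and j_cv > acceptable_error
    if train_is_high and not gap_is_large:
        return "high bias (underfitting)"
    if gap_is_large and not train_is_high:
        return "high variance (overfitting)"
    if train_is_high and gap_is_large:
        return "high bias and high variance"
    return "no obvious problem"


print(diagnose(0.90, 1.00, acceptable_error=0.2))  # high bias (underfitting)
print(diagnose(0.05, 0.80, acceptable_error=0.2))  # high variance (overfitting)
```

What counts as an acceptable error depends on the task, so the thresholds should be set from domain knowledge rather than taken from this sketch.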
Supplementary notes
Diagnosing Bias vs. Variance

In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.

  • We need to distinguish whether bias or variance is the problem contributing to bad predictions.
  • High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both Jtrain(Θ) and JCV(Θ) will be high. Also, JCV(Θ)≈Jtrain(Θ).

High variance (overfitting): Jtrain(Θ) will be low and JCV(Θ) will be much greater than Jtrain(Θ).

This is summarized in the figure below:
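
The relationship between the degree d and the two errors can also be checked numerically. The following is a minimal sketch, assuming synthetic one-dimensional data drawn from a hypothetical target sin(3x) with noise and plain least-squares polynomial fits; the helper names `poly_features` and `squared_error` are introduced here for illustration only.

```python
import numpy as np


def poly_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def squared_error(theta, X, y):
    """J = 1/(2m) * sum((X @ theta - y)^2), with no regularization term."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


# Synthetic data (an assumption for this sketch, not from the original notes).
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30)
x_cv = rng.uniform(-1, 1, 30)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 30)
y_cv = np.sin(3 * x_cv) + rng.normal(0, 0.2, 30)

degrees = list(range(1, 11))
j_train, j_cv = [], []
for d in degrees:
    # Fit on the training set only, then score both sets with the same theta.
    theta, *_ = np.linalg.lstsq(poly_features(x_train, d), y_train, rcond=None)
    j_train.append(squared_error(theta, poly_features(x_train, d), y_train))
    j_cv.append(squared_error(theta, poly_features(x_cv, d), y_cv))
    print(f"d={d:2d}  Jtrain={j_train[-1]:.4f}  JCV={j_cv[-1]:.4f}")

best_degree = degrees[int(np.argmin(j_cv))]
print("degree with the lowest JCV:", best_degree)
```

Because each degree-d model contains the lower-degree models as special cases, the least-squares training error cannot increase with d; the cross-validation error has no such guarantee, which is why it is the one used to pick the degree.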


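Regularization and Bias/Variance

When training a model, we usually use regularization to avoid overfitting. The regularization parameter λ, however, has to be chosen with care.

Previously, we only considered the choice of λ in the single-variable case. Now we consider how to choose λ for a polynomial model.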


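For example, suppose we apply regularization to some polynomial model and try λ = 0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10. We want to find the best value of λ.

First, we split the data set into three parts: a training set, a cross-validation set, and a test set.

Then, for each of these values of λ, we fit the model and compute Jtrain(θ) and JCV(θ).

Finally, we take the value of λ for which JCV(θ) is smallest and compute Jtest(θ) for it on the test set.


In the figure, suppose JCV(θ) is smallest at λ = 0.08.

To make this easier to understand, and to make the best value of λ easier to find, we can plot the following figure: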


Supplementary notes
Regularization and Bias/Variance

In the figure above, we see that as λ increases, our fit becomes more rigid. On the other hand, as λ approaches 0, we tend to overfit the data. So how do we choose our parameter λ to get it 'just right'? In order to choose the model and the regularization term λ, we need to:

  1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
  2. Create a set of models with different degrees or any other variants.
  3. Iterate through the λs and for each λ go through all the models to learn some Θ.
  4. Compute the cross validation error JCV(Θ) using the learned Θ (which was computed with λ), but evaluate JCV(Θ) itself without the regularization term, i.e. with λ = 0.
  5. Select the best combo that produces the lowest error on the cross validation set.
  6. Using the best combo of Θ and λ, compute Jtest(Θ) to see whether the model generalizes well.
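
The six steps above can be sketched as follows. This is an illustration under stated assumptions, not the course's reference code: the data is synthetic (a hypothetical target sin(3x) plus noise), the models are polynomials of a few assumed degrees, and the regularized fit uses the closed-form normal equation with the intercept left unregularized. The names `fit_ridge`, `poly_features` and `squared_error` are introduced here.

```python
import numpy as np


def poly_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def squared_error(theta, X, y):
    """Unregularized cost: 1/(2m) * sum((X @ theta - y)^2)."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


def fit_ridge(X, y, lam):
    """Minimize 1/(2m)*||X theta - y||^2 + lam/(2m)*sum(theta_j^2, j >= 1)."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # the intercept theta_0 is not regularized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


# Synthetic data split 60/20/20 into training, cross-validation and test sets.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(0, 0.2, 100)
x_tr, y_tr = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_te, y_te = x[80:], y[80:]

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
degrees = [2, 4, 8]  # step 2: a small set of candidate models (assumed)

results = {}
for lam in lambdas:                      # step 3: iterate over the lambdas
    for degree in degrees:               # ... and over all models
        theta = fit_ridge(poly_features(x_tr, degree), y_tr, lam)
        # Step 4: cross-validation error WITHOUT the regularization term.
        j_cv = squared_error(theta, poly_features(x_cv, degree), y_cv)
        results[(degree, lam)] = (theta, j_cv)

# Step 5: pick the combination with the lowest cross-validation error.
best_degree, best_lam = min(results, key=lambda key: results[key][1])
best_theta, best_j_cv = results[(best_degree, best_lam)]

# Step 6: report the test error once, for the selected combination only.
j_test = squared_error(best_theta, poly_features(x_te, best_degree), y_te)
print(f"best degree={best_degree}, best lambda={best_lam}")
print(f"JCV={best_j_cv:.4f}  Jtest={j_test:.4f}")
```

The test set is touched only once, after the choice is made, so Jtest(Θ) is not biased by the selection of the degree and λ.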
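Learning Curves

Plotting learning curves helps us check whether a learning algorithm is working properly. A learning curve plots the training-set error and the cross-validation error as functions of the number of training examples m.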


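In the figure above, the hypothesis is hθ(x) = θ0 + θ1·x + θ2·x^2, and regularization is not considered. When m = 1, hθ(x) fits the training set perfectly, so Jtrain(θ) = 0, but it generalizes poorly to the cross-validation set, so JCV(θ) is large. When m = 2, hθ(x) still fits the training set well and Jtrain(θ) rises slightly; it still generalizes poorly, and JCV(θ) drops slightly compared with before. And so on. When m is large enough, Jtrain(θ) rises to some value and then levels off, JCV(θ) falls to some value and then levels off, and the two values end up very close to each other.

Therefore, when the learning algorithm has high bias, adding more training examples will not, by itself, help much.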


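In the figure above, the hypothesis is hθ(x) = θ0 + θ1·x + θ2·x^2 + ... + θ100·x^100, regularized with a very small λ. When m = 5, hθ(x) fits the training set well, so Jtrain(θ) is small, but it generalizes poorly, so JCV(θ) is large. When m = 12, hθ(x) still fits the training set well, but Jtrain(θ) rises slightly and JCV(θ) drops slightly. And so on. As m grows large, Jtrain(θ) keeps rising gradually and JCV(θ) keeps falling gradually.

Therefore, when the learning algorithm has high variance, adding more training examples is likely to help.

Note: as m grows large, Jtrain(θ) gradually rises and JCV(θ) gradually falls; whether the two curves eventually meet is not made clear in the video.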

Supplementary notes
Learning Curves

Training an algorithm on a very small number of data points (such as 1, 2 or 3) will easily give 0 error, because we can always find a quadratic curve that passes exactly through that many points. Hence:

  • As the training set gets larger, the error for a quadratic function increases.
  • The error value will plateau out after a certain m, or training set size.

Experiencing high bias:

Low training set size: causes Jtrain(Θ) to be low and JCV(Θ) to be high.

Large training set size: causes both Jtrain(Θ) and JCV(Θ) to be high with Jtrain(Θ)≈JCV(Θ).

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.


Experiencing high variance:

Low training set size: Jtrain(Θ) will be low and JCV(Θ) will be high.

Large training set size: Jtrain(Θ) increases with training set size and JCV(Θ) continues to decrease without leveling off. Also, Jtrain(Θ) < JCV(Θ) but the difference between them remains significant.

If a learning algorithm is suffering from high variance, getting more training data is likely to help.
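
Learning curves like the ones described above can be computed by refitting the model on the first m training examples for increasing m. The sketch below is an illustration under assumptions: the data is synthetic (a hypothetical target sin(3x) plus noise), the fits are unregularized least squares rather than the small-λ regularized fit mentioned earlier, and degrees 1 and 10 stand in for the high-bias and high-variance cases. The function name `learning_curve` is introduced here.

```python
import numpy as np


def poly_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def squared_error(theta, X, y):
    """Unregularized cost: 1/(2m) * sum((X @ theta - y)^2)."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


def learning_curve(x_train, y_train, x_cv, y_cv, degree, sizes):
    """Return (m, Jtrain, JCV) for each training-set size m in sizes."""
    curve = []
    for m in sizes:
        X_m = poly_features(x_train[:m], degree)
        theta, *_ = np.linalg.lstsq(X_m, y_train[:m], rcond=None)
        # Jtrain uses only the m examples trained on; JCV uses the full CV set.
        j_tr = squared_error(theta, X_m, y_train[:m])
        j_cv = squared_error(theta, poly_features(x_cv, degree), y_cv)
        curve.append((m, j_tr, j_cv))
    return curve


# Synthetic data (an assumption for this sketch).
rng = np.random.default_rng(2)
x_all = rng.uniform(-1, 1, 120)
y_all = np.sin(3 * x_all) + rng.normal(0, 0.2, 120)
x_train, y_train = x_all[:80], y_all[:80]
x_cv, y_cv = x_all[80:], y_all[80:]

sizes = [1, 2, 3, 5, 10, 20, 40, 80]
quadratic_curve = learning_curve(x_train, y_train, x_cv, y_cv, 2, sizes)
high_bias_curve = learning_curve(x_train, y_train, x_cv, y_cv, 1, sizes)
high_variance_curve = learning_curve(x_train, y_train, x_cv, y_cv, 10, sizes)

for name, curve in [("degree 1", high_bias_curve), ("degree 10", high_variance_curve)]:
    print(name)
    for m, j_tr, j_cv in curve:
        print(f"  m={m:3d}  Jtrain={j_tr:.4f}  JCV={j_cv:.4f}")
```

Plotting Jtrain and JCV against m for each curve gives the learning-curve figures; comparing the size of the gap between the two errors at large m is what separates the high-bias case from the high-variance case.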


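Deciding What to Do Next

At the start of the article Advice for Applying Machine Learning (Part 1), we listed the following possible remedies for predictions with high error:

  • Get more training examples
  • Try a smaller set of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing the regularization parameter λ
  • Try increasing the regularization parameter λ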

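Having examined each of these remedies, we reached the following conclusions:

  • Get more training examples: suited to high variance (overfitting)
  • Try a smaller set of features: suited to high variance (overfitting)
  • Try getting additional features: suited to high bias (underfitting)
  • Try adding polynomial features: suited to high bias (underfitting)
  • Try decreasing the regularization parameter λ: suited to high bias (underfitting)
  • Try increasing the regularization parameter λ: suited to high variance (overfitting)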

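For neural networks, a "small" model is prone to high bias (underfitting), but it is computationally cheap. A "large" model (one with many hidden units or several hidden layers) is prone to high variance (overfitting) and is computationally more expensive. In general, though, the "larger" a regularized neural network is, the better it performs.


We usually choose a neural network with a single hidden layer. In some cases, however, a single hidden layer is not the best choice. We can therefore split the data set into training, cross-validation, and test sets, train networks with different numbers of hidden layers, and pick the one with the smallest JCV(Θ).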

Supplementary notes
Deciding What to Do Next Revisited

Our decision process can be broken down as follows:

  • Getting more training examples: Fixes high variance
  • Trying smaller sets of features: Fixes high variance
  • Adding features: Fixes high bias
  • Adding polynomial features: Fixes high bias
  • Decreasing λ: Fixes high bias
  • Increasing λ: Fixes high variance.

Diagnosing Neural Networks

  • A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
  • A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train networks with different numbers of hidden layers, evaluate each on your cross validation set, and then select the one that performs best.

Model Complexity Effects:

  • Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
  • Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
  • In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.