Machine Learning - IV. Linear Regression with Multiple Variables (Week 2)


http://blog.csdn.net/pipisorry/article/details/43529845

Machine Learning - Andrew Ng course study notes

Multivariate linear regression

(linear regression that works with multiple variables, i.e. multiple features)

Multiple Features (Variables)

{x with superscript i denotes the i-th training example; x with subscript i denotes the i-th value (feature) within a particular training example}


The representation of the hypothesis for linear regression with multiple features (variables)

{Note: We store each example as a row in the X matrix in Octave.

To take into account the intercept term θ0, we add an additional first column to X and set it to all ones. This allows us to treat θ0 as simply another 'feature'.}


The additional zeroth feature x0 (added for notational convenience)

For every example i, we have a feature vector x^(i), and x^(i)_0 is equal to 1.
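With this convention the hypothesis can be written h_θ(x) = θ_0·x_0 + θ_1·x_1 + ... + θ_n·x_n = θᵀx. A minimal vectorized sketch in Octave (assuming X is the m×(n+1) design matrix that already contains the column of ones, and theta is an (n+1)×1 column vector):

h = X * theta;   % m x 1 vector of predictions, one per training example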



Gradient Descent for Multiple Variables

Model representation

Solve for the parameters θ by minimizing the cost function with the gradient descent algorithm.

{On the left is the gradient descent algorithm for solving the parameters of single-variable linear regression;

on the right is the algorithm for multivariate linear regression.}
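(The figures themselves are not reproduced here; for reference, the multivariate update rule from the course is the following, repeated until convergence and applied simultaneously for j = 0, 1, ..., n:)

θ_j := θ_j − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)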

Vectorized code for computing the value of J(theta) after each iteration:

J = ( (X * theta - y)' *  (X * theta - y)) / (2 * m);

%J = sum( (X * theta - y).^2 ) / (2 * m);

Vectorized code for updating theta in each iteration:

theta -= alpha / m * (X' * (X * theta - y));




Gradient Descent in Practice I - Feature Scaling

{Feature scaling: getting the features to be on similar ranges of values (similar scales) to each other.

Why it matters for making gradient descent work well: it makes gradient descent run much faster and converge in far fewer iterations.}

Why:

If you make sure that the features are on a similar scale, then gradient descent can converge more quickly (right figure below).
Without feature scaling:
Gradient descent may oscillate back and forth and take a long time before it finally finds its way to the global minimum (left figure below).

How to do feature scaling?

1. Divide by the maximum value (max) or by the range (max - min)

When performing feature scaling, we sometimes simply divide each feature by its maximum value. {Dividing every feature by its maximum value is enough to bring the values into a range similar to [-1, 1].}

If you end up having a different feature that winds up being between -2 and +0.5, this is close enough to minus one and plus one, and that's fine. {x1, x2, x3 do not all have to lie exactly in [-1, 1]; being reasonably close is enough.}

If you have a different feature, say x3 ranging over [-100, +100], or x4 taking values in [-0.0001, +0.0001], these are very different from minus 1 and plus 1, so they would be poorly scaled features. {The ranges should not differ too much across features.}
A good rule of thumb for feature ranges: if a feature takes on values in roughly [-3, 3], that should be just fine.
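A minimal Octave sketch of scaling by the range (illustrative only; it assumes X here holds the raw feature columns without the x0 column of ones):

ranges = max(X) - min(X);                 % 1 x n row vector of per-feature ranges (max - min)
X_scaled = bsxfun(@rdivide, X, ranges);   % divide every column of X by its own range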

2. mean normalization

That is, x_i := (x_i − μ_i) / s_i, where μ_i is the mean of feature i over the training set and s_i is either the range (max − min) or the standard deviation.


{Note: x1 or x2 can actually end up slightly larger than 0.5, but that is close enough; any value that gets the features into anything close to these sorts of ranges will do fine.

S1 is the range, or the standard deviation. The standard deviation is a way of measuring how much variation there is in the values of a particular feature (most data points will lie within 2 standard deviations of the mean); it is an alternative to taking the range of values.

The extra column of 1's corresponding to x0 = 1 has not yet been added to X when scaling; a column that is already all ones does not need to be normalized.

         When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation. Given a new x value, we must first normalize x using the mean and standard deviation that we had previously computed from the training set.

Code1:

mu = mean(X);
sigma = std(X);
X_norm = X;                        % initialize the output matrix (same size as X)
for line = 1:size(X,1)
    X_norm(line, :) = (X(line, :) - mu) ./ sigma;   % subtract the mean, divide by the standard deviation
end

Code2:

%for i=1:size(X,2)
%    mu(1,i) = mean(X(:,i));
%    sigma(1,i) = std(X(:,i));
%    X_norm(:,i) = (X(:,i)-mu(1,i))/sigma(1,i);
%end

}
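A fully vectorized alternative to the loops above (a sketch, not the course's reference code; x_new is an assumed 1×n row vector representing a new, unseen example):

mu = mean(X);                                              % 1 x n vector of feature means
sigma = std(X);                                            % 1 x n vector of feature standard deviations
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);   % normalize all training examples at once
x_new_norm = (x_new - mu) ./ sigma;                        % normalize a new example with the SAME mu and sigma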

In summary: feature scaling doesn't have to be too exact; it only needs to be good enough to get gradient descent to run quite a lot faster.



Gradient Descent in Practice II - Learning Rate α

How to make sure gradient descent is working correctly, i.e. how to tell whether it has converged

1. Plotting (left figure): usually by plotting J(θ) against the number of iterations and looking at these plots, we can tell whether gradient descent has converged.

for iter = 1:num_iters
    theta -= alpha / m * (X' * (X * theta - y));   % vectorized gradient descent update
    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);
end

figure;

plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
%plot(1:num_iters, J_history);

2. Automatic convergence test (right figure)


But choosing this threshold is usually pretty difficult, so to check whether gradient descent has converged, we tend to look at plots like the figure on the left.
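As a rough sketch of such an automatic test (the 1e-3 threshold is only an illustrative value), one could add the following inside the iteration loop shown above, right after J_history(iter) is computed:

if iter > 1 && abs(J_history(iter-1) - J_history(iter)) < 1e-3
    break;   % declare convergence: J decreased by less than the threshold in one iteration
end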

How to choose the learning rate α?
For choosing alpha, see [Linear Regression with One Variable - the effect of the setting of alpha on the cost function].
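A minimal sketch of comparing several learning rates by plotting their J(θ) curves (the candidate values follow the course's suggestion of steps of roughly 3x; gradientDescentMulti is an assumed helper that runs the update loop shown earlier and returns the cost history):

alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];   % candidate learning rates, roughly 3x apart
figure; hold on;
for k = 1:numel(alphas)
    theta = zeros(size(X, 2), 1);             % restart from theta = 0 for every alpha
    [theta, J_history] = gradientDescentMulti(X, y, theta, alphas(k), 50);
    plot(1:50, J_history);                    % a good alpha makes J decrease quickly and monotonically
end
hold off;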


Features and Polynomial Regression

{About the choice of features, and how different choices give you different learning algorithms}

Polynomial regression allows you to use the machinery of linear regression to fit very complicated, even very non-linear, functions.

Defining new features: sometimes, by defining new features, you might actually get a better model. Closely related to the idea of choosing your features is the idea called polynomial regression.

{Here we define a new feature, land area, as the product of frontage (x1) and depth (x2), converting the multivariate model into a single-variable model and obtaining a better model.}


Polynomial regression modeling: re-model house price prediction, switching from a linear model to a polynomial model.


1. Quadratic model: you may think the price is a quadratic function (blue line).
But then you may decide that the quadratic model doesn't make sense, because eventually the function comes back down, and we don't think housing prices should go down when the size gets too large.

2. Cubic model: use a cubic function (green line), where we now have a third-order term.

Note: if you choose your features like x = size^n, then feature scaling becomes increasingly important (see the blue annotation in the figure).

Square-root model (fuchsia line): rather than going to a cubic model, you have, maybe, other choices of features, and there are many possible choices.

{How to choose features in practice is covered in later lectures; a small sketch follows below.}
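A minimal Octave sketch of building polynomial features from a single size variable and then scaling them (illustrative only; sz is an assumed m×1 column vector of house sizes):

X_poly = [sz, sz.^2, sz.^3];                    % new features: size, size^2, size^3
mu = mean(X_poly);  sigma = std(X_poly);
X_poly = bsxfun(@rdivide, bsxfun(@minus, X_poly, mu), sigma);   % feature scaling matters a lot for powers of size
X_poly = [ones(size(X_poly, 1), 1), X_poly];    % finally add the x0 = 1 column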


Normal Equation

{For some linear regression problems, the normal equation gives us a much better way to solve for the optimal value of the parameters θ.

It is a method to solve for θ analytically, rather than needing to run an iterative algorithm such as gradient descent.

In other words, besides gradient descent, it is another way to solve for the parameters θ.}

1. Set the partial derivatives to 0 and solve the resulting system of equations for θ:

2. Matrix solution

First construct the design matrix X.

Example:

Compute the value of θ via the formula θ = (XᵀX)⁻¹Xᵀy (this can be evaluated with Octave's built-in functions).


{Note:

1. Feature scaling is not needed when using this method.

2. XᵀX (a real symmetric matrix, i.e. the real special case of a Hermitian matrix) is only guaranteed to be square, so inversion can be attempted, but it is not guaranteed to be invertible.

3. Derivation of the formula theta = pinv(X' * X) * X' * y (set the partial derivatives of J(θ) to 0, which yields the normal equations XᵀXθ = Xᵀy).

}
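A minimal end-to-end sketch of the normal equation in Octave (X_raw and y are assumed names for the raw m×n feature matrix and the m×1 target vector):

m = size(X_raw, 1);
X = [ones(m, 1), X_raw];            % add the x0 = 1 column; no feature scaling is needed here
theta = pinv(X' * X) * X' * y;      % closed-form solution: theta = (X'X)^-1 * X'y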

Advantages and disadvantages of Gradient descent & normal equation

Note: for most implementations, the cost of inverting the matrix grows roughly as the cube of its dimension, i.e. O(n³).

So under what circumstances should each of the two methods be used?
If n is large, you would usually use gradient descent; moreover, for more complex learning algorithms, e.g. classification algorithms such as logistic regression, the normal equation method does not work at all, and we have to resort to gradient descent for those algorithms.

But if n is relatively small, then the normal equation might give you a better way to solve for the parameters.

What do small and large mean?
It is usually around n = 10,000 that I might start to consider switching over to gradient descent, or maybe some other algorithms that we will talk about later.


Normal Equation Noninvertibility (Optional)

{what if X transpose X is non-invertible}

Octave has two functions for inverting matrices:
One is called pinv() and the other is called inv(); one computes the pseudo-inverse, the other the inverse. As long as you use the pinv() function, it will compute the value of θ that you want, even if XᵀX is non-invertible.

If XᵀX is non-invertible, there are usually two most common causes: (the causes of XᵀX being singular, and their remedies {proof omitted})

Note:

1. Two different features should not be linearly related to each other; otherwise one of them is a redundant feature.

2. Regularization, which will let you fit a lot of parameters using a lot of features even if you have a relatively small training set.

How to address the non-invertibility problem: the recommended first step is to look at your features and check whether any are redundant, such as x1 and x2 being linearly dependent (one a linear function of the other); then check whether you have too many features, and delete some if you can bear to use fewer features; otherwise, consider using regularization.
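A tiny contrived illustration of why pinv() is the safer choice when features are redundant (the second column is deliberately twice the first, so the XᵀX analogue is singular; all names are illustrative):

A = [1 2; 2 4; 3 6];                % two linearly dependent feature columns (x2 = 2 * x1)
y_toy = [1; 2; 3];
G = A' * A;                         % analogue of X'X; singular, so inv(G) is unreliable
theta = pinv(G) * A' * y_toy;       % pinv still returns a (minimum-norm least-squares) solution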



Review

Question 1 :

Question 2 :


from:http://blog.csdn.net/pipisorry/article/details/43529845

ref: multivariate regression — Chapter 4, "Multiple Regression", of the data analysis classic "Data Analysis for Politics and Policy"

