Linear Regression


The linear regression model is simple: it assumes that the output $h_\theta(x)$ depends linearly on every component of the input $x$. In mathematical terms,

$$h_\theta(x) = \theta^T x.$$

Here $\theta$ is the parameter vector and $x$ is the input vector. In general, $\theta$ and $x$ are column vectors of the same dimension, while $h_\theta(x)$ is a scalar:
$$x = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix}, \qquad \theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}.$$

We use $n$ to denote the number of features, even though $\theta$ and $x$ are both $(n+1)$-dimensional. That is because $x_0$ is fixed to $1$ and represents the "constant term" (the intercept) of the linear model. In certain settings $h_\theta(x)$ is itself a vector; then $\theta$ no longer has the same dimension as $x$ but becomes a matrix, usually written with the capital Greek letter $\Theta$.
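To make these conventions concrete, here is a minimal NumPy sketch (the helper names `add_intercept` and `hypothesis` are mine, not from the post) that prepends the constant feature $x_0 = 1$ and evaluates $h_\theta(x) = \theta^T x$:

```python
import numpy as np

# Minimal sketch: prepend the constant feature x_0 = 1 and evaluate
# h_theta(x) = theta^T x.  Helper names are illustrative, not from the post.
def add_intercept(x_raw):
    """Prepend x_0 = 1 to a raw feature vector of length n."""
    return np.concatenate(([1.0], x_raw))

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, with theta and x both (n+1)-dimensional."""
    return theta @ x

x = add_intercept(np.array([2.0, 3.0]))   # n = 2 features -> 3-dimensional x
theta = np.array([0.5, 1.0, -2.0])        # (n+1)-dimensional parameter vector
print(hypothesis(theta, x))               # 0.5*1 + 1.0*2 - 2.0*3 = -3.5
```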

To use the linear regression model we need a value for $\theta$, so let us look at how to estimate it. Suppose we have a set of existing data from which to estimate $\theta$. We use $m$ to denote the number of data points, $x^{(i)}$ the "input" of the $i$-th data point, and $y^{(i)}$ the corresponding "output". The cost function is defined as

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2.$$

This cost function measures the average error between the model's prediction $\theta^T x$ and the target output $y$. Naturally, we want to pick the $\theta$ that makes the cost $J(\theta)$ as small as possible, i.e.
$$\min_\theta J(\theta).$$
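As a sanity check, the cost can be computed exactly as the sum above. A small sketch with an explicit loop (names and data are illustrative; `X` holds the $x^{(i)}$ as rows, `y` the outputs):

```python
import numpy as np

# Sketch of J(theta) = (1/2m) * sum_i (y^(i) - theta^T x^(i))^2 with an
# explicit loop mirroring the summation; names and data are illustrative.
def cost_loop(theta, X, y):
    m = len(y)
    total = 0.0
    for i in range(m):
        residual = y[i] - theta @ X[i]    # y^(i) - theta^T x^(i)
        total += residual ** 2
    return total / (2 * m)

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                # rows are x^(i), with x_0 = 1
y = np.array([1.0, 2.0, 5.0])
print(cost_loop(np.array([1.0, 2.0]), X, y))   # residuals 0, -1, 0 -> 1/6
```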

If we write the training set in matrix/vector form, the summation sign can be dropped from the cost function. We use $X$ and $y$ to denote the inputs and outputs of the training data, where

$$X = \begin{pmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{pmatrix} = \begin{pmatrix} x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_n \\ x^{(2)}_0 & x^{(2)}_1 & \cdots & x^{(2)}_n \\ \vdots & \vdots & \ddots & \vdots \\ x^{(m)}_0 & x^{(m)}_1 & \cdots & x^{(m)}_n \end{pmatrix}, \qquad y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}.$$

Then the cost function can be written as
$$J(\theta) = \frac{1}{2m} \left\| y - X\theta \right\|^2.$$
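This vectorized form maps directly to code. A small self-contained sketch (data and names are made up for illustration):

```python
import numpy as np

# Vectorized sketch of the same cost, J(theta) = (1/2m) * ||y - X theta||^2,
# where X stacks the row vectors (x^(i))^T.  Names and data are illustrative.
def cost_vectorized(theta, X, y):
    m = len(y)
    r = y - X @ theta          # residual vector y - X theta
    return (r @ r) / (2 * m)   # squared Euclidean norm divided by 2m

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(cost_vectorized(np.array([1.0, 2.0]), X, y))   # exact fit -> 0.0
```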

Below we present a method called the normal equation for determining the optimal parameter vector $\theta$. Since the following part is adapted from Wikipedia, please bear with it. The cost function there is defined slightly differently: Wikipedia works with $S = \|y - X\theta\|^2 = 2mJ$, which has no effect on the resulting $\theta$.

For another equally common numerical method, see my other post, Gradient Descent Algorithm.

Derivation of the normal equations

Common method

Define the $i$-th residual to be

$$r^{(i)} = y^{(i)} - \sum_{j=0}^{n} x^{(i)}_j \theta_j.$$

Then $S$ can be rewritten as

$$S = \sum_{i=1}^{m} \left( r^{(i)} \right)^2.$$

$S$ is minimized when its gradient vector is zero. (This follows by definition: if the gradient vector is not zero, there is a direction in which we can move to decrease $S$ further; see maxima and minima.) The elements of the gradient vector are the partial derivatives of $S$ with respect to the parameters:

$$\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} r^{(i)} \frac{\partial r^{(i)}}{\partial \theta_j} \qquad (j = 0, 1, 2, \ldots, n).$$

The derivatives are

$$\frac{\partial r^{(i)}}{\partial \theta_j} = -x^{(i)}_j.$$

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

$$\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} \left( y^{(i)} - \sum_{k=0}^{n} x^{(i)}_k \theta_k \right) \left( -x^{(i)}_j \right) \qquad (j = 0, 1, 2, \ldots, n).$$

Thus if $\hat\theta$ minimizes $S$, we have

$$2 \sum_{i=1}^{m} \left( y^{(i)} - \sum_{k=0}^{n} x^{(i)}_k \hat\theta_k \right) \left( -x^{(i)}_j \right) = 0 \qquad (j = 0, 1, 2, \ldots, n).$$

Upon rearrangement, we obtain the normal equations:

$$\sum_{i=1}^{m} \sum_{k=0}^{n} x^{(i)}_j x^{(i)}_k \hat\theta_k = \sum_{i=1}^{m} x^{(i)}_j y^{(i)} \qquad (j = 0, 1, 2, \ldots, n).$$

The normal equations are written in matrix notation as

$$(X^T X)\hat\theta = X^T y,$$

where $X^T$ is the matrix transpose of $X$. The solution of the normal equations yields the vector $\hat\theta$ of the optimal parameter values.
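In code, the normal equations can be solved directly. A short sketch on synthetic, purely illustrative data (`np.linalg.solve` is used rather than explicitly inverting $X^T X$, which is generally the safer numerical choice):

```python
import numpy as np

# Sketch of estimating theta from the normal equations (X^T X) theta_hat = X^T y.
# np.linalg.solve is used instead of forming the inverse of X^T X explicitly;
# the data below are synthetic and purely illustrative.
def fit_normal_equation(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 10.0, size=50)
X = np.column_stack([np.ones_like(x1), x1])          # first column is x_0 = 1
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.1, size=50)  # noisy y = 1 + 2*x
print(fit_normal_equation(X, y))                     # close to [1.0, 2.0]
```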

Derivation directly in terms of matrices

The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

$$S(\theta) = \left\| y - X\theta \right\|^2 = (y - X\theta)^T (y - X\theta) = y^T y - \theta^T X^T y - y^T X \theta + \theta^T X^T X \theta.$$

Note that $(\theta^T X^T y)^T = y^T X \theta$ has dimension $1 \times 1$ (the number of columns of $y$), so it is a scalar and equal to its own transpose; hence $\theta^T X^T y = y^T X \theta$, and the quantity to minimize becomes

$$S(\theta) = y^T y - 2\theta^T X^T y + \theta^T X^T X \theta.$$

Differentiating this with respect to $\theta$ and equating to zero to satisfy the first-order conditions gives

$$-X^T y + (X^T X)\theta = 0,$$

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that $X$ have full column rank, in which case $X^T X$ is positive definite.
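A quick numeric illustration of both conditions on synthetic data: at the fitted $\hat\theta$ the gradient $-X^T y + (X^T X)\hat\theta$ vanishes, and when $X$ has full column rank, $X^T X$ has strictly positive eigenvalues. The data and names below are mine, for illustration only.

```python
import numpy as np

# Illustrative check of the conditions above on synthetic data: the gradient
# -X^T y + (X^T X) theta_hat vanishes at the solution, and when X has full
# column rank, X^T X is positive definite (all eigenvalues strictly positive).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.normal(size=20)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gradient = -X.T @ y + (X.T @ X) @ theta_hat
print(np.allclose(gradient, 0.0))                  # True up to rounding error
print(np.linalg.matrix_rank(X) == X.shape[1])      # X has full column rank
print(np.all(np.linalg.eigvalsh(X.T @ X) > 0))     # X^T X is positive definite
```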

Derivation without calculus

When $X^T X$ is positive definite, the formula for the minimizing value of $\theta$ can be derived without the use of derivatives. The quantity

$$S(\theta) = y^T y - 2\theta^T X^T y + \theta^T X^T X \theta$$

can be written as
$$\langle \theta, \theta \rangle - 2\left\langle \theta,\, (X^T X)^{-1} X^T y \right\rangle + \left\langle (X^T X)^{-1} X^T y,\, (X^T X)^{-1} X^T y \right\rangle + C,$$

where $C$ depends only on $y$ and $X$, and $\langle \cdot, \cdot \rangle$ is the inner product defined by

$$\langle u, v \rangle = u^T (X^T X) v.$$

It follows that $S(\theta)$ is equal to
$$\left\langle \theta - (X^T X)^{-1} X^T y,\; \theta - (X^T X)^{-1} X^T y \right\rangle + C$$

and therefore minimized exactly when
$$\theta - (X^T X)^{-1} X^T y = 0.$$
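A small numeric check of this conclusion on made-up data: perturbing $\hat\theta = (X^T X)^{-1} X^T y$ in random directions never decreases $S$.

```python
import numpy as np

# Numeric illustration on synthetic data: S(theta) is minimized exactly at
# theta_hat = (X^T X)^{-1} X^T y, so random perturbations never decrease S.
def S(theta, X, y):
    r = y - X @ theta
    return r @ r

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y = rng.normal(size=30)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

s_min = S(theta_hat, X, y)
perturbed = [S(theta_hat + rng.normal(scale=0.1, size=theta_hat.shape), X, y)
             for _ in range(100)]
print(all(s >= s_min for s in perturbed))          # True
```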
