Linear Regression


The linear regression model is simple: it assumes that the output $h_\theta(x)$ depends linearly on every component of the input $x$. In mathematical terms,

$$h_\theta(x) = \theta^T x.$$

Here $\theta$ is the parameter vector and $x$ is the input vector. In general, $\theta$ and $x$ are column vectors of the same dimension, while $h_\theta(x)$ is a scalar:
$$x = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix}, \qquad \theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}.$$

We use $n$ to denote the number of features, even though $\theta$ and $x$ are both $(n+1)$-dimensional. That is because $x_0$ is fixed to $1$ and represents the "constant term" (the intercept) of the linear model. In certain settings $h_\theta(x)$ is itself a vector; then $\theta$ no longer has the same dimension as $x$ but becomes a matrix, usually written with the capital Greek letter $\Theta$.
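To make these conventions concrete, here is a minimal NumPy sketch (the helper names `add_intercept` and `hypothesis` are mine, not from the post) that prepends the constant feature $x_0 = 1$ and evaluates $h_\theta(x) = \theta^T x$:

```python
import numpy as np

# Minimal sketch: prepend the constant feature x_0 = 1 and evaluate
# h_theta(x) = theta^T x.  Helper names are illustrative, not from the post.
def add_intercept(x_raw):
    """Prepend x_0 = 1 to a raw feature vector of length n."""
    return np.concatenate(([1.0], x_raw))

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, with theta and x both (n+1)-dimensional."""
    return theta @ x

x = add_intercept(np.array([2.0, 3.0]))   # n = 2 features -> 3-dimensional x
theta = np.array([0.5, 1.0, -2.0])        # (n+1)-dimensional parameter vector
print(hypothesis(theta, x))               # 0.5*1 + 1.0*2 - 2.0*3 = -3.5
```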

To use the linear regression model we need a value for $\theta$, so let us look at how to estimate it. Suppose we have a set of existing data from which to estimate $\theta$. We use $m$ to denote the number of data points, $x^{(i)}$ the "input" of the $i$-th data point, and $y^{(i)}$ the corresponding "output". The cost function is defined as

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2.$$

This cost function measures the average error between the model's prediction $\theta^T x$ and the target output $y$. Naturally, we want to pick the $\theta$ that makes the cost $J(\theta)$ as small as possible, i.e.
$$\min_\theta J(\theta).$$
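As a sanity check, the cost can be computed exactly as the sum above. A small sketch with an explicit loop (names and data are illustrative; `X` holds the $x^{(i)}$ as rows, `y` the outputs):

```python
import numpy as np

# Sketch of J(theta) = (1/2m) * sum_i (y^(i) - theta^T x^(i))^2 with an
# explicit loop mirroring the summation; names and data are illustrative.
def cost_loop(theta, X, y):
    m = len(y)
    total = 0.0
    for i in range(m):
        residual = y[i] - theta @ X[i]    # y^(i) - theta^T x^(i)
        total += residual ** 2
    return total / (2 * m)

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                # rows are x^(i), with x_0 = 1
y = np.array([1.0, 2.0, 5.0])
print(cost_loop(np.array([1.0, 2.0]), X, y))   # residuals 0, -1, 0 -> 1/6
```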

If we write the training set in matrix/vector form, the summation sign can be dropped from the cost function. We use $X$ and $y$ to denote the inputs and outputs of the training data, where

$$X = \begin{pmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{pmatrix} = \begin{pmatrix} x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_n \\ x^{(2)}_0 & x^{(2)}_1 & \cdots & x^{(2)}_n \\ \vdots & \vdots & \ddots & \vdots \\ x^{(m)}_0 & x^{(m)}_1 & \cdots & x^{(m)}_n \end{pmatrix}, \qquad y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}.$$

Then the cost function can be written as
$$J(\theta) = \frac{1}{2m} \left\| y - X\theta \right\|^2.$$
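This vectorized form maps directly to code. A small self-contained sketch (data and names are made up for illustration):

```python
import numpy as np

# Vectorized sketch of the same cost, J(theta) = (1/2m) * ||y - X theta||^2,
# where X stacks the row vectors (x^(i))^T.  Names and data are illustrative.
def cost_vectorized(theta, X, y):
    m = len(y)
    r = y - X @ theta          # residual vector y - X theta
    return (r @ r) / (2 * m)   # squared Euclidean norm divided by 2m

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(cost_vectorized(np.array([1.0, 2.0]), X, y))   # exact fit -> 0.0
```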

Below we present a method called the normal equation for determining the optimal parameter vector $\theta$. Since the following part is adapted from Wikipedia, please bear with it. The cost function there is defined slightly differently: Wikipedia works with $S = \|y - X\theta\|^2 = 2mJ$, which has no effect on the resulting $\theta$.

For another equally common numerical method, see my other post, Gradient Descent Algorithm.

Derivation of the normal equations

Common method

Define the $i$-th residual to be

$$r^{(i)} = y^{(i)} - \sum_{j=0}^{n} x^{(i)}_j \theta_j.$$

Then $S$ can be rewritten as

$$S = \sum_{i=1}^{m} \left( r^{(i)} \right)^2.$$

$S$ is minimized when its gradient vector is zero. (This follows by definition: if the gradient vector is not zero, there is a direction in which we can move to decrease $S$ further; see maxima and minima.) The elements of the gradient vector are the partial derivatives of $S$ with respect to the parameters:

$$\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} r^{(i)} \frac{\partial r^{(i)}}{\partial \theta_j} \qquad (j = 0, 1, 2, \ldots, n).$$

The derivatives are

$$\frac{\partial r^{(i)}}{\partial \theta_j} = -x^{(i)}_j.$$

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

$$\frac{\partial S}{\partial \theta_j} = 2 \sum_{i=1}^{m} \left( y^{(i)} - \sum_{k=0}^{n} x^{(i)}_k \theta_k \right) \left( -x^{(i)}_j \right) \qquad (j = 0, 1, 2, \ldots, n).$$

Thus if $\hat\theta$ minimizes $S$, we have

$$2 \sum_{i=1}^{m} \left( y^{(i)} - \sum_{k=0}^{n} x^{(i)}_k \hat\theta_k \right) \left( -x^{(i)}_j \right) = 0 \qquad (j = 0, 1, 2, \ldots, n).$$

Upon rearrangement, we obtain the normal equations:

$$\sum_{i=1}^{m} \sum_{k=0}^{n} x^{(i)}_j x^{(i)}_k \hat\theta_k = \sum_{i=1}^{m} x^{(i)}_j y^{(i)} \qquad (j = 0, 1, 2, \ldots, n).$$

The normal equations are written in matrix notation as

$$(X^T X)\hat\theta = X^T y,$$

where $X^T$ is the matrix transpose of $X$. The solution of the normal equations yields the vector $\hat\theta$ of the optimal parameter values.
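In code, the normal equations can be solved directly. A short sketch on synthetic, purely illustrative data (`np.linalg.solve` is used rather than explicitly inverting $X^T X$, which is generally the safer numerical choice):

```python
import numpy as np

# Sketch of estimating theta from the normal equations (X^T X) theta_hat = X^T y.
# np.linalg.solve is used instead of forming the inverse of X^T X explicitly;
# the data below are synthetic and purely illustrative.
def fit_normal_equation(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 10.0, size=50)
X = np.column_stack([np.ones_like(x1), x1])          # first column is x_0 = 1
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.1, size=50)  # noisy y = 1 + 2*x
print(fit_normal_equation(X, y))                     # close to [1.0, 2.0]
```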

Derivation directly in terms of matrices

The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

$$S(\theta) = \left\| y - X\theta \right\|^2 = (y - X\theta)^T (y - X\theta) = y^T y - \theta^T X^T y - y^T X \theta + \theta^T X^T X \theta.$$

Note that $(\theta^T X^T y)^T = y^T X \theta$ has dimension $1 \times 1$ (the number of columns of $y$), so it is a scalar and equal to its own transpose; hence $\theta^T X^T y = y^T X \theta$, and the quantity to minimize becomes

$$S(\theta) = y^T y - 2\theta^T X^T y + \theta^T X^T X \theta.$$

Differentiating this with respect to $\theta$ and equating to zero to satisfy the first-order conditions gives

$$-X^T y + (X^T X)\theta = 0,$$

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that $X$ have full column rank, in which case $X^T X$ is positive definite.
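A quick numeric illustration of both conditions on synthetic data: at the fitted $\hat\theta$ the gradient $-X^T y + (X^T X)\hat\theta$ vanishes, and when $X$ has full column rank, $X^T X$ has strictly positive eigenvalues. The data and names below are mine, for illustration only.

```python
import numpy as np

# Illustrative check of the conditions above on synthetic data: the gradient
# -X^T y + (X^T X) theta_hat vanishes at the solution, and when X has full
# column rank, X^T X is positive definite (all eigenvalues strictly positive).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.normal(size=20)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gradient = -X.T @ y + (X.T @ X) @ theta_hat
print(np.allclose(gradient, 0.0))                  # True up to rounding error
print(np.linalg.matrix_rank(X) == X.shape[1])      # X has full column rank
print(np.all(np.linalg.eigvalsh(X.T @ X) > 0))     # X^T X is positive definite
```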

Derivation without calculus

When $X^T X$ is positive definite, the formula for the minimizing value of $\theta$ can be derived without the use of derivatives. The quantity

$$S(\theta) = y^T y - 2\theta^T X^T y + \theta^T X^T X \theta$$

can be written as
$$\langle \theta, \theta \rangle - 2\left\langle \theta,\, (X^T X)^{-1} X^T y \right\rangle + \left\langle (X^T X)^{-1} X^T y,\, (X^T X)^{-1} X^T y \right\rangle + C,$$

where $C$ depends only on $y$ and $X$, and $\langle \cdot, \cdot \rangle$ is the inner product defined by

$$\langle u, v \rangle = u^T (X^T X) v.$$

It follows that $S(\theta)$ is equal to
$$\left\langle \theta - (X^T X)^{-1} X^T y,\; \theta - (X^T X)^{-1} X^T y \right\rangle + C$$

and therefore minimized exactly when
$$\theta - (X^T X)^{-1} X^T y = 0.$$
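A small numeric check of this conclusion on made-up data: perturbing $\hat\theta = (X^T X)^{-1} X^T y$ in random directions never decreases $S$.

```python
import numpy as np

# Numeric illustration on synthetic data: S(theta) is minimized exactly at
# theta_hat = (X^T X)^{-1} X^T y, so random perturbations never decrease S.
def S(theta, X, y):
    r = y - X @ theta
    return r @ r

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y = rng.normal(size=30)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

s_min = S(theta_hat, X, y)
perturbed = [S(theta_hat + rng.normal(scale=0.1, size=theta_hat.shape), X, y)
             for _ in range(100)]
print(all(s >= s_min for s in perturbed))          # True
```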
