Least Squares and Nearest Neighbors


  • 1. Least squares and nearest neighbors
    • 1.1 Least squares in linear regression
    • 1.2 Nearest neighbors
  • 2. Rationality and differences of least squares and nearest neighbors
    • 2.1 Rationality of least squares and nearest neighbors
    • 2.2 Extensions of these simple procedures

1. Least squares and nearest neighbors

1.1 Least squares in linear regression

Assume we have a data set $\{(X^{(i)}, y^{(i)})\}_{i=1}^N$, and we will fit a linear regression $y = X^T\beta + b$ on this data set (the intercept $b$ can be absorbed into $\beta$ by appending a constant 1 to each input, which is assumed below). Notation to be used:

  • $Y = (y^{(1)}, y^{(2)}, \dots, y^{(N)})$
  • $X = (X^{(1)T}, X^{(2)T}, \dots, X^{(N)T})^T$

First, we need to choose a loss function. Here we choose least squares, which minimizes a quadratic function of the parameter $\beta$,

RSSresidual sum of square(β)=YXβ2F

which leads to the solution $\hat\beta = (X^TX)^{-1}X^TY$ if $X$ has full column rank. The prediction at any input $X$ is then $\hat y = X^T\hat\beta$.
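As a concrete sketch of this closed-form fit, assuming NumPy and hypothetical synthetic data (the sizes, coefficients, and noise level below are illustrative, not from the text):

```python
import numpy as np

# Hypothetical synthetic data: N samples, p features (purely illustrative).
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed-form least squares solution beta_hat = (X^T X)^{-1} X^T Y,
# valid when X has full column rank.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
# beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # numerically safer alternative

# Prediction at a new input x: y_hat = x^T beta_hat
x_new = rng.normal(size=p)
y_hat = x_new @ beta_hat
print(beta_hat, y_hat)
```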

1.2 Nearest neighbors

In regression, the nearest neighbors method averages the outputs of the $k$ nearest points of $X$ as the prediction at $X$, which can be formulated as

$$\hat y(X) = \frac{1}{k}\sum_{X^{(i)} \in N_k(X)} y^{(i)}$$

In classification, the nearest neighbors method takes a majority vote over the labels of the $k$ nearest points of $X$ as the class for $X$, which can be formulated as
$$\hat g(X) = \arg\max_g \sum_{X^{(i)} \in N_k(X)} 1\{y^{(i)} = g\}$$
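Both rules reduce to a few lines of code. A minimal sketch, assuming Euclidean distance and NumPy arrays (the function name and toy data are my own illustrative choices, not from the text):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, task="regression"):
    """k-nearest-neighbor prediction at a single query point x.

    Regression: average of the k nearest outputs.
    Classification: majority vote over the k nearest labels.
    """
    # Euclidean distances from x to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    nn_idx = np.argsort(dists)[:k]           # indices of the k nearest points
    if task == "regression":
        return y_train[nn_idx].mean()        # (1/k) * sum of neighboring y's
    # Classification: most common label among the neighbors.
    return Counter(y_train[nn_idx]).most_common(1)[0][0]

# Hypothetical usage (data is illustrative):
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))
y_reg = X_train[:, 0] + rng.normal(scale=0.1, size=50)
y_cls = (X_train[:, 0] > 0).astype(int)
x_query = np.array([0.2, -0.1])
print(knn_predict(X_train, y_reg, x_query, k=5))
print(knn_predict(X_train, y_cls, x_query, k=5, task="classification"))
```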

2. Rationality and differences of least squares and nearest neighbors

  • least squares makes strong assumptions about structure, while nearest neighbors makes almost none
  • least squares yields stable but possibly inaccurate predictions, while the predictions of nearest neighbors are often accurate but can be unstable

Note that:

  • stable means low variance
  • accurate means low bias

2.1 Rationality of least squares and nearest neighbors

Suppose we have random variables $(X, Y)$ with joint distribution $\Pr(X, Y)$, and we want to find a function $f(X)$ to approximate $Y$. If we use the squared loss as the criterion for choosing $f(X)$,

EPEepxect prediction error(f)=EYf(X)2F=EXEY|X[Yf(X)2F|X]

It suffices to minimize EPE pointwise:
$$f(x) = \arg\min_{c}\, \mathbb{E}_{Y|X}\!\left[\|Y - c\|_F^2 \mid X = x\right]$$

The solution is
$$f(x) = \mathbb{E}[Y \mid X = x]$$
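To see why (a standard argument, filled in here for completeness): for any constant $c$, the cross term $\mathbb{E}_{Y|X}\!\left[Y - \mathbb{E}[Y \mid X=x] \mid X=x\right]$ vanishes, so

$$\mathbb{E}_{Y|X}\!\left[\|Y - c\|_F^2 \mid X = x\right] = \mathbb{E}_{Y|X}\!\left[\|Y - \mathbb{E}[Y \mid X=x]\|_F^2 \mid X = x\right] + \|\mathbb{E}[Y \mid X=x] - c\|_F^2,$$

which is minimized exactly at $c = \mathbb{E}[Y \mid X=x]$.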

Both least squares and nearest neighbors aim to approximate this conditional expectation by averaging.

Least squares assumes a linear structure and approximates the expectation inside the squared loss by averaging over all training data:

$$\widehat{\mathrm{EPE}}(\beta) = \frac{1}{N}\sum_{i=1}^N \left\|y^{(i)} - X^{(i)T}\beta\right\|_F^2$$
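Setting the gradient of this empirical criterion to zero recovers the least squares solution of Section 1.1 (a short check, not spelled out in the original text):

$$\nabla_\beta\, \widehat{\mathrm{EPE}}(\beta) = -\frac{2}{N}\, X^T(Y - X\beta) = 0 \;\Longrightarrow\; \hat\beta = (X^TX)^{-1}X^TY.$$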

Nearest neighbors approximates the conditional expectation in the solution above by averaging the outputs near the target $x$:
$$\hat Y = \mathrm{ave}\left(y^{(i)} \mid X^{(i)} \in N_k(x)\right)$$

So, two approximations are happening in both least squares and nearest neighbors.

Least squares

  1. model structure assumption
  2. averaging over all training data in EPE

Nearest neighbors

  1. conditioning on a small region around the target point x instead of conditioning on x exactly
  2. averaging the outputs that are near x

2.2 Extensions of these simple procedures

Many more complex algorithms are derived from these two:

  • Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors (see the sketch after this list).
  • In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
  • Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
  • Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
  • Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
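
To make the first point above concrete, here is a minimal sketch of kernel-weighted regression with a Gaussian kernel, assuming NumPy (the function name, bandwidth, and toy data are illustrative choices, not from the text); it replaces k-NN's hard 0/1 neighborhood weights with smoothly decaying ones:

```python
import numpy as np

def gaussian_kernel_regression(X_train, y_train, x, bandwidth=1.0):
    """Kernel-weighted average prediction at a query point x.

    Replaces k-NN's 0/1 neighbor weights with Gaussian weights that
    decay smoothly with distance from x; bandwidth controls the decay.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)   # smooth weights
    return np.sum(weights * y_train) / np.sum(weights)  # weighted average

# Hypothetical usage (data is illustrative):
rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)
print(gaussian_kernel_regression(X_train, y_train, np.array([1.0]), bandwidth=0.5))
```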