Statistical Learning Notes (3): Overview of Supervised Learning (3)


Some further remarks on KNN:

Computational complexity
It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

Here N is the size of the training set. If k = 1, every training point is its own neighborhood mean, so we effectively store N values. If k > 1, each query point has a neighborhood containing k training points, and if the neighborhoods of different query points did not overlap, there would be about N/k of them, so we would fit and store roughly N/k means.
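As a quick numerical illustration (the values of N, k and p below are made up for this sketch), the effective number of parameters N/k of KNN can easily exceed the p parameters of a least-squares fit:

```python
# Effective number of parameters: roughly N/k for KNN versus p for least squares.
# N, k and p below are illustrative values, not taken from the text.
N = 200   # training set size
p = 2     # number of inputs fitted by least squares (ignoring the intercept)

for k in (1, 5, 25, 100):
    print(f"k={k:3d}  effective parameters N/k = {N / k:6.1f}  (least squares: p = {p})")
```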
Test set generation

When we generate a scatterplot of the simulated two-class data, we need a method to generate the test set. First we generated 10 means m_k from a bivariate Gaussian distribution N((1, 0)^T, I) and labeled this class BLUE. Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated a N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class.
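A minimal Python sketch of this simulation (the random seed, function name and use of NumPy are my own choices, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

def simulate_mixture(n_per_class=100, n_means=10):
    """Simulate the two-class Gaussian-mixture data described above."""
    # 10 means per class, drawn from bivariate Gaussians with identity covariance
    means_blue = rng.multivariate_normal([1, 0], np.eye(2), size=n_means)
    means_orange = rng.multivariate_normal([0, 1], np.eye(2), size=n_means)

    def draw(means):
        # For each observation pick one of the 10 means uniformly (probability 1/10),
        # then draw the observation from N(m_k, I/5).
        picks = rng.integers(0, n_means, size=n_per_class)
        return np.array([rng.multivariate_normal(means[k], np.eye(2) / 5) for k in picks])

    X = np.vstack([draw(means_blue), draw(means_orange)])
    y = np.array([0] * n_per_class + [1] * n_per_class)   # 0 = BLUE, 1 = ORANGE
    return X, y

X, y = simulate_mixture()
print(X.shape, y.shape)   # (200, 2) (200,)
```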

Restrictions on KNN (1): Number of Samples

We seek a function f(X) for predicting Y given values of the input vector X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: L(Y, f(X)) = (Y − f(X))^2.

Our aim is to choose f so as to minimize the expected (squared) prediction error,

EPE(f) = E(Y − f(X))^2.

By conditioning on X, this can be written as

EPE(f) = E_X E([Y − f(X)]^2 | X),

so it suffices to minimize EPE pointwise.

For a given X = x, we should choose the constant c that is closest (in average squared error) to the label Y:

f(x) = argmin_c E([Y − c]^2 | X = x).

The above equation gives us the exact c, and the solution is

f(x) = E(Y | X = x),

the conditional expectation, also known as the regression function. The x above is a particular input value; in practice the expectation is estimated from the values in the training set.
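As a quick sanity check of this result, the following sketch (with a made-up conditional distribution Y | X = x ~ N(2x + 1, 1), not from the original notes) shows numerically that E([Y − c]^2 | X = x) is minimized near c = E(Y | X = x):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up conditional distribution: given X = x, let Y ~ N(2x + 1, 1).
x = 0.7
y_samples = rng.normal(2 * x + 1, 1.0, size=100_000)

candidates = np.linspace(0.0, 5.0, 501)
losses = [np.mean((y_samples - c) ** 2) for c in candidates]

best_c = candidates[int(np.argmin(losses))]
print("conditional mean E(Y|X=x) =", 2 * x + 1)          # 2.4
print("empirical minimizer of squared loss ≈", best_c)   # close to 2.4
```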

To apply the above theory in practice we can use KNN: for any input x, we estimate f(x) by averaging the responses of its closest k neighbors in the training set. It would seem that with a reasonably large set of training data we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since the local average approximates the conditional mean.
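A minimal sketch of such a k-nearest-neighbor average (the function name and the toy data are my own, for illustration only):

```python
import numpy as np

def knn_regress(x_query, X_train, y_train, k=5):
    """Estimate f(x) = E(Y | X = x) by averaging the k closest training responses."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to training points
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
    return y_train[nearest].mean()

# Tiny illustration with noisy y = x_1 + x_2
rng = np.random.default_rng(2)
X_train = rng.uniform(-1, 1, size=(500, 2))
y_train = X_train.sum(axis=1) + rng.normal(0, 0.1, size=500)
print(knn_regress(np.array([0.2, 0.3]), X_train, y_train, k=15))  # ≈ 0.5
```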

We often do not have very large samples. If a linear or some other more structured model is appropriate, then we can usually get a more stable estimate than with k-nearest neighbors.

Both k-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions.

Least squares assumes f(x) is well approximated by a globally linear function.
k-nearest neighbors assumes f(x) is well approximated by a locally constant function.
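To make the contrast concrete, here is a small 1-D sketch (the synthetic data and the value of k are my own choices) that fits a globally linear least-squares model and a locally constant KNN average to the same sample:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 1-D example: a nonlinear truth with noise, chosen only for illustration
x = np.sort(rng.uniform(0, 3, size=200))
y = np.sin(2 * x) + rng.normal(0, 0.2, size=200)

# Least squares: globally linear fit y ≈ b0 + b1 * x
A = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# KNN: locally constant fit, the average of the k nearest responses
def knn_fit(x0, k=15):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

x0 = 1.0
print("truth      :", np.sin(2 * x0))
print("linear fit :", b0 + b1 * x0)     # biased where the truth is curved
print("KNN fit    :", knn_fit(x0))      # tracks the local level of the data
```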

Restrictions on KNN (2): Curse of Dimensionality and Local Methods in High Dimensions

If we want to compute E(Y | X = x) but there are few (or no) points with X exactly equal to x, we settle for the points neighboring x and let

f(x) = Ave(y_i | x_i ∈ N_k(x)),

where N_k(x) is the neighborhood containing the k training points closest to x.

The above method is called the nearest-neighbor method. It can behave badly when the dimensionality is large, because of the curse of dimensionality, but it works well when the dimensionality is small and the number of samples is large.

To highlight the curse of dimensionality, compare the one-dimensional and the two-dimensional case (the two illustrative plots are omitted here): in 1D a small region can already contain many neighbors, but in 2D a region of the same size contains far fewer.

For example, a 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging: when the dimensionality is large, the edge length of the neighborhood can approach the full range of each input even though the fraction of the volume (and hence of the data) it captures is small.

KNN breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality.
Consider the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube.

Suppose we send out a hypercubical neighborhood about a target point to capture a fraction r of the observations. Since this corresponds to a fraction r of the unit volume (r is a proportion and is less than 1), the expected edge length will be e_p(r) = r^(1/p). In ten dimensions e_10(0.01) = 0.63 and e_10(0.1) = 0.80, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local.” Reducing r dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.
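A small check of the edge-length formula e_p(r) = r^(1/p), reproducing the numbers quoted above:

```python
# Expected edge length of a hypercubical neighborhood that captures a fraction r
# of data uniformly distributed in a p-dimensional unit hypercube: e_p(r) = r**(1/p).
def edge_length(r, p):
    return r ** (1.0 / p)

for r in (0.01, 0.10):
    print(f"p=10, r={r}: edge length ≈ {edge_length(r, 10):.3f}")
# prints ≈ 0.631 and ≈ 0.794, matching (up to rounding) the 63% and 80% quoted above
```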

Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points (training samples) uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression

d(p, N) = (1 − (1/2)^(1/N))^(1/p).

A more complicated expression exists for the mean distance to the closest point. For N = 500, p = 10, d(p, N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. This presents a problem because prediction is much more difficult near the edges of the training sample: for inputs that lie near the center of the training data it is easy to find enough neighbors, but for inputs near the boundary it is not.
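A sketch verifying the quoted value d(10, 500) ≈ 0.52 (the extra sample sizes are my own additions):

```python
# Median distance from the origin to the closest of N points uniformly
# distributed in the p-dimensional unit ball: d(p, N) = (1 - (1/2)**(1/N))**(1/p).
def median_nearest_distance(p, N):
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

print(round(median_nearest_distance(10, 500), 2))   # 0.52, as quoted above
for N in (100, 1000, 10000):                         # extra sample sizes, my own choice
    print(N, round(median_nearest_distance(10, N), 2))
```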

Some extensions of the classification methods

Linear regression and KNN can be improved in the following ways:

1. Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.
2. In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
3. Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
4. Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
The meaning of a basis expansion can be explained as follows: the original inputs x are replaced by transformed features h_1(x), ..., h_M(x) (for example powers and cross-products of the coordinates of x), and the model is taken to be linear in these new features,

f(x) = β_1 h_1(x) + β_2 h_2(x) + ... + β_M h_M(x).
Then, to introduce the kernel on top of a basis expansion, consider minimizing a function of the SVM form, e.g.

minimize (1/2)||β||^2 over β, β_0, subject to y_i (x_i^T β + β_0) ≥ 1 for all i.

We get a solution that depends on the inputs only through inner products,

f(x) = Σ_i α_i y_i <x, x_i> + β_0.

To expand the basis, we replace x by h(x), which gives

f(x) = Σ_i α_i y_i <h(x), h(x_i)> + β_0,

which has a similar form as the SVM solution, with the inner products now taken between the transformed features.

After the features have been mapped to a high-dimensional space, the computational complexity can be reduced by noting that the solution depends on the transformed features only through their inner products, which can be computed with a kernel function

K(x, x_i) = <h(x), h(x_i)>,

where x_i represents a training sample and x represents an input. When K can be evaluated directly in the original input space, we ignore the complexity of computing h explicitly and only use the simple kernel K.

The use of the kernel is, firstly, to guarantee that the features can (implicitly) be mapped to a high-dimensional space and, secondly, to keep the calculation simple; a concrete sketch is given after this list.

5. Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
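As a concrete illustration of the kernel shortcut mentioned above (a sketch of my own; the degree-2 polynomial kernel and its explicit 6-dimensional feature map are standard examples, not necessarily the ones used in the original notes), the following code checks that the kernel evaluated in the original 2-D input space equals an ordinary inner product of the expanded features:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Degree-d polynomial kernel evaluated directly in the input space."""
    return (1 + x @ z) ** d

def h(x):
    """Explicit degree-2 feature map for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 1.0])

print(poly_kernel(x, z))   # kernel computed in the original 2-D space
print(h(x) @ h(z))         # same number, via the 6-D expanded features
```

The point of the sketch is that the left-hand computation never forms h(x) at all, which is what keeps the calculation cheap even when the implicit feature space is very high-dimensional.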












