Statistical Learning Notes (2): Overview of Supervised Learning (2)


Some supplements to the last note:

Prediction and Inference

Prediction is different from inference. In prediction we can treat f as a black box: we use f^ as an estimate of f and do not care about the exact form of f^, as long as it yields accurate predictions for Y. In inference we really do care about the form of f, that is, the relationship between the predictors and the response, the dimension of the model, and so on.

Example:

For instance, consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome. The company is not interested in obtaining a deep understanding of the relationships between each individual predictor and the response; instead, the company simply wants an accurate model to predict the response using the predictors. This is an example of modeling for prediction.

Parametric and Non-parametric Methods

We can use a parametric model to estimate f, but the assumed functional form might be far from the true f. In that case we can use more flexible models that can fit many different possible functional forms of f; however, they require estimating more parameters and may overfit.

Non-parametric models make no assumption about the form of f. The thin-plate spline is one example, but the resulting f^ might be more variable than the true f, so when using thin-plate splines we must select a level of smoothness; this is discussed further in "Resampling Methods" and "Moving Beyond Linearity". Boosting methods, with low interpretability but high flexibility, are covered in "Tree-Based Methods". The lasso (which sets some coefficients to zero) is discussed in "Linear Model Selection and Regularization", and generalized additive models (in which the relationship between each predictor and the response is modeled by a curve) are discussed in "Moving Beyond Linearity". Bagging, boosting and SVMs are covered in "Tree-Based Methods" and "Support Vector Machines".
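
To make the parametric/non-parametric contrast concrete, here is a minimal NumPy sketch. The simulated data, the helper names `true_f` and `knn_regress`, and the choice of k = 9 are all illustrative assumptions, not anything fixed by these notes: a straight-line fit estimates only two parameters and cannot bend toward the true curve, while a K-nearest-neighbors average assumes no functional form at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the true f is nonlinear, so a straight line is "far from the true f".
def true_f(x):
    return np.sin(2 * x)

x_train = np.sort(rng.uniform(0, 3, 100))
y_train = true_f(x_train) + rng.normal(scale=0.3, size=x_train.size)

# Parametric fit: assume f(x) = b0 + b1 * x and estimate the two parameters by least squares.
X = np.column_stack([np.ones_like(x_train), x_train])
beta, *_ = np.linalg.lstsq(X, y_train, rcond=None)

# Non-parametric fit: no assumed form; average the k nearest training responses.
def knn_regress(x0, k=9):
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

for x0 in np.linspace(0, 3, 7):
    linear = beta[0] + beta[1] * x0
    flexible = knn_regress(x0)
    print(f"x={x0:.2f}  true={true_f(x0):+.2f}  linear={linear:+.2f}  knn={flexible:+.2f}")
```

A thin-plate spline would play the same non-parametric role as the KNN average above; the nearest-neighbor smoother is used here only because it fits in a few lines.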

Clustering and other unsupervised learning methods are shown in "Unsupervised Learning". 

Algorithm Division

Quantitative response (e.g. least squares linear regression): regression

Qualitative response (e.g. logistic regression): classification

When evaluating regression methods, we use the test MSE instead of the training MSE.
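
A small sketch of why test MSE, not training MSE, is the right yardstick. The simulated data, the quadratic "true" function, and the particular polynomial degrees are arbitrary choices for illustration: as the fitted polynomial becomes more flexible, the training MSE keeps shrinking, but the test MSE eventually grows again.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                                   # the "true" regression function (unknown in practice)
    return x - 0.5 * x ** 2

x_train = rng.uniform(-2, 2, 20)
y_train = f(x_train) + rng.normal(scale=0.5, size=20)
x_test = rng.uniform(-2, 2, 2000)           # a large held-out test set
y_test = f(x_test) + rng.normal(scale=0.5, size=2000)

for degree in (1, 2, 5, 15):
    coef = np.polyfit(x_train, y_train, degree)            # fit on the training data only
    train_mse = np.mean((y_train - np.polyval(coef, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coef, x_test)) ** 2)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}   test MSE {test_mse:.3f}")
```

The exact numbers depend on the random seed, but the training column decreases monotonically with degree while the test column is smallest near the true (quadratic) form.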

Classification:

When evaluating a classifier, we use the training error rate,

$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i),$$

where $\hat{y}_i$ is the predicted class label for the $i$-th training observation and $I(\cdot)$ is the indicator function. As with the MSE, performance is better evaluated on the test set, via the test error rate $\mathrm{Ave}\big(I(y_0 \neq \hat{y}_0)\big)$ averaged over test observations $(x_0, y_0)$.
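
Both error rates are just averages of the indicator $I(y \neq \hat{y})$ over the training and test sets. A tiny sketch with made-up label vectors (in practice the predictions would come from a fitted classifier):

```python
import numpy as np

# Hypothetical labels and predictions, made up purely to show the computation.
y_train     = np.array(["blue", "blue", "orange", "orange", "blue"])
y_train_hat = np.array(["blue", "orange", "orange", "orange", "blue"])
y_test      = np.array(["orange", "blue", "blue", "orange"])
y_test_hat  = np.array(["orange", "orange", "blue", "blue"])

train_error_rate = np.mean(y_train_hat != y_train)   # (1/n) * sum of I(y_i != y_hat_i)
test_error_rate  = np.mean(y_test_hat != y_test)     # Ave(I(y_0 != y_hat_0))
print(train_error_rate, test_error_rate)             # 0.2 0.5
```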

Bayes classifier:

The Bayes classifier assigns each test observation with predictor vector $x_0$ to the class $j$ for which the conditional probability

$$\Pr(Y = j \mid X = x_0)$$

is largest. The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate, with an overall error rate of

$$1 - E\left[\max_{j} \Pr(Y = j \mid X)\right],$$

where the expectation averages the probability over all possible values of $X$.
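
The Bayes classifier is only computable when the conditional distribution of Y given X is known, which essentially never happens with real data. The sketch below therefore uses simulated data with hand-picked Gaussian class densities (the means, the prior, and the sample size are arbitrary assumptions) so that $\Pr(Y = j \mid X = x_0)$, the Bayes classifier, and the Bayes error rate can all be evaluated exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulation where the class-conditional densities are KNOWN, so the exact
# conditional probability Pr(Y = j | X = x) -- and hence the Bayes classifier --
# can be written down. The two Gaussians and the prior are made up for illustration.
def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

prior_orange = 0.5

def posterior_orange(x):
    num = prior_orange * normal_pdf(x, mu=1.0)
    return num / (num + (1 - prior_orange) * normal_pdf(x, mu=-1.0))

# Bayes classifier: pick the class with the largest conditional probability.
def bayes_classify(x):
    return np.where(posterior_orange(x) > 0.5, "orange", "blue")

# Draw a large test set and estimate the Bayes error rate 1 - E[max_j Pr(Y=j|X)].
n = 200_000
y = rng.choice(["orange", "blue"], size=n, p=[prior_orange, 1 - prior_orange])
x = np.where(y == "orange", rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))
p = posterior_orange(x)
print("Bayes error rate       ≈", round(1 - np.mean(np.maximum(p, 1 - p)), 4))
print("Bayes classifier error ≈", round(np.mean(bayes_classify(x) != y), 4))
```

Both printed numbers should agree, at roughly 0.16 for these particular densities, and no classifier can do better than that on this simulated problem.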


K-nearest neighbors

Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by $\mathcal{N}_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$:

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i = j).$$
This estimate is simply the ratio of the number of neighbors with label j to the total number of neighbors K. Finally, KNN applies the Bayes rule across the different values of j and classifies the test observation x0 to the class with the largest estimated probability.
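
Here is a from-scratch sketch of this rule in plain NumPy. The function name `knn_predict` and the toy 2-D training set are made up for illustration; the code finds the K nearest training points, turns their label counts into estimated probabilities, and returns the most probable class.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=3):
    """Classify one test point x0 with the KNN rule sketched above."""
    # 1. Identify the K training points closest to x0 (Euclidean distance) -- this is N0.
    dist = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dist)[:k]
    # 2. Estimate Pr(Y = j | X = x0) as the fraction of the K neighbors with label j.
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    probs = dict(zip(labels, counts / k))
    # 3. Assign x0 to the class with the largest estimated probability.
    return max(probs, key=probs.get), probs

# Toy training set (made up): two blue points near the origin, three orange points near (1, 1).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_train = np.array(["blue", "blue", "orange", "orange", "orange"])

# Predicts "blue" with estimated probabilities 2/3 blue and 1/3 orange.
print(knn_predict(X_train, y_train, x0=np.array([0.1, 0.0]), k=3))
```

With K = 3 and a test point sitting next to the two blue training points, this reproduces the 2/3 versus 1/3 split described in the example below.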

Example:

Suppose that we choose K = 3. Then KNN first identifies the three training observations closest to the test point (drawn as a cross in the original figure; its neighborhood is the surrounding circle). The neighborhood consists of two blue points and one orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class.


Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.

The choice of K has a drastic effect on the KNN classifier obtained. Consider two KNN fits to the same simulated data, one with a very small K and one with a very large K.

When K = 1, the decision boundary is overly flexible and finds patterns in the data that don't correspond to the Bayes decision boundary. This corresponds to a classifier that has low bias but very high variance. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. This corresponds to a low-variance but high-bias classifier.


As K decreases, flexibility increases: the training error rate declines steadily, while the test error rate first falls and then rises again, tracing the usual U-shape.
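
A rough numeric version of that trade-off, using nothing beyond NumPy. The simulated two-class data, the class means, the sample sizes, and the grid of K values are all arbitrary assumptions for illustration: training error keeps falling as K shrinks, while test error bottoms out at an intermediate K.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Two classes in 2-D with overlapping Gaussian clouds (made-up toy data)."""
    y = rng.choice([0, 1], size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.2, -1.2) * np.array([1.0, 0.5])
    return X, y

def knn_error(X_fit, y_fit, X_eval, y_eval, k):
    """Error rate of a KNN classifier fit on (X_fit, y_fit), evaluated on (X_eval, y_eval)."""
    pred = np.empty(len(X_eval), dtype=int)
    for i, x0 in enumerate(X_eval):
        nearest = np.argsort(np.linalg.norm(X_fit - x0, axis=1))[:k]
        pred[i] = np.round(y_fit[nearest].mean())      # majority vote for 0/1 labels
    return np.mean(pred != y_eval)

X_train, y_train = simulate(200)
X_test, y_test = simulate(2000)

for k in (1, 5, 25, 100):
    tr = knn_error(X_train, y_train, X_train, y_train, k)
    te = knn_error(X_train, y_train, X_test, y_test, k)
    print(f"K={k:3d}  training error {tr:.3f}  test error {te:.3f}")
```

With K = 1 the training error is exactly zero (every training point is its own nearest neighbor), which is the low-bias, high-variance extreme described above.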


In"Resampling Methods", we return to this topic and discuss various methods for estimating test error rates and thereby choosing theoptimal level of flexibility for a given statistical learning method.
