ISLR Chapter 3


Notes on ISLR Chapter 3

Common Linear Models

  1. Simple linear regression

    $Y = \beta_0 + \beta_1 X$

  2. Multiple linear regression

    $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$

  3. Extended linear regression (interaction term)

    $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$

    This relaxes the multiple linear model's assumption that $X_1$ and $X_2$ do not act synergistically, i.e., it adds an interaction effect.
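
All three forms can be fit with statsmodels' formula interface. Below is a minimal sketch on synthetic data (the variables `X1`, `X2` and all coefficients are invented for illustration); in a formula, `X1:X2` adds the interaction term of model 3:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
# Synthetic response with a real interaction effect plus noise.
df["Y"] = (2 + 1.5 * df.X1 - 0.8 * df.X2 + 0.6 * df.X1 * df.X2
           + rng.normal(scale=0.5, size=n))

simple   = smf.ols("Y ~ X1", data=df).fit()                # model 1
multiple = smf.ols("Y ~ X1 + X2", data=df).fit()           # model 2
interact = smf.ols("Y ~ X1 + X2 + X1:X2", data=df).fit()   # model 3
print(interact.params)   # estimates of beta0 .. beta3
```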

Metrics for Evaluating a Linear Model

  1. F-statistic
    The F-statistic assesses whether sales is related to the predictors as a group. When a relationship exists the F-statistic is greater than 1; for the same number of samples, the larger the F-statistic, the stronger the evidence that sales is related to the predictors. Whether a smaller value still indicates a relationship can be checked in an F-distribution table. Here the F-statistic is 570, so we conclude that a relationship exists.

  2. RSE
    Short for Residual Standard Error. The smaller the RSE, the more accurately the model fits the training data.

    The RSE is considered a measure of the lack of fit of the model (3.5) to
    the data.

  3. R²-statistic
    Unlike the RSE, R² always lies between 0 and 1; the larger R² is, the more strongly Y is related to X in the model.

  4. p-value
    The smaller the p-value, the stronger the evidence that the predictor X is related to Y. In this model, the newspaper variable clearly has no significant relationship with sales, so this variable should be dropped in later model refinement (p81).
    The magnitude of a coefficient cannot by itself be used to judge whether a variable is related to the response (p134, exercise 3c).

Consequently, it is a simple matter to compute the probability of observing any value equal to |t| or larger, assuming β1 = 0.
We call this probability the p-value. Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.

  5. t-statistic
    The t-statistic can be positive or negative, so compare absolute values: a large |t| indicates that the predictor X is related to Y. Here the t-statistic for newspaper is -0.18, so newspaper clearly has no significant relationship with sales.

  6. SE
    The standard error is used to compute confidence intervals; the usual approximate 95% confidence interval is
    $[\hat{\beta} - 2 \cdot SE(\hat{\beta}),\ \hat{\beta} + 2 \cdot SE(\hat{\beta})]$

  7. studentized residuals
    Studentized residuals are used to detect outliers in the data; an observation whose studentized residual exceeds 3 in absolute value is generally classified as an outlier and handled accordingly, e.g., discarded.

  8. TSS

    TSS measures the total variance in the response Y, and can be
    thought of as the amount of variability inherent in the response before the
    regression is performed.
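
Most of the statistics above can be read directly off a fitted statsmodels model. A minimal sketch, again on invented synthetic data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["X1", "X2"])
df["Y"] = 2 + 1.5 * df.X1 - 0.8 * df.X2 + rng.normal(scale=0.5, size=200)
fit = smf.ols("Y ~ X1 + X2", data=df).fit()

print("F-statistic:", fit.fvalue, " p-value:", fit.f_pvalue)
print("RSE:", np.sqrt(fit.mse_resid))        # sqrt(RSS / (n - p - 1))
print("R^2:", fit.rsquared)
print("TSS:", fit.centered_tss, " RSS:", fit.ssr)
print("t-statistics:\n", fit.tvalues)
print("coefficient p-values:\n", fit.pvalues)

# Approximate 95% CI: coefficient +/- 2*SE (fit.conf_int() gives the exact one).
print(pd.DataFrame({"lo": fit.params - 2 * fit.bse,
                    "hi": fit.params + 2 * fit.bse}))

# Studentized residuals: |value| > 3 flags a potential outlier.
stud = fit.get_influence().resid_studentized_external
print("potential outliers:", np.where(np.abs(stud) > 3)[0])
```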

Some important questions

1. Is There a Relationship Between the Response and Predictors?

In simple regression, we only need to test whether the single coefficient is zero.
In multiple regression, we must test whether all of the coefficients are zero; if even one is nonzero, the null hypothesis does not hold.

in the simple linear regression setting, in order to determine
whether there is a relationship between the response and the predictor we
can simply check whether β1 = 0. In the multiple regression setting with p
predictors, we need to ask whether all of the regression coefficients are zero,
i.e. whether β1 = β2 = · · · = βp = 0.

This hypothesis test is performed by computing the F-statistic

$F = \dfrac{(TSS - RSS)/p}{RSS/(n - p - 1)}$


The threshold for judging F depends on the sizes of n and p; consult an F-distribution table.

When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small.

When testing individual variables, note:

it seems likely that if any one of the p-values for the individual variables is very small, then at least one of the predictors is related to the response. However, this logic is flawed, especially when the number of predictors p is large

When the F-statistic cannot be used:

sometimes we have a very large number of variables. If p > n then there are more coefficients βj to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F-statistic cannot be used, and neither can most of the other concepts that we have seen so far in this chapter.
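
To make the F-statistic concrete, the sketch below recomputes it by hand from TSS and RSS on synthetic data and checks it against the value statsmodels reports:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["X1", "X2"])
df["Y"] = 2 + 1.5 * df.X1 - 0.8 * df.X2 + rng.normal(scale=0.5, size=200)
fit = smf.ols("Y ~ X1 + X2", data=df).fit()

TSS, RSS = fit.centered_tss, fit.ssr
p, n = int(fit.df_model), int(fit.nobs)
F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
print(F, fit.fvalue)                 # the two values agree
print(stats.f.sf(F, p, n - p - 1))   # p-value from the F(p, n-p-1) distribution
```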

2. Deciding on Important Variables

All subsets of the p variables number $2^p$ in total:

there are a total of 2p models that contain subsets of p variables. This means that even for moderate p, trying out every possible subset of the predictors is infeasible.

Main approaches:

  1. Forward selection
    Start from the null model and add variables one at a time until a stopping condition is met (a minimal sketch appears after this list).
    This additive method has a drawback: it cannot respond to later changes. A predictor may be significant at first, and so be brought into the regression, but become insignificant after other predictors are added later; it is nonetheless never removed from the equation. In other words, forward selection only considers adding variables, never removing them.
  2. Backward selection (cannot be used if p > n)
    This subtractive method also has an obvious drawback: it starts by bringing every variable into the regression, which makes the computation relatively heavy. If unimportant variables were never brought in to begin with, some computation could be saved.

    We start with all variables in the model, and remove the variable with the largest p-value.

  3. Mixed selection
    The two methods above each have their own characteristics. If the predictors X1, X2, ..., Xk were completely independent, either method would do; in real data, however, the predictors usually are not independent but somewhat correlated, so as variables are added to and removed from the regression, the contribution of individual predictors changes. A natural idea is therefore to combine the two approaches: as its contribution to the regression changes, each predictor may at any time be brought into the equation or removed from it. In the final model, every included predictor is significant and every excluded predictor is insignificant.
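
A minimal greedy forward-selection sketch (not ISLR's code; the 0.05 stopping threshold and the synthetic data are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def forward_selection(df, response, alpha=0.05):
    """At each step add the candidate with the smallest p-value; stop when
    no remaining candidate is significant at the chosen alpha level."""
    remaining = [c for c in df.columns if c != response]
    selected = []
    while remaining:
        trials = []
        for cand in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [cand])
            fit = smf.ols(formula, data=df).fit()
            trials.append((fit.pvalues[cand], cand))
        best_p, best = min(trials)
        if best_p >= alpha:      # stopping rule
            break
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["X1", "X2", "X3"])
df["Y"] = 2 + 1.5 * df.X1 - 0.8 * df.X2 + rng.normal(scale=0.5, size=200)  # X3 is pure noise
print(forward_selection(df, "Y"))    # typically ['X1', 'X2']
```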

3. Model Fit

  1. R2
    An R2 value close to 1 indicates that the model explains a large portion
    of the variance in the response variable

  2. RSE
    Models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.

    $RSE = \sqrt{\dfrac{RSS}{n - p - 1}}$

  3. Graphical summaries
    Graphical summaries can reveal problems with a model that are not visible from numerical statistics.

4. Predictions

Sources of uncertainty in prediction:

  1. The coefficient estimates carry error, so take the confidence interval into account when predicting.

  2. Model bias.

  3. Prediction intervals are always wider than confidence intervals
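
In statsmodels, `get_prediction` returns both intervals side by side; a minimal sketch on synthetic data (the new `X1`, `X2` values are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["X1", "X2"])
df["Y"] = 2 + 1.5 * df.X1 - 0.8 * df.X2 + rng.normal(scale=0.5, size=200)
fit = smf.ols("Y ~ X1 + X2", data=df).fit()

new = pd.DataFrame({"X1": [0.5], "X2": [-1.0]})
frame = fit.get_prediction(new).summary_frame(alpha=0.05)
# mean_ci_*: 95% confidence interval for the average response at this x.
# obs_ci_*:  95% prediction interval for one new observation (always wider).
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```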

Potential Problems

  • Non-linearity of the response-predictor relationships.

    Residual plots are a useful graphical tool for identifying non-linearity.

  • Correlation of error terms

An important assumption of the linear regression model is that the error
terms, ε1, ε2, ..., εn, are uncorrelated.
Why this occurs:

Such correlations frequently occur in the context of time series data, which consists of observations for which measurements are obtained at discrete points in time. In many cases, observations that are obtained at adjacent time points will have positively correlated errors; in the residual plot, adjacent residuals then tend to take on similar values.

  • Non-constant variance of error terms

Another important assumption of the linear regression model is that the error terms have a constant variance.
Unfortunately, it is often the case that the variances of the error terms are non-constant.

For instance, the variances of the error terms may increase with the value of the response.

Heteroscedasticity
Heteroscedasticity is defined in contrast to homoscedasticity. To guarantee that the regression parameter estimates have good statistical properties, the classical linear regression model makes an important assumption: the random error terms in the population regression function all have the same variance (homoscedasticity). If this assumption is not satisfied, i.e., the random error terms have differing variances, the linear regression model is said to exhibit heteroscedasticity.

Remedies:

When faced with this problem, one possible solution is to transform the response Y using a concave function such as log Y or √Y.

Weighted least squares:

Reweight the original model so that the new model no longer exhibits heteroscedasticity, then estimate its parameters by ordinary least squares; a minimal sketch follows.
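
A weighted-least-squares sketch, under the assumed (hypothetical) structure that the error standard deviation grows proportionally to x, so that weights 1/x² are appropriate:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=300)
y = 3 + 2 * x + rng.normal(scale=0.5 * x)     # heteroscedastic: sd grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()  # weight ~ 1 / Var(error_i)
print("OLS SEs:", ols.bse)
print("WLS SEs:", wls.bse)                    # WLS standard errors are valid here
```
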
  • Outliers
    These may be caused by human recording errors during data collection.
    They can be detected with the studentized residuals described above.

  • High-leverage points
    In order to quantify an observation's leverage, we compute the leverage
    statistic (see the sketch after this list):

    $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n}(x_{i'} - \bar{x})^2}$

  • Collinearity
    A better way to assess multicollinearity is to compute the variance inflation factor (VIF).
    The smallest possible value for VIF is 1, which indicates the complete absence of collinearity.
    The VIF for each variable can be computed using the formula

    $\mathrm{VIF}(\hat{\beta}_j) = \dfrac{1}{1 - R^2_{X_j|X_{-j}}}$

    where $R^2_{X_j|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other
    predictors. If $R^2_{X_j|X_{-j}}$ is close to one, then collinearity is present, and so
    the VIF will be large.
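
Both the leverage statistic and the VIF are easy to pull from statsmodels. A minimal sketch on synthetic data with a deliberately collinear predictor (`X3` is invented as X1 plus noise; the 3×average-leverage cutoff is a common rule of thumb, not from ISLR):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["X1", "X2"])
df["X3"] = df.X1 + rng.normal(scale=0.1, size=200)   # nearly collinear with X1
df["Y"] = 2 + 1.5 * df.X1 - 0.8 * df.X2 + rng.normal(scale=0.5, size=200)
fit = smf.ols("Y ~ X1 + X2 + X3", data=df).fit()

# Leverage statistic h_i: the diagonal of the hat matrix.
lev = fit.get_influence().hat_matrix_diag
avg = (int(fit.df_model) + 1) / int(fit.nobs)        # average leverage (p+1)/n
print("high-leverage points:", np.where(lev > 3 * avg)[0])

# VIF per predictor; X1 and X3 should come out large here.
exog = fit.model.exog                                # includes the intercept column
for j, name in enumerate(fit.model.exog_names):
    if name != "Intercept":
        print(name, variance_inflation_factor(exog, j))
```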


KNN Regression

When a parametric approach is preferable:
the parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of f.

Find the K nearest neighbors of a query point and assign it the average of those neighbors' responses:

$\hat{f}(x_0) = \dfrac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i$

A more useful variant gives neighbors at different distances different weights, e.g., weights inversely proportional to distance (via a combining function).
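
scikit-learn's `KNeighborsRegressor` implements both the plain average and the distance-weighted variant; a minimal sketch on synthetic one-dimensional data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=200)

uniform  = KNeighborsRegressor(n_neighbors=5).fit(x, y)   # plain K-neighbor average
weighted = KNeighborsRegressor(n_neighbors=5,
                               weights="distance").fit(x, y)  # inverse-distance weights

x0 = np.array([[3.0]])
print(uniform.predict(x0), weighted.predict(x0))
```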

Related links

ISLR exercise solutions
Stepwise regression analysis
Heteroscedasticity and weighted least squares
