  1. Simple linear regression


  2. Multiple linear regression


  3. Extended linear regression


    Relaxes the additive assumption of the multiple linear model, under which X1 and X2 have no interaction (synergy) effect.
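A minimal sketch of what removing the additive assumption looks like in practice: the product X1*X2 is added as an extra column before fitting. The data here are synthetic and noise-free (an assumption made purely for illustration), the helper name `fit` is hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Synthetic, noise-free data (illustrative assumption):
# y = 1 + 2*x1 + 3*x2 + 4*x1*x2
x1 = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
x2 = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
y = 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

def fit(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, RSS)."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss

# Additive model: y ~ x1 + x2
beta_add, rss_add = fit([x1, x2], y)

# Model with an interaction term: y ~ x1 + x2 + x1*x2
beta_int, rss_int = fit([x1, x2, x1 * x2], y)

print("additive RSS:   ", round(rss_add, 4))
print("interaction RSS:", round(rss_int, 4))
print("interaction coefficients:", np.round(beta_int, 4))
```

Because the synthetic response really does contain an X1*X2 effect, the additive model cannot drive the RSS to zero, while the model with the interaction column recovers the generating coefficients.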


  1. F-statistic

  2. RSE
    Short for Residual Standard Error. The smaller the RSE, the better the model fits the training data.

  3. R2-statistic
    Unlike RSE, R2 always lies between 0 and 1. The larger R2 is, the more strongly Y is related to X in the model.

  4. p-value
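    The smaller the p-value, the stronger the evidence that this X is associated with Y. In this model the newspaper variable clearly shows no significant association with sales, so it should be dropped when the model is refined later (p81).
    The size of a coefficient alone cannot be used to judge whether a variable is associated with the response (p134 3.c).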

Consequently, it is a simple matter to compute the probability of observing any value equal to |t| or larger, assuming β1 = 0.
We call this probability the p-value. Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.

  1. t-statistic
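    The t-statistic is signed, so compare absolute values: a large absolute value indicates that the variable X is associated with Y. Here the t-statistic for newspaper is -0.18, so newspaper clearly has no significant association with sales.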

  2. SE

  3. studentized residuals
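    Studentized residuals are used to detect outliers in the data. An observation whose studentized residual exceeds 3 in absolute value is generally treated as an outlier and needs to be handled, for example by removing it.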

  4. TSS
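The quantities listed above are easy to compute by hand for simple linear regression. The following is a sketch on a tiny made-up data set (an assumption, not data from these notes), using the standard formulas RSE = sqrt(RSS / (n - 2)), R2 = 1 - RSS / TSS, SE(b1) = RSE / sqrt(sum((xi - x_bar)^2)), and t = b1 / SE(b1).

```python
import math

# Tiny made-up data set (illustrative assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Least-squares estimates for y = b0 + b1 * x.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)        # residual sum of squares
tss = sum((yi - y_bar) ** 2 for yi in y)    # total sum of squares

rse = math.sqrt(rss / (n - 2))              # residual standard error
r2 = 1 - rss / tss                          # R2 statistic
se_b1 = rse / math.sqrt(sxx)                # standard error of the slope
t_b1 = b1 / se_b1                           # t-statistic for H0: b1 = 0

print(f"b0={b0:.3f} b1={b1:.3f} RSS={rss:.3f} TSS={tss:.3f}")
print(f"RSE={rse:.3f} R2={r2:.3f} SE(b1)={se_b1:.3f} t={t_b1:.3f}")
```

Turning the t-statistic into a p-value additionally requires the t distribution with n - 2 degrees of freedom, which is left out here to keep the sketch dependency-free.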

Some important questions

1. Is There a Relationship Between the Response and Predictors?


This hypothesis test (H0: β1 = β2 = · · · = βp = 0) is performed by computing the F-statistic, F = ((TSS - RSS) / p) / (RSS / (n - p - 1)).


In the simple linear regression setting, in order to determine whether there is a relationship between the response and the predictor we can simply check whether β1 = 0. In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e. whether β1 = β2 = · · · = βp = 0.


When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small.
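As a rough illustration of the F-statistic, the sketch below computes F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for two synthetic responses: one that depends on a predictor and one that is pure noise. The data, the seed, and the helper name `f_statistic` are assumptions made for this example, and NumPy is assumed to be available.

```python
import numpy as np

def f_statistic(X, y):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for a fit with intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # least-squares fit
    rss = np.sum((y - design @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(((tss - rss) / p) / (rss / (n - p - 1)))

# Synthetic data (illustrative assumption): n = 100 observations, p = 3 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
noise = rng.normal(size=100)

y_related = 1.0 + 2.0 * X[:, 0] + noise   # depends on the first predictor
y_unrelated = noise                        # no predictor matters

f_related = f_statistic(X, y_related)
f_unrelated = f_statistic(X, y_unrelated)
print(f"F (related):   {f_related:.1f}")
print(f"F (unrelated): {f_unrelated:.1f}")
```

When H0 holds, the F-statistic is expected to be close to 1; when some predictor matters, it is expected to be much larger.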


It seems likely that if any one of the p-values for the individual variables is very small, then at least one of the predictors is related to the response. However, this logic is flawed, especially when the number of predictors p is large.


Sometimes we have a very large number of variables. If p > n then there are more coefficients βj to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F-statistic cannot be used, and neither can most of the other concepts that we have seen so far in this chapter.

2. Deciding on Important Variables

There are a total of 2^p models that contain subsets of p variables. This means that even for moderate p, trying out every possible subset of the predictors is infeasible.


  1. Forward selection
    Start from the null model and add variables one at a time until some stopping condition is met.
  2. Backward selection (cannot be used if p > n)

    We start with all variables in the model and remove the variable with the largest p-value, then refit and repeat until a stopping rule is reached.

  3. Mixed selection
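    Each of the two methods above has its own characteristics. If the predictors X1, X2, ..., Xk were completely independent, the two could simply be combined. In real data, however, the predictors are usually correlated rather than independent, so the contribution of a given predictor to the regression changes as other variables are added or removed. A natural idea is therefore to combine the two methods: each predictor may be brought into or dropped from the regression at any step as its contribution changes, and in the final model every included predictor is significant and every excluded predictor is not.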
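Forward selection can be sketched in a few lines. This version starts from the null model and, at each step, adds the variable that gives the lowest RSS; it stops after a fixed number of variables, which is one possible stopping rule chosen here as an assumption. The names `rss_of`, `forward_selection`, and `max_vars` are hypothetical, the data are synthetic, and NumPy is assumed to be available.

```python
import numpy as np

def rss_of(X, y, cols):
    """RSS of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))

def forward_selection(X, y, max_vars):
    """Greedily add the variable that lowers RSS the most, up to max_vars."""
    selected = []                          # null model: intercept only
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (illustrative assumption): only columns 2 and 4 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + rng.normal(scale=0.5, size=200)

order = forward_selection(X, y, max_vars=2)
print("selected columns, in order of entry:", order)
```

With p variables this fits on the order of p^2 models rather than all 2^p subsets, which is why the stepwise approaches remain feasible for moderate p.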

3. Model Fit

  1. R2
    An R2 value close to 1 indicates that the model explains a large portion
    of the variance in the response variable.

  2. RSE
    Models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.


  3. Plots
    Graphical summaries can reveal problems with a model that are not visible from numerical statistics.
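The point about RSE follows directly from its definition, RSE = sqrt(RSS / (n - p - 1)). A small arithmetic sketch with made-up numbers (an assumption for illustration): adding a second variable lowers RSS only slightly, yet the RSE goes up because the denominator shrinks.

```python
import math

def rse(rss, n, p):
    """Residual standard error for a model with p predictors and an intercept."""
    return math.sqrt(rss / (n - p - 1))

n = 10
rse_one_var = rse(rss=20.0, n=n, p=1)    # sqrt(20.0 / 8)
rse_two_vars = rse(rss=19.9, n=n, p=2)   # sqrt(19.9 / 7)

print(f"RSE with 1 variable:  {rse_one_var:.4f}")
print(f"RSE with 2 variables: {rse_two_vars:.4f}")
```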

4. Predictions


  1. Take confidence intervals into account when making predictions

  2. model bias

  3. Prediction intervals are always wider than confidence intervals
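Why prediction intervals are wider can be seen from the standard errors in simple linear regression. Under the usual model assumptions, the standard error for the mean response at x0 is RSE * sqrt(1/n + (x0 - x_bar)^2 / Sxx), while the standard error for a new observation adds the irreducible error: RSE * sqrt(1 + 1/n + (x0 - x_bar)^2 / Sxx). Both intervals use the same t multiplier, so comparing the standard errors is enough. The data and the function name below are assumptions for illustration.

```python
import math

def interval_standard_errors(x, y, x0):
    """Return (se_mean, se_pred) at x0 for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))

    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = rse * math.sqrt(spread)        # drives the confidence interval
    se_pred = rse * math.sqrt(1 + spread)    # drives the prediction interval
    return se_mean, se_pred

# Tiny made-up data set (illustrative assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

se_mean, se_pred = interval_standard_errors(x, y, x0=3.0)
print(f"SE for mean response:   {se_mean:.4f}")
print(f"SE for new observation: {se_pred:.4f}")
```

The extra 1 under the square root is the variance of the new observation's own error term, so the prediction standard error exceeds the confidence standard error at every x0.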

Potential Problems

  • Non-linearity of the response-predictor relationships.

    Residual plots are a useful graphical tool for identifying non-linearity.

  • Correlation of error terms

Such correlations frequently occur in the context of time series data, which consists of observations for which measurements are obtained at discrete points in time. In many cases, observations that are obtained at adjacent time points will have positively correlated errors, so adjacent residuals tend to take on similar values.

  • Non-constant variance of error terms

Another important assumption of the linear regression model is that the error terms have a constant variance.
Unfortunately, it is often the case that the variances of the error terms are non-constant.

The variances of the error terms may increase with the value of the response.

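Heteroscedasticity is defined in contrast to homoscedasticity. Homoscedasticity is a key assumption of the classical linear regression model, made so that the estimated regression coefficients have good statistical properties: the random error terms in the population regression function all have the same variance. If this assumption fails, that is, if the random error terms have different variances, the linear regression model is said to exhibit heteroscedasticity.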


When faced with this problem, one possible solution is to transform the response Y using a concave function such as log Y or √Y.
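A small sketch of why a log transform can help. It assumes the errors are multiplicative, so the spread of Y grows in proportion to its level; the error factors and mean levels below are made up for illustration. On the log scale the multiplicative factor becomes an additive term, and the spread is the same at every level.

```python
import math
import statistics

# Multiplicative error factors and mean levels (illustrative assumptions).
factors = [0.8, 0.9, 1.0, 1.1, 1.25]
levels = [10.0, 100.0, 1000.0]

sd_raw = []   # spread of Y at each level
sd_log = []   # spread of log Y at each level
for level in levels:
    ys = [level * f for f in factors]
    sd_raw.append(statistics.pstdev(ys))
    sd_log.append(statistics.pstdev([math.log(v) for v in ys]))

for level, raw, logged in zip(levels, sd_raw, sd_log):
    print(f"level={level:7.1f}  sd(Y)={raw:9.3f}  sd(log Y)={logged:.4f}")
```

If the errors are not multiplicative, the log transform may over- or under-correct, which is one reason the notes mention √Y as an alternative.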


  • Outliers

  • High-leverage points
    In order to quantify an observation’s leverage, we compute the leverage statistic.
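For simple linear regression the leverage statistic has a closed form, h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2. The sketch below applies it to a made-up predictor with one value far from the rest (the data and the function name are assumptions for illustration).

```python
def leverage(x):
    """Leverage statistic h_i for each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up predictor values; the last one is far from the others.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverage(x)

for xi, hi in zip(x, h):
    print(f"x={xi:5.1f}  leverage={hi:.3f}")
print("average leverage:", sum(h) / len(h))   # equals (p + 1) / n with p = 1
```

Each h_i lies between 1/n and 1, and the average over all observations is (p + 1)/n, so an observation whose leverage greatly exceeds that average is worth a closer look.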


  • Collinearity
    A better way to assess multicollinearity is to compute the variance inflation factor (VIF).
    The smallest possible value for VIF is 1, which indicates the complete absence of collinearity.
    The VIF for each variable can be computed using the formula VIF(βj) = 1 / (1 - R2(Xj|X-j)), where R2(Xj|X-j) is the R2 from a regression of Xj onto all of the other predictors.
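The VIF can be computed directly from its definition: regress each predictor on all the others, take the R2 of that regression, and return 1 / (1 - R2). The sketch below does this on synthetic data in which the second column is nearly a copy of the first; the data, the seed, and the function name `vif` are assumptions for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X, computed as 1 / (1 - R2_j)."""
    n, p = X.shape
    values = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # intercept + other predictors
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        rss = np.sum((target - design @ beta) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        r2 = 1 - rss / tss
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

# Synthetic predictors (illustrative assumption): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF per column:", np.round(vifs, 2))
```

The two nearly identical columns get large VIF values, while the unrelated third column stays close to the minimum of 1.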


The parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of f.
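A small sketch of this comparison, using K-nearest-neighbors regression as the nonparametric method. The true function is assumed to be exactly linear, the training noise is a fixed alternating pattern rather than random draws, and K = 3; all of these are assumptions chosen to keep the example deterministic, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def true_f(x):
    """Assumed true regression function (linear)."""
    return 2.0 + 3.0 * x

# Training data: linear signal plus a fixed alternating +/-0.5 disturbance.
x_train = [float(i) for i in range(10)]
y_train = [true_f(x) + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(x_train)]

# Test points halfway between training points, scored against the true f.
x_test = [i + 0.5 for i in range(9)]

b0, b1 = fit_line(x_train, y_train)
mse_linear = sum((b0 + b1 * x - true_f(x)) ** 2 for x in x_test) / len(x_test)
mse_knn = sum((knn_predict(x_train, y_train, x, 3) - true_f(x)) ** 2
              for x in x_test) / len(x_test)

print(f"test MSE, linear regression: {mse_linear:.4f}")
print(f"test MSE, KNN (K=3):         {mse_knn:.4f}")
```

Here the linear model matches the true form of f, so it comes out ahead; if f were strongly non-linear, the ranking could reverse.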






