ISLR Reading Notes (2): Linear Regression

Source: Internet | Published by: linux 启动 | Editor: 程序博客网 | Time: 2024/06/10 15:15


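These are reading notes. I do not plan to translate the whole book; instead I will share its key points together with my own understanding, and append the relevant R functions at the end, as a summary of what I have recently learned about machine learning. If I have misunderstood anything, corrections are welcome.

Preface

ISLR, in full An Introduction to Statistical Learning with Applications in R, is roughly the introductory version of The Elements of Statistical Learning. It contains few formula derivations and mainly explains commonly used statistical learning methods and how to apply them in R. The authors have not published official solutions to the exercises, but someone has already compiled a set that can be used for study and reference: ISLR solutions.

My understanding of Chapter 3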

Several common linear models

1) Simple linear regression

$$Y \approx \beta_0 + \beta_1 X$$

2) Multiple linear regression

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$$

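3) Extended linear regression (interaction terms)

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$

This removes the assumption of the multiple linear model that $X_1$ and $X_2$ have no synergy (interaction) effect, that is, that the effect of one predictor on Y does not depend on the value of the other.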
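
To see why the product term relaxes that assumption, the model can be regrouped around $X_1$. This is only a rearrangement of the formula above, using the same symbols:

```latex
\begin{aligned}
Y &\approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \\
  &= \beta_0 + (\beta_1 + \beta_3 X_2)\, X_1 + \beta_2 X_2
\end{aligned}
```

The slope of $X_1$ is now $\beta_1 + \beta_3 X_2$, so it changes with $X_2$. When $\beta_3 = 0$ the model reduces to the ordinary multiple linear regression.
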
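
4) Polynomial regression

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 \log(X_1) + \beta_4 \sqrt{X_1}$$

This lets a linear model fit nonlinear relationships: the model is nonlinear in $X_1$ but still linear in the coefficients.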

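Evaluation metrics for linear models

There are many formulas and all of them can be computed by software, so rather than deriving each one I go straight to interpreting the quantities reported for the book's model

$$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$$

1) F-statistic

The F-statistic tests whether sales is related to at least one of the predictors. If there is no relationship, F is expected to be close to 1. For a fixed sample size, the further F is above 1, the stronger the evidence that at least one predictor is related to sales. When F is only slightly above 1, whether it is significant depends on the number of observations and predictors, and can be checked against the F distribution (an F table or the associated p-value). Here the F-statistic is 570, so we conclude that a relationship exists.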
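
As a rough illustration of where this number comes from, the sketch below recomputes the F-statistic by hand as F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) and compares it with what `summary()` reports. It assumes the Boston data from the MASS package (the data set used in the R section of these notes) as a stand-in for the book's advertising data; the model `medv ~ lstat + age` and the names `fit`, `f_manual`, `f_pvalue` and `f_reported` are chosen only for this example.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ lstat + age, data = Boston)

n <- nrow(Boston)            # number of observations
p <- length(coef(fit)) - 1   # number of predictors (intercept excluded)

rss <- sum(residuals(fit)^2)                     # residual sum of squares
tss <- sum((Boston$medv - mean(Boston$medv))^2)  # total sum of squares

# F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
f_manual <- ((tss - rss) / p) / (rss / (n - p - 1))

# p-value from the F distribution, used instead of a printed F table
f_pvalue <- pf(f_manual, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# value reported by summary() for the same model
f_reported <- unname(summary(fit)$fstatistic["value"])

print(c(manual = f_manual, reported = f_reported, p_value = f_pvalue))
```

Using `pf()` answers the "is this F large enough?" question directly for the given n and p.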

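2) RSE

Short for Residual Standard Error. It is measured in the units of Y; the smaller the RSE, the more closely the model fits the training data.

3) $R^2$

$R^2$ lies between 0 and 1. The larger $R^2$ is, the larger the share of the variance in Y that the model explains.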
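
Both quantities follow from the residual and total sums of squares. The sketch below recomputes them by hand, assuming the Boston data from the MASS package and the simple model `medv ~ lstat`; the variable names are illustrative.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ lstat, data = Boston)

rss <- sum(residuals(fit)^2)                     # residual sum of squares
tss <- sum((Boston$medv - mean(Boston$medv))^2)  # total sum of squares

# RSE = sqrt(RSS / residual degrees of freedom), with df = n - p - 1
rse_manual <- sqrt(rss / df.residual(fit))

# R^2 = 1 - RSS / TSS
r2_manual <- 1 - rss / tss

print(c(RSE = rse_manual, R2 = r2_manual))
```

The same values are stored by `summary(fit)` in its `sigma` and `r.squared` components.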

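4) p-value

The smaller the p-value of a coefficient, the stronger the evidence that the corresponding X is related to Y, given the other predictors in the model. In this model the newspaper variable shows no significant relationship with sales, so it is a candidate for removal when the model is refined.

5) t-statistic

The t-statistic can be positive or negative, so compare absolute values. A large absolute value indicates that the variable X is related to Y. Here the t-statistic of newspaper is -0.18, so again newspaper shows no significant relationship with sales.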
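
The t-statistic and the p-value are two views of the same test: t is the estimate divided by its standard error, and the p-value is the two-sided tail probability of that t under a t distribution. A minimal sketch, assuming the Boston data from MASS and an illustrative model `medv ~ lstat + age`:

```r
library(MASS)  # provides the Boston data set

fit   <- lm(medv ~ lstat + age, data = Boston)
coefs <- summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t-statistic = estimate / standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# two-sided p-value from the t distribution with n - p - 1 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_manual, p_manual))
```

Because the p-value uses `abs(t_manual)`, the sign of t does not matter, which is why absolute values are compared.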

6) Std. error

The standard error is used to compute confidence intervals. The commonly used 95% confidence interval is approximately

$$[\text{coefficient} - 2 \times \text{Std. error},\ \text{coefficient} + 2 \times \text{Std. error}]$$
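
The factor 2 is a rule of thumb. The exact interval replaces it with the 97.5% quantile of the t distribution, which is what `confint()` uses for an `lm` fit. The sketch below compares the two, assuming the Boston data from MASS and the model `medv ~ lstat`; the names `ci_approx`, `ci_exact` and `t_crit` are illustrative.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ lstat, data = Boston)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))  # standard errors of the coefficients

# rule of thumb: estimate +/- 2 * standard error
ci_approx <- cbind(lower = est - 2 * se, upper = est + 2 * se)

# exact: use the 97.5% quantile of the t distribution instead of 2
t_crit   <- qt(0.975, df = df.residual(fit))
ci_exact <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(ci_approx)
print(ci_exact)
print(confint(fit))  # should agree with ci_exact
```

With about 500 residual degrees of freedom the quantile is slightly below 2, so the rule-of-thumb interval is marginally wider than the exact one.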

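7) Studentized residuals

Studentized residuals are used to detect outliers in the data. As a rule of thumb, an observation whose studentized residual exceeds 3 in absolute value is treated as a possible outlier and needs to be handled, for example by discarding it.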
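
A minimal sketch of this check, assuming the Boston data from MASS and the model `medv ~ lstat`. The cutoff of 3 is the rule of thumb stated above, and refitting without the flagged rows is only one possible way of handling them; the names `stud_res`, `flagged` and `fit_trimmed` are illustrative.

```r
library(MASS)  # provides the Boston data set

fit <- lm(medv ~ lstat, data = Boston)

stud_res <- rstudent(fit)             # studentized residuals, one per observation
is_out   <- abs(stud_res) > 3         # TRUE for observations beyond the cutoff
flagged  <- which(is_out)             # their row positions

print(length(flagged))
print(head(Boston[flagged, c("lstat", "medv")]))

# refit without the flagged rows to see how far the coefficients move
fit_trimmed <- lm(medv ~ lstat, data = Boston[!is_out, ])
print(rbind(all_rows = coef(fit), trimmed = coef(fit_trimmed)))
```

Comparing the two rows of coefficients gives a quick sense of how much influence the flagged observations have before deciding whether to drop them.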

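8) VIF

The VIF (variance inflation factor) detects whether a predictor is linearly related to the other predictors (collinearity). Its smallest possible value is 1; the larger the VIF, the more strongly the variable is linearly related to the others. For a variable with a large VIF, consider dropping it or combining it with the related variables into a single new variable.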
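
For a model with only numeric main-effect terms, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is the R^2 from regressing predictor j on all the other predictors. The sketch below computes it in base R without extra packages. It assumes the Boston data from MASS with `medv` as the response and `age` left out, as in the R section of these notes; `manual_vif` is a hypothetical helper written for this example, not a library function.

```r
library(MASS)  # provides the Boston data set

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on all the other predictors
manual_vif <- function(data, response) {
  predictors <- setdiff(names(data), response)
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# all predictors except age, response medv
vifs <- manual_vif(Boston[, names(Boston) != "age"], response = "medv")
print(round(vifs, 3))
```

A predictor that the others explain well has R_j^2 close to 1 and therefore a large VIF.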

Applications in R

0. Loading the data

> library(MASS)
> library(ISLR)
> library(car)
> fix(Boston)
> names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"     "rad"
[10] "tax"     "ptratio" "black"   "lstat"   "medv"
> attach(Boston)

1. Simple linear regression

a. Basic functions

> lm.fit = lm(medv~lstat)
> summary(lm.fit)

Call:
lm(formula = medv ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max
-15.168  -3.990  -1.318   2.034  24.500

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

b. Confidence intervals

> confint(lm.fit)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505

c. Plotting

> plot(lstat, medv)
> abline(lm.fit,lwd=3, col="green")

The lwd argument sets the line width.

d. Outliers

> plot(predict(lm.fit), rstudent(lm.fit))

The plot shows that some observations have studentized residuals above 3; these are possible outliers.

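2. Multiple linear regression

a. Basic functions

> lm.fit1 = lm(medv~lstat+age)             # option 1: name the predictors explicitly
> lm.fit2 = lm(medv~., data = Boston)      # option 2: use every other column as a predictor
> lm.fit3 = lm(medv~.-age, data = Boston)  # option 3: every other column except age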

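b. Computing the VIF (the other functions are used exactly as above)

> vif(lm.fit3)
    crim       zn    indus     chas      nox       rm      dis      rad      tax  ptratio    black
1.792172 2.265290 3.991592 1.071227 4.084846 1.851068 3.620246 7.441492 9.000474 1.788084 1.343044
   lstat
2.599229

Most VIFs are low, so most predictors are not strongly linearly related to the others. The exceptions are rad (7.44) and tax (9.00): both exceed the common rule-of-thumb threshold of 5, although they stay below 10, which points to collinearity involving these two variables.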
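
The sketch below applies the two usual rule-of-thumb cutoffs (5 and 10) to the VIF values printed above and then checks the pairwise correlation of the two predictors with the largest VIFs. The VIF numbers are copied from the output above; the Boston data comes from the MASS package, and the variable names are illustrative.

```r
library(MASS)  # provides the Boston data set

# VIF values copied from the vif(lm.fit3) output above
vif_values <- c(crim = 1.792172, zn = 2.265290, indus = 3.991592,
                chas = 1.071227, nox = 4.084846, rm = 1.851068,
                dis = 3.620246, rad = 7.441492, tax = 9.000474,
                ptratio = 1.788084, black = 1.343044, lstat = 2.599229)

# cutoffs of 5 and 10 are rules of thumb, not hard limits
above_5  <- names(vif_values)[vif_values > 5]
above_10 <- names(vif_values)[vif_values > 10]
print(above_5)
print(above_10)

# pairwise correlation between the two predictors with the largest VIFs
rad_tax_cor <- cor(Boston$rad, Boston$tax)
print(rad_tax_cor)
```

A high correlation between rad and tax would suggest that these two predictors largely carry the same information, which is one situation where dropping or combining variables is worth considering.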

3. Extended linear regression (interaction terms)

> lm.fit4 = lm(medv~lstat*age)
> summary(lm.fit4)

Call:
lm(formula = medv ~ lstat * age)

Residuals:
    Min      1Q  Median      3Q     Max
-15.806  -4.045  -1.333   2.085  27.552

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711
lstat:age    0.0041560  0.0018518   2.244   0.0252 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
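
In an R formula, `lstat*age` is shorthand for `lstat + age + lstat:age`, where `lstat:age` is the product term. The sketch below checks that both spellings give the same fit and shows how the slope of lstat then depends on age. It passes `data = Boston` instead of relying on `attach()`; the names `fit_star`, `fit_expanded` and `slope_lstat_at` are illustrative.

```r
library(MASS)  # provides the Boston data set

fit_star     <- lm(medv ~ lstat * age, data = Boston)
fit_expanded <- lm(medv ~ lstat + age + lstat:age, data = Boston)

# both formulas produce the same four coefficients
print(names(coef(fit_star)))

# with an interaction, the slope of lstat depends on age:
# slope = beta_lstat + beta_interaction * age
b <- coef(fit_star)
slope_lstat_at <- function(age) unname(b["lstat"] + b["lstat:age"] * age)
print(slope_lstat_at(c(20, 50, 80)))
```

Because the interaction coefficient is positive here, the negative slope of lstat becomes less steep as age increases.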

4. Polynomial regression

a. Quadratic polynomial regression

> lm.fit2 = lm(medv~lstat+I(lstat^2))
> summary(lm.fit2)

Call:
lm(formula = medv ~ lstat + I(lstat^2))

Residuals:
     Min       1Q   Median       3Q      Max
-15.2834  -3.8313  -0.5295   2.3095  25.4148

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
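
One way to judge whether the quadratic term is worth keeping is to compare the linear and quadratic fits with `anova()`, which runs an F-test on the two nested models. This is a sketch under the same assumptions as above (Boston data from MASS, `data = Boston` instead of `attach()`); `fit_linear`, `fit_quadratic` and `cmp` are illustrative names.

```r
library(MASS)  # provides the Boston data set

fit_linear    <- lm(medv ~ lstat, data = Boston)
fit_quadratic <- lm(medv ~ lstat + I(lstat^2), data = Boston)

# F-test of H0: the quadratic term does not improve the fit
cmp <- anova(fit_linear, fit_quadratic)
print(cmp)

# R^2 of both models side by side
print(c(linear    = summary(fit_linear)$r.squared,
        quadratic = summary(fit_quadratic)$r.squared))
```

Since only one term is added, the F-statistic of this comparison equals the square of the t-statistic of `I(lstat^2)` in the quadratic model.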

b. Higher-order polynomial regression

> lm.fit5 = lm(medv~poly(lstat, 5))
> summary(lm.fit5)

Call:
lm(formula = medv ~ poly(lstat, 5))

Residuals:
     Min       1Q   Median       3Q      Max
-13.5433  -3.1039  -0.7052   2.0844  27.1153

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared:  0.6817,    Adjusted R-squared:  0.6785
F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
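
By default `poly()` builds orthogonal polynomials, so the coefficients above are not the coefficients of lstat, lstat^2 and so on. Passing `raw = TRUE` uses the plain powers instead. The sketch below fits both versions and compares their fitted values; it assumes the Boston data from MASS, and `fit_orth`, `fit_raw` and `max_diff` are illustrative names.

```r
library(MASS)  # provides the Boston data set

fit_orth <- lm(medv ~ poly(lstat, 5), data = Boston)              # orthogonal basis (default)
fit_raw  <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = Boston)  # plain powers of lstat

# the two bases describe the same curve, so the fitted values should agree
max_diff <- max(abs(fitted(fit_orth) - fitted(fit_raw)))
print(max_diff)

# the coefficients differ because the basis differs
print(rbind(orthogonal = unname(coef(fit_orth)),
            raw        = unname(coef(fit_raw))))
```

The choice of basis changes how the individual coefficients read, not the fitted curve or the overall R^2.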

0 0