An Introduction to Statistical Learning, Chapter 3: Linear Regression

Source: Internet; Published: online; Editor: 程序博客网; Time: 2024/06/07 06:04

Book: An Introduction to Statistical Learning
with Applications in R
http://www-bcf.usc.edu/~gareth/ISL/

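This chapter covers linear regression, an old and classic method.
We start with a plot of the advertising budget and sales data.
[Figure: advertising budgets versus sales]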

3.1 Simple Linear Regression
Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable (single-variable analysis).

For example, we can regress sales onto TV by fitting the model
$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}$$

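Once we have fit the model to the training data and obtained the model coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, we can predict future sales with
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \qquad (3.2)$$
where $\hat{y}$ is the prediction of Y given X = x.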

3.1.1 Estimating the Coefficients
How do we estimate these model coefficients?
We have n training observations
$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

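We define the residual sum of squares (RSS) as
$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2, \qquad e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \qquad (3.3)$$
RSS serves as the loss function. The least squares method chooses the coefficients that minimize it, which gives the coefficient estimates

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (3.4)$$
where $\bar{x}$ and $\bar{y}$ are the sample means.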
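The closed-form least squares estimates can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names and the five data points are made up for the example.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta0_hat, beta1_hat) that minimize the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    # Intercept: forces the fitted line through (x_bar, y_bar).
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def residual_sum_of_squares(x, y, beta0, beta1):
    """RSS of the line y = beta0 + beta1 * x on the given data."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (not the Advertising data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(x, y)
print("beta0_hat =", b0, "beta1_hat =", b1)
print("RSS at the fit =", residual_sum_of_squares(x, y, b0, b1))
```

Any other line, for example one with a slightly different slope, gives a larger RSS on the same data, which is what "least squares" means.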

3.1.2 Assessing the Accuracy of the Coefficient Estimates
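How accurate are the coefficient estimates?
Assume the true relationship between X and Y takes the form
$$Y = f(X) + \epsilon$$
where $\epsilon$ is a mean-zero random error term. If f is approximated by a linear function, the relationship can be written as
$$Y = \beta_0 + \beta_1 X + \epsilon \qquad (3.5)$$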

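The model in (3.5) defines the population regression line, which is the best linear approximation to the true relationship between X and Y.
The line fitted with the least squares coefficient estimates (3.4) is called the least squares line (3.2).
[Figure: the population regression line and least squares lines fitted to simulated data sets]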
Notice that different data sets generated from the same true model result in slightly different least squares lines, but the unobserved
population regression line does not change.

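At first glance, the difference between the population regression line and the least squares line may seem subtle and confusing. We have only one data set, so what does it mean for two different lines to describe the relationship between the predictor and the response?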
Fundamentally, the concept of these two lines is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population.
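For example, suppose we are interested in the population mean $\mu$ of a random variable Y. $\mu$ is unknown, but we have n observations of Y, $y_1, \ldots, y_n$, which we can use to estimate it. A reasonable estimate is $\hat{\mu} = \bar{y}$, where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the sample mean.
The sample mean and the population mean are different, but in general the sample mean provides a good estimate of the population mean. In the same way, the unknown coefficients $\beta_0$ and $\beta_1$ in linear regression define the population regression line. We estimate them with (3.4), and these estimates define the least squares line.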

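Linear regression and the estimation of a mean share a concept: bias. If we use the sample mean $\hat{\mu}$ to estimate $\mu$, the estimate is unbiased, in the sense that on average we expect $\hat{\mu}$ to equal $\mu$. What exactly does this mean? For one particular set of observations $\hat{\mu}$ might overestimate $\mu$, and for another set it might underestimate $\mu$. But if we could average a huge number of estimates of $\mu$ obtained from a huge number of sets of observations, that average would exactly equal $\mu$. Hence an unbiased estimator does not systematically over- or under-estimate the true parameter. Unbiasedness also holds for the least squares coefficient estimates given by (3.4): if we estimate $\beta_0$ and $\beta_1$ on one particular data set, the estimates will not be exactly equal to $\beta_0$ and $\beta_1$, but if we could average the estimates obtained over a huge number of data sets, the average would match the true values.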
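Unbiasedness can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the true coefficients (2 and 3), the noise level, the sample size, and the number of repetitions are all arbitrary choices, not values from the book.

```python
import random


def least_squares(x, y):
    """Closed-form simple linear regression estimates (beta0_hat, beta1_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


random.seed(0)  # fixed seed so the run is reproducible
TRUE_BETA0, TRUE_BETA1 = 2.0, 3.0  # assumed "population" coefficients


def simulate_estimate(n=50):
    """Draw one data set from Y = 2 + 3X + noise and fit it by least squares."""
    x = [random.uniform(-2.0, 2.0) for _ in range(n)]
    y = [TRUE_BETA0 + TRUE_BETA1 * xi + random.gauss(0.0, 2.0) for xi in x]
    return least_squares(x, y)


estimates = [simulate_estimate() for _ in range(2000)]
mean_b0 = sum(b0 for b0, _ in estimates) / len(estimates)
mean_b1 = sum(b1 for _, b1 in estimates) / len(estimates)
slopes = [b1 for _, b1 in estimates]

print("single-data-set slopes range from", min(slopes), "to", max(slopes))
print("average intercept:", mean_b0, "average slope:", mean_b1)
```

Individual fits land on both sides of the true slope, while the averages over the 2000 data sets sit close to the assumed true values.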

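Let us continue with estimating the mean $\mu$ of a random variable Y. A natural question is how accurate the sample mean $\hat{\mu}$ is as an estimate of $\mu$. We know that the average of $\hat{\mu}$ over many data sets is very close to $\mu$, but a single estimate $\hat{\mu}$ may fall below or above it. How far off will that single estimate be? We answer this by computing the standard error of $\hat{\mu}$, written $\mathrm{SE}(\hat{\mu})$, which is given by
$$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n} \qquad (3.7)$$
where $\sigma$ is the standard deviation of each of the observations $y_i$ of Y.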
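Roughly speaking, the standard error tells us the average amount by which the estimate $\hat{\mu}$ differs from the true value $\mu$. The formula also shows how this deviation shrinks as the number of observations n grows. In the same way, we can ask how close $\hat{\beta}_0$ and $\hat{\beta}_1$ are to their true values. Their standard errors are
$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (3.8)$$
where $\sigma^2 = \mathrm{Var}(\epsilon)$.
For these formulas to be strictly valid, the errors $\epsilon_i$ of the observations must be uncorrelated with common variance $\sigma^2$. Real data often do not satisfy this condition, but the formulas still give a good approximation.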
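Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that, with 95% probability, the range contains the true unknown value of the parameter. For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form
$$\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$$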
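The standard errors and the approximate 95% interval can be computed directly from the data. The sketch below makes one assumption explicit: $\sigma$ is unknown in practice, so it is replaced by the residual standard error $\sqrt{\mathrm{RSS}/(n-2)}$. The function name and the data are made up for illustration.

```python
import math


def simple_regression_inference(x, y):
    """Fit y = b0 + b1*x and return estimates, standard errors, and a 95% CI for b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0 = y_bar - beta1 * x_bar

    # sigma is unknown, so estimate it with the residual standard error.
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))

    se_beta0 = sigma_hat * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_beta1 = sigma_hat / math.sqrt(sxx)

    # Approximate 95% interval: estimate plus or minus two standard errors.
    ci_beta1 = (beta1 - 2.0 * se_beta1, beta1 + 2.0 * se_beta1)
    return {
        "beta0": beta0,
        "beta1": beta1,
        "se_beta0": se_beta0,
        "se_beta1": se_beta1,
        "ci_beta1": ci_beta1,
    }


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = simple_regression_inference(x, y)
print(result)
```

The factor 2 is the rounded multiplier used in the interval above; an exact interval would use the 97.5% quantile of the t-distribution with n−2 degrees of freedom instead.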

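Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis of
$$H_0: \beta_1 = 0$$
against the alternative $H_a: \beta_1 \neq 0$, because if $\beta_1 = 0$ there is no association between X and Y. To test the null hypothesis, we need to determine whether our estimate $\hat{\beta}_1$ is far enough from 0 that we can be confident $\beta_1$ is non-zero. How far is far enough? This depends on the accuracy of $\hat{\beta}_1$, that is, on $\mathrm{SE}(\hat{\beta}_1)$. If $\mathrm{SE}(\hat{\beta}_1)$ is small, then even a relatively small value of $\hat{\beta}_1$ provides strong evidence that $\beta_1 \neq 0$, and hence that X and Y are associated. Conversely, if $\mathrm{SE}(\hat{\beta}_1)$ is large, then $\hat{\beta}_1$ must be large in absolute value for us to reject the null hypothesis. In practice, we compute the t-statistic
$$t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)} \qquad (3.14)$$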
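The t-statistic measures the number of standard deviations that $\hat{\beta}_1$ is away from 0. If there is no relationship between X and Y, we expect it to follow a t-distribution with n−2 degrees of freedom.
The t-distribution has a bell shape, and for n greater than about 30 it is very similar to the normal distribution. It is therefore straightforward to compute, assuming $\beta_1 = 0$, the probability of observing any value equal to $|t|$ or larger in absolute value. We call this probability the p-value. Roughly speaking, we interpret it as follows: a small p-value indicates that an association this strong between the predictor and the response is unlikely to be observed by chance alone. So when we see a small p-value, we can infer that there is an association between the predictor and the response, and we reject the null hypothesis. Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%. When n = 30, these correspond to t-statistics (3.14) of around 2 and 2.75, respectively.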
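To make the p-value concrete, the sketch below computes a two-sided p-value by numerically integrating the t-distribution density with Simpson's rule. This is a stdlib-only illustration with made-up function names; a statistics library would normally be used instead. It checks the cutoffs quoted above for n = 30, that is, 28 degrees of freedom.

```python
import math


def t_statistic(beta1_hat, se_beta1):
    """Number of standard errors that the estimate lies away from zero."""
    return (beta1_hat - 0.0) / se_beta1


def t_density(t, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    # math.gamma overflows for very large arguments, so keep df moderate (a few hundred at most).
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1.0 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) under the null, via Simpson's rule on [0, |t_stat|]."""
    a = abs(t_stat)
    if a == 0.0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    half_mass = total * h / 3.0  # area between 0 and |t_stat|
    return max(0.0, 1.0 - 2.0 * half_mass)  # density is symmetric around 0


df = 30 - 2
print("p-value for t = 2.048:", two_sided_p_value(2.048, df))
print("p-value for t = 2.763:", two_sided_p_value(2.763, df))
```

With 28 degrees of freedom, t-statistics near 2.05 and 2.76 give p-values near 5% and 1%, consistent with the "around 2 and 2.75" rule of thumb.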

3.1.3 Assessing the Accuracy of the Model
How do we assess how well the model fits the data? The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.
Residual Standard Error
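$$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\mathrm{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \qquad (3.15)$$
The RSE is an estimate of the standard deviation of $\epsilon$. Roughly speaking, it is the average amount that the response will deviate from the true regression line.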

$R^2$ Statistic
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (3.17)$$
where $\mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares.
$R^2$ measures the proportion of variability in Y that can be explained using X. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error $\sigma^2$ is high, or both.
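Both quantities follow directly from the observed and fitted responses. A minimal sketch for simple linear regression, with a made-up function name and made-up data (the fitted values come from the line 0.05 + 1.99x):

```python
import math


def rse_and_r_squared(y, y_hat):
    """Residual standard error and R^2 for a simple (one-predictor) linear fit."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    rse = math.sqrt(rss / (n - 2))  # n - 2: two coefficients were estimated
    r_squared = 1.0 - rss / tss
    return rse, r_squared


# Made-up observations and their fitted values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [0.05 + 1.99 * xi for xi in x]

rse, r2 = rse_and_r_squared(y, y_hat)
print("RSE =", rse, "R^2 =", r2)
```

The RSE is in the units of Y, so whether it is "small" depends on the problem, while $R^2$ is a proportion and is easier to compare across problems.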

3.2 Multiple Linear Regression
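Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. In practice, however, we often have more than one predictor, and the multiple linear regression model takes the form
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$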

3.2.1 Estimating the Regression Coefficients
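With multiple predictors, the coefficients are still estimated with the least squares approach. The only difference is that matrix notation is needed to express the estimates concisely, and the derivation is most naturally written in that form.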
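For reference, the matrix form of the least squares estimate can be sketched as follows. The notation is an assumption of this sketch rather than something taken from the post: $\mathbf{X}$ is the $n \times (p+1)$ design matrix whose first column is all ones, $\mathbf{y}$ is the response vector, and $\mathbf{X}^\top\mathbf{X}$ is assumed to be invertible.

```latex
\begin{aligned}
\mathrm{RSS}(\beta) &= (\mathbf{y} - \mathbf{X}\beta)^\top (\mathbf{y} - \mathbf{X}\beta) \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta} &= -2\,\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\beta) = 0 \\
\mathbf{X}^\top \mathbf{X}\,\hat{\beta} &= \mathbf{X}^\top \mathbf{y} \\
\hat{\beta} &= (\mathbf{X}^\top \mathbf{X})^{-1}\,\mathbf{X}^\top \mathbf{y}
\end{aligned}
```

With a single predictor, solving these normal equations gives the same slope and intercept as the closed-form simple regression estimates.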
When we perform multiple linear regression, we are mainly interested in the following four questions:
1. Is at least one of the predictors $X_1, X_2, \ldots, X_p$ useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
The rest of the discussion is organized around these four questions.

3.3 Other Considerations in the Regression Model
3.3.1 Qualitative Predictors
Here the predictors are qualitative rather than quantitative.
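A common way to put a two-level qualitative predictor into a regression is to code it as a 0/1 dummy variable. The sketch below uses a hypothetical predictor with levels "no" and "yes" and made-up responses; the helper names are illustrative only.

```python
def encode_dummy(levels, baseline):
    """Map a two-level qualitative predictor to 0 (baseline) or 1 (other level)."""
    return [0.0 if level == baseline else 1.0 for level in levels]


def least_squares(x, y):
    """Closed-form simple linear regression estimates (beta0_hat, beta1_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


# Hypothetical data: a yes/no predictor and a numeric response.
group = ["no", "yes", "no", "yes", "yes", "no"]
y = [10.0, 14.0, 12.0, 15.0, 16.0, 11.0]

x = encode_dummy(group, baseline="no")
beta0, beta1 = least_squares(x, y)

mean_no = sum(yi for g, yi in zip(group, y) if g == "no") / group.count("no")
mean_yes = sum(yi for g, yi in zip(group, y) if g == "yes") / group.count("yes")
print("beta0 =", beta0, "(mean of 'no' group:", mean_no, ")")
print("beta0 + beta1 =", beta0 + beta1, "(mean of 'yes' group:", mean_yes, ")")
```

With this coding the intercept is the average response in the baseline group, and the slope is the difference between the two group averages.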

3.3.2 Extensions of the Linear Model
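The linear model makes two assumptions: the relationship between the predictors and the response is additive and linear. In practice these assumptions are sometimes violated,
so we sometimes need to relax them: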
Removing the Additive Assumption
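A standard way to relax the additive assumption is to add an interaction term. A sketch for two predictors, with the symbols chosen here for illustration:

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon \\
  &= \beta_0 + (\beta_1 + \beta_3 X_2)\,X_1 + \beta_2 X_2 + \epsilon
\end{aligned}
```

In the second line the effect of $X_1$ on Y is $\beta_1 + \beta_3 X_2$, so it depends on the value of $X_2$ instead of being constant. The model is still linear in the coefficients, so it can be fit by least squares with $X_1 X_2$ treated as an extra predictor.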

Non-Linear Relationships
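One simple way to capture a non-linear relationship while staying inside the linear model is polynomial regression: powers of a predictor are added as extra columns. The sketch below fits a quadratic by solving the normal equations with a small Gaussian elimination routine. The helper names and the noise-free data are made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ sol = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - s) / m[r][r]
    return sol


def fit_least_squares(design, y):
    """Least squares coefficients from the normal equations (X^T X) beta = X^T y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up, noise-free data generated from y = 1 + 2x + 0.5x^2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * xi + 0.5 * xi ** 2 for xi in x]

# Columns: intercept, x, x^2. The model is non-linear in x but linear in the coefficients.
design = [[1.0, xi, xi ** 2] for xi in x]
coef = fit_least_squares(design, y)
print("fitted coefficients:", coef)
```

Because the data here contain no noise, the fit recovers the generating coefficients up to rounding error; with real data the estimates would only approximate them.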

3.3.3 Potential Problems
Problems that may arise when fitting a linear regression model:
1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.
The book analyzes these only briefly; they are not its main focus.
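As one concrete example from this list, high-leverage points in simple linear regression can be found with the leverage statistic $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2}$. The sketch below uses made-up data, and the flagging threshold of twice the average leverage is an arbitrary illustrative choice, not a rule from the post.

```python
def leverage_statistics(x):
    """Leverage h_i of each observation in a simple linear regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Made-up predictor values; the last one lies far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverage_statistics(x)

# With one predictor the leverages sum to 2, so the average is 2 / n.
average_leverage = 2.0 / len(x)
# Illustrative threshold (an assumption): flag points above twice the average.
flagged = [i for i, hi in enumerate(h) if hi > 2.0 * average_leverage]

print("leverages:", h)
print("flagged observation indices:", flagged)
```

A flagged point is not necessarily wrong, but it has an unusually large influence on the fitted line and deserves a closer look.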
