Essential Statistics for Learning Machine Learning (Part 2)

Source: Internet | Editor: 程序博客网 | Date: 2024/05/15 14:26

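Introduction

Machine learning applications cannot do without data. Without data, a machine learning algorithm is like a body without a soul. The better we understand our data, the better we can apply it to machine learning. In this article I introduce some important statistical concepts for understanding data, so that you can work with data more accurately and with more confidence.

Note: this article involves many terms and definitions. I give them in their standard English form because that is easier to follow; translated terms tend to cause more confusion.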

Populations and Parameters

A population is any large collection of objects or individuals, such as Americans, students, or trees about which information is desired.

A parameter is any summary number, like an average or percentage, that describes the entire population.

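Below are two examples that illustrate populations and parameters.

We want to know the average weight (μ) of all men in China. Here the population is all Chinese men, and the parameter is the average weight.
We want to know the proportion (p) of all Chinese college students who smoke. Here the population is all Chinese college students, and the parameter is the smoking proportion.
Unfortunately, we can almost never know a population parameter exactly. In the first example, we cannot possibly weigh every man in China and then take the average. So we can only estimate the population parameter.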

Samples and statistics

A sample is a representative group drawn from the population.

A statistic is any summary number, like an average or percentage, that describes the sample.

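Let us continue with the examples above.

This time we select only a representative group of 100 Chinese men and compute their average weight x̄, which we use to estimate μ.
This time we select only a representative group of 100 college students and compute the proportion of them who smoke (p̂), which we use to estimate p.
The 100 college students above are a sample, and the computed p̂ is a statistic of that sample.
Because the sample size is under our control, we can compute any statistic of the sample, and then use that sample statistic to estimate the unknown population parameter.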
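
As a rough illustration of estimating a population parameter from a sample statistic, the sketch below simulates a population and draws a sample of 100 from it. The population here is entirely made up (normally distributed weights with mean 70 and standard deviation 10 are assumptions for the demo, not real data); in practice the population mean would be unknown.

```r
set.seed(42)  # fixed seed so the run is repeatable

# Hypothetical "population": one million simulated weights (invented numbers)
population <- rnorm(1e6, mean = 70, sd = 10)
mu <- mean(population)            # population parameter (normally unknown)

# Draw a simple random sample of 100 individuals
sample_100 <- sample(population, size = 100)
x_bar <- mean(sample_100)         # sample statistic used to estimate mu

print(c(mu = mu, x_bar = x_bar))
```

The two numbers will generally be close but not identical, which is exactly why the estimate needs to be accompanied by a measure of uncertainty.
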
There are two common ways to draw conclusions about a population parameter: confidence intervals and hypothesis tests. I introduce each of them below.

t-based Confidence Interval for the Mean

We can use a t-interval to estimate the population mean μ. Its definition is as follows:

When the population standard deviation σ is not known, an interval estimate for the population mean μ with confidence level 1−α is given by :

$\bar{x} \pm t_{\alpha/2,\,n-1} \left( \frac{s}{\sqrt{n}} \right)$

t_{α/2, n−1}: this value depends on the sample size n through the degrees of freedom n − 1, and on the confidence level (1 − α) × 100% through α/2.
s/√n: this quantity is called the "standard error". It is the estimated standard deviation of all possible sample means.
Clearly, the sample mean x̄, the sample standard deviation s, and the sample size n are all easy to obtain from the sample data. All that remains is to find t_{α/2, n−1}.
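
The formula above can be sketched in R roughly as follows. The helper name `t_interval` and the sample vector `x` are hypothetical, made up for illustration; `qt()`, `sd()` and `t.test()` are standard functions from base R's stats package.

```r
# Hypothetical sample, invented purely for illustration
x <- c(3.2, 3.9, 3.5, 3.7, 3.4, 3.6, 3.3, 3.8)

# t-based confidence interval for the mean (sigma unknown)
t_interval <- function(x, conf_level = 0.95) {
  n      <- length(x)
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # t_{alpha/2, n-1}
  se     <- sd(x) / sqrt(n)                # standard error s / sqrt(n)
  c(lower = mean(x) - t_crit * se,
    upper = mean(x) + t_crit * se)
}

ci <- t_interval(x, conf_level = 0.90)
print(ci)

# Cross-check against R's built-in t.test()
print(t.test(x, conf.level = 0.90)$conf.int)
```

Both calls should report the same interval, since `t.test()` uses the same formula internally for a one-sample interval.
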

To find the t value, we can look it up in a T-Table or use statistical software. Either way, we must first supply the degrees of freedom and α/2.

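Suppose we set the confidence level to 90%, so α/2 is 0.05, and suppose our sample size is 15, so the degrees of freedom are 15 − 1 = 14. Looking this up in the T-Table gives t_{0.05, 14} = 1.761. Given sample data, we could now compute the confidence interval. I do not give a data set here. Suppose the resulting interval is (3.43, 3.68); this means we are 90% confident that the population mean lies in this interval.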
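
Instead of a printed table, the same lookup can be done in software. A minimal sketch in R, using the confidence level and sample size from the paragraph above (the variable names are my own):

```r
conf_level <- 0.90
n          <- 15
alpha      <- 1 - conf_level

# qt() takes the lower-tail probability, so the upper alpha/2 point is qt(1 - alpha/2, ...)
t_crit <- qt(1 - alpha / 2, df = n - 1)
print(round(t_crit, 3))  # 1.761
```
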

Factors That Affect the Width of the t-interval

Rearranging the formula above gives the width of the interval:

$\text{Width} = 2 \times t_{\alpha/2,\,n-1} \left( \frac{s}{\sqrt{n}} \right)$

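From this formula we can identify the factors that affect the width.

As the sample mean increases, the width stays the same. In other words, the sample mean does not affect the width of the interval.
As the sample standard deviation s decreases, the width of the interval decreases.
As we lower the confidence level, the t value decreases, so the width of the interval decreases.
As we increase the sample size, the width of the interval decreases. This is the factor we can control most easily; the only cost is our time and money.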
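
The effect of each factor can be checked numerically. The sketch below evaluates the width formula for a few settings; the helper `t_width` and all the input values (s = 10, n = 25, and so on) are hypothetical choices for the demonstration.

```r
# Width of a t-interval: 2 * t_{alpha/2, n-1} * s / sqrt(n)
t_width <- function(s, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  2 * qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

# Larger sample size -> narrower interval (s and confidence level fixed)
widths_by_n <- sapply(c(10, 25, 100, 400), function(n) t_width(s = 10, n = n))

# Lower confidence level -> narrower interval (s and n fixed)
widths_by_conf <- sapply(c(0.99, 0.95, 0.90),
                         function(cl) t_width(s = 10, n = 25, conf_level = cl))

# Smaller sample standard deviation -> narrower interval
widths_by_s <- sapply(c(20, 10, 5), function(s) t_width(s = s, n = 25))

print(round(widths_by_n, 2))
print(round(widths_by_conf, 2))
print(round(widths_by_s, 2))
```

Each printed vector should be strictly decreasing. Note that the sample mean does not appear anywhere in `t_width`, which matches the first point in the list.
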

Hypothesis Testing

Hypothesis testing generally consists of the following three steps:

Making an initial assumption
Collecting evidence (data).
Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.

The two types of error in hypothesis testing:

Type I error: The null hypothesis is rejected when it is true.

Type II error: The null hypothesis is not rejected when it is false.

There are two approaches to hypothesis testing: the critical value approach and the P-value approach. I introduce each of them below.

Hypothesis Testing (Critical value approach)

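The critical value approach compares the observed test statistic with the critical value. If the test statistic is more extreme than the critical value, the null hypothesis is rejected. If the test statistic is not as extreme as the critical value, the null hypothesis is not rejected.

In hypothesis testing, the probability of making a Type I error is called the significance level, denoted α.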
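
To make the link between α and the Type I error concrete, here is a small simulation sketch. It repeatedly draws samples from a population where the null hypothesis is true by construction and counts how often a test at α = 0.05 rejects it anyway. All settings (normal data, mean 170, standard deviation 10, n = 25, 2000 repetitions) are assumptions chosen for the demo.

```r
set.seed(1)        # fixed seed so the run is repeatable
alpha  <- 0.05
n      <- 25
mu0    <- 170      # H0: mu = mu0 is TRUE here, because data are generated with this mean
n_sims <- 2000

rejected <- replicate(n_sims, {
  x <- rnorm(n, mean = mu0, sd = 10)   # simulated sample; sd = 10 is arbitrary
  t.test(x, mu = mu0, alternative = "greater")$p.value <= alpha
})

# Proportion of simulations in which a true H0 was rejected (Type I errors)
type1_rate <- mean(rejected)
print(type1_rate)
```

The printed proportion should land near 0.05, up to simulation noise.
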

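Any hypothesis test using the critical value approach consists of the following four steps:

Specify the null and alternative hypotheses.
Assuming the null hypothesis is true, compute the test statistic from the sample data. If the hypothesis test concerns the population mean μ, the test statistic is $t^* = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where μ is the value stated in the null hypothesis.
Find the critical value.
Compare the test statistic with the critical value.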
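
For a test about the population mean, "more extreme than the critical value" in the last step depends on the direction of the alternative hypothesis. The decision rules can be summarized as below; the symbol $\mu_0$ for the hypothesized value is notation I am assuming here, since the text above writes it simply as μ.

```latex
\begin{aligned}
\text{Right-tailed } (H_1: \mu > \mu_0) &: \quad \text{reject } H_0 \text{ if } t^* > t_{\alpha,\,n-1} \\
\text{Left-tailed } (H_1: \mu < \mu_0)  &: \quad \text{reject } H_0 \text{ if } t^* < -t_{\alpha,\,n-1} \\
\text{Two-tailed } (H_1: \mu \neq \mu_0) &: \quad \text{reject } H_0 \text{ if } |t^*| > t_{\alpha/2,\,n-1}
\end{aligned}
```
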

Hypothesis Testing (P-value approach)

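A P-value is a probability: assuming the null hypothesis is true, it is the probability of observing, in the direction of the alternative hypothesis, a test statistic at least as extreme as the one computed from our sample data. If the P-value is less than or equal to α, the null hypothesis is rejected. If the P-value is greater than α, the null hypothesis is not rejected.

Any hypothesis test using the P-value approach consists of the following four steps:

Specify the null and alternative hypotheses.
Assuming the null hypothesis is true, compute the test statistic from the sample data. If the hypothesis test concerns the population mean μ, the test statistic is $t^* = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where μ is the value stated in the null hypothesis.
Find the p-value.
Set the significance level α, the probability of making a Type I error, typically 0.01, 0.05, or 0.10, and then compare the p-value with α.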
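
The third step depends on the direction of the alternative hypothesis. One possible way to wrap this up in R is sketched below; the helper `p_value_t` and the example inputs (a test statistic of 2.0 with 20 degrees of freedom) are hypothetical, while `pt()` is the standard t-distribution function in base R.

```r
# p-value for a t test statistic, given the direction of the alternative hypothesis
p_value_t <- function(t_stat, df,
                      alternative = c("greater", "less", "two.sided")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    greater   = pt(t_stat, df = df, lower.tail = FALSE),          # area to the right
    less      = pt(t_stat, df = df),                              # area to the left
    two.sided = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # both tails
  )
}

# Illustrative values only
p_right <- p_value_t(2.0, df = 20, alternative = "greater")
p_left  <- p_value_t(2.0, df = 20, alternative = "less")
p_two   <- p_value_t(2.0, df = 20, alternative = "two.sided")
print(c(right = p_right, left = p_left, two_sided = p_two))
```

Using `abs(t_stat)` in the two-sided case means the caller does not have to worry about the sign of the test statistic.
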

Right-tailed test

I use a concrete example with R code to demonstrate the two approaches above. Suppose our sample consists of 25 athletes with the following heights:

170 167 174 179 179
156 163 156 187 156
183 179 174 179 170
156 187 179 183 174
187 167 159 170 179

I want to know whether the average height of all Chinese athletes is greater than 170, so I define the following hypotheses.

1. Specify the null and alternative hypotheses

H0:μ=170
H1:μ>170

2. Compute the test statistic: $t^* = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

    height <- c(170, 167, 174, 179, 179, 156, 163, 156, 187, 156,
                183, 179, 174, 179, 170, 156, 187, 179, 183, 174,
                187, 167, 159, 170, 179)
    # sample mean 'xbar'
    xbar <- mean(height)  # 172.52
    # hypothesized value 'mu'
    mu <- 170
    # sample standard deviation 's'
    s <- sd(height)  # 10.31
    # sample size 'n'
    n <- 25
    # test statistic 't'
    t <- (xbar - mu) / (s / sqrt(n))  # 1.22

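3. Find the critical value

There are two ways to find the critical value: use a T-Table, or use statistical software. Either way, we need the degrees of freedom, n − 1, and the significance level, α.

    # significance level; because this value can never exceed 1,
    # the leading 0 before the decimal point can be omitted
    alpha <- .05
    # see the official documentation for the qt function
    t.alpha <- qt(1 - alpha, df = n - 1)  # 1.711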

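4. Compare the critical value with the test statistic

Conclusion: above we found a critical value of 1.711 and a test statistic of 1.22. Because 1.22 < 1.711, I cannot reject the null hypothesis. In other words, the test statistic does not fall in the "critical region", and I do not have enough evidence that the average height of Chinese athletes is greater than 170. One point worth stressing: a different significance level may lead to a different result.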
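
To see how the choice of significance level can change the outcome, the sketch below repeats the comparison for several values of α using the same height data. The particular list of α values is my own choice for illustration.

```r
height <- c(170, 167, 174, 179, 179, 156, 163, 156, 187, 156,
            183, 179, 174, 179, 170, 156, 187, 179, 183, 174,
            187, 167, 159, 170, 179)

n      <- length(height)
t_stat <- (mean(height) - 170) / (sd(height) / sqrt(n))

alphas <- c(0.01, 0.05, 0.10, 0.15)      # illustrative significance levels
crit   <- qt(1 - alphas, df = n - 1)     # right-tailed critical values

# Reject H0 when the test statistic exceeds the critical value
decisions <- t_stat > crit

print(data.frame(alpha = alphas,
                 critical_value = round(crit, 3),
                 reject_H0 = decisions))
```

With these data the null hypothesis is retained at 0.01, 0.05 and 0.10, and would be rejected only at the unusually lenient level of 0.15. The significance level should therefore be fixed before looking at the data.
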

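Next, I carry out the hypothesis test with the P-value approach. The first two steps are the same for both approaches, so I start directly from step 3.

3. Find the p-value

Finding the p-value means finding the area under the curve from the test statistic to positive infinity.

    # t is the test statistic and n is the sample size, both computed above
    pval <- pt(t, df = n - 1, lower.tail = FALSE)  # 0.117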

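4. Compare the p-value with α

Conclusion: because the p-value of 0.117 is greater than α = 0.05, I cannot reject the null hypothesis. In other words, I do not have enough evidence that the average height of Chinese athletes is greater than 170.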
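
As a cross-check, the whole right-tailed test can also be run with R's built-in `t.test()` function, which reports the test statistic, degrees of freedom, and p-value in one call. This is a sketch of an alternative route, not the method used in the walkthrough above.

```r
height <- c(170, 167, 174, 179, 179, 156, 163, 156, 187, 156,
            183, 179, 174, 179, 170, 156, 187, 179, 183, 174,
            187, 167, 159, 170, 179)

# H0: mu = 170 versus H1: mu > 170
res <- t.test(height, mu = 170, alternative = "greater")

print(res$statistic)  # test statistic t
print(res$parameter)  # degrees of freedom
print(res$p.value)    # right-tail p-value
```

The values should agree with the manual calculation (t of about 1.22 and a p-value of about 0.117).
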

Left-tailed test

The sample data are as follows:

11.5 11.8 15.7 16.1 14.1 10.5
15.2 19.0 12.8 12.4 19.2 13.5
16.5 13.5 14.4 16.7 10.9 13.0
15.1 17.1 13.3 12.4 8.5 14.3
12.9 11.1 15.0 13.3 15.8 13.5
9.3 12.2 10.3

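I want to know whether the average lifetime of crops treated with the drug is less than the normal average lifetime of 15.7, so I define the following hypotheses.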

H0:μ=15.7
H1:μ<15.7

    life <- c(11.5, 11.8, 15.7, 16.1, 14.1, 10.5, 15.2, 19.0, 12.8, 12.4, 19.2,
              13.5, 16.5, 13.5, 14.4, 16.7, 10.9, 13.0, 15.1, 17.1, 13.3, 12.4,
              8.5, 14.3, 12.9, 11.1, 15.0, 13.3, 15.8, 13.5, 9.3, 12.2, 10.3)
    # sample mean 'xbar'
    xbar <- mean(life)  # 13.66
    # hypothesized value 'mu'
    mu <- 15.7
    # sample standard deviation 's'
    s <- sd(life)  # 2.54
    # sample size 'n'
    n <- 33
    # test statistic 't'
    t <- (xbar - mu) / (s / sqrt(n))  # -4.60
    # significance level; because this value can never exceed 1,
    # the leading 0 before the decimal point can be omitted
    alpha <- .05
    # see the official documentation for the qt function
    t.alpha <- -qt(1 - alpha, df = n - 1)  # -1.6939

Conclusion: above we found a critical value of -1.6939 and a test statistic of -4.60. Because -4.60 < -1.6939, I can reject the null hypothesis. In other words, the test statistic falls in the "critical region", and I have enough evidence that the average lifetime of the treated crops is less than the normal average lifetime of 15.7.

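Finding the p-value means finding the area under the curve from the test statistic to negative infinity.

    # t is the test statistic and n is the sample size, both computed above
    pval <- pt(t, df = n - 1)  # 3.174244e-05

Conclusion: because the p-value of 3.174244e-05 is less than α = 0.05, I can reject the null hypothesis. In other words, I have enough evidence that the average lifetime of the treated crops is less than the normal average lifetime of 15.7.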

Two-tailed test

The sample data are as follows:

7.65 7.60 7.65 7.70 7.55
7.55 7.40 7.40 7.50 7.50

I want to know whether the average size of a certain aircraft part is 7.5, so I define the following hypotheses.

H0:μ=7.5
H1:μ≠7.5

    size <- c(7.65, 7.60, 7.65, 7.70, 7.55, 7.55, 7.40, 7.40, 7.50, 7.50)
    # sample mean 'xbar'
    xbar <- mean(size)  # 7.55
    # hypothesized value 'mu'
    mu <- 7.5
    # sample standard deviation 's'
    s <- sd(size)  # 0.1027
    # sample size 'n'
    n <- 10
    # test statistic 't'
    t <- (xbar - mu) / (s / sqrt(n))  # 1.54
    # significance level; because this value can never exceed 1,
    # the leading 0 before the decimal point can be omitted
    alpha <- .05
    # see the official documentation for the qt function
    t.half.alpha <- qt(1 - alpha / 2, df = n - 1)  # 2.2622
    c(-t.half.alpha, t.half.alpha)  # [1] -2.2622  2.2622

Conclusion: above we found critical values of -2.2622 and 2.2622, and a test statistic of 1.54. Because 1.54 is neither greater than 2.2622 nor less than -2.2622, I cannot reject the null hypothesis. In other words, the test statistic does not fall in the "critical region", and I do not have enough evidence that the average part size differs from 7.5.

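Finding the p-value means adding the area under the curve from the negative of the test statistic's absolute value to negative infinity and the area from its absolute value to positive infinity.

    # t is the test statistic and n is the sample size, both computed above
    # abs(t) makes the calculation correct whether the test statistic is
    # positive or negative, so lower.tail does not need to be switched by hand
    pval <- 2 * pt(abs(t), df = n - 1, lower.tail = FALSE)  # 0.158

Conclusion: because the p-value of 0.158 is greater than α = 0.05, I cannot reject the null hypothesis. In other words, I do not have enough evidence that the average part size differs from 7.5.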
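
A two-tailed test at level α is closely tied to the (1 − α) confidence interval introduced earlier: the null value is rejected exactly when it falls outside that interval. The sketch below checks this for the part-size data using R's built-in `t.test()`; the variable `contains_mu0` is a name I made up for the check.

```r
size <- c(7.65, 7.60, 7.65, 7.70, 7.55, 7.55, 7.40, 7.40, 7.50, 7.50)

# H0: mu = 7.5 versus H1: mu != 7.5, with a 95% confidence interval
res <- t.test(size, mu = 7.5, alternative = "two.sided", conf.level = 0.95)

ci <- res$conf.int
contains_mu0 <- ci[1] <= 7.5 && 7.5 <= ci[2]

print(res$p.value)   # two-sided p-value
print(ci)            # 95% confidence interval for the mean
print(contains_mu0)  # TRUE means 7.5 is a plausible value, so H0 is not rejected
```

Here the interval contains 7.5, which agrees with the p-value being larger than 0.05.
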

Whichever approach I use, the hypothesis test reaches the same conclusion.

Chi-Square Tests

Below I use the chi-square test to check whether two variables are independent.

Null Hypothesis: The two categorical variables are independent.
Alternative Hypothesis: The two categorical variables are dependent.
The chi-square test statistic is computed with the following formula:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

O: observed frequency
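E: expected frequency under the null hypothesis, computed as follows:

$E = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$

Next, we compare the chi-square test statistic χ² with the critical value χ²_α for (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of rows and columns of the contingency table. If χ² > χ²_α, we reject the null hypothesis.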

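Example: Testing Independence Between Variables with Chi-Square

The survey data set, which ships with R's MASS package, contains two categorical variables, Exer and Smoke. Below I use the chi-square test to check whether these two variables are independent.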

    library(MASS)       # load the MASS package
    tbl = table(survey$Smoke, survey$Exer)
    tbl                 # the contingency table

            Freq None Some
      Heavy    7    1    3
      Never   87   18   84
      Occas   12    3    4
      Regul    9    1    7

    chisq.test(tbl)

        Pearson's Chi-squared test

    data:  tbl
    X-squared = 5.4885, df = 6, p-value = 0.4828

    Warning message:
    In chisq.test(tbl) : Chi-squared approximation may be incorrect

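Because the p-value of 0.4828 is greater than the .05 significance level, we cannot reject the null hypothesis; that is, we do not have enough evidence that smoking habit and exercise level are dependent. The warning appears because some cells have small expected counts, so the chi-square approximation may be unreliable.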
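
The statistic reported by `chisq.test()` can be reproduced by hand from the formulas for χ² and E given earlier. In the sketch below the observed counts are typed in directly from the contingency table above, so the MASS package is not needed; the variable names are my own.

```r
# Observed counts copied from the contingency table (rows: Smoke, columns: Exer)
O <- matrix(c( 7,  1,  3,
              87, 18, 84,
              12,  3,  4,
               9,  1,  7),
            nrow = 4, byrow = TRUE,
            dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                            Exer  = c("Freq", "None", "Some")))

# Expected counts under independence: row total * column total / sample size
E <- outer(rowSums(O), colSums(O)) / sum(O)

chi_sq <- sum((O - E)^2 / E)                  # chi-square test statistic
df     <- (nrow(O) - 1) * (ncol(O) - 1)       # (r - 1)(c - 1)
alpha  <- 0.05
crit   <- qchisq(1 - alpha, df = df)          # critical value
p_val  <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(c(chi_sq = chi_sq, df = df, critical = crit, p_value = p_val), 4))
print(round(E, 2))  # several expected counts are below 5, hence the warning
```

Since the statistic is smaller than the critical value, the critical value approach gives the same decision as the p-value: the null hypothesis is not rejected.
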

References

This article is summarized from: https://onlinecourses.science.psu.edu/statprogram/review_of_basic_statistics

1 0