4th Oct 2014: The factor type in R

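Using factor in regression analysis

I needed this function while running a regression today, and between the help page and an unprofessional translation I got thoroughly confused about what factor actually does. Rendering the name in Chinese as 因子 did not help at all.
It is actually simple: factor(c(…)) is used to create two kinds of data, qualitative (nominal) data and ordinal data. Note the lowercase c(): R is case-sensitive, and C() is a different function.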

Here is how to use it:

Creating nominal data

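```
colour <- c('G', 'G', 'R', 'Y', 'G', 'Y', 'Y', 'R', 'Y')
# [1] Turn the colour vector directly into a vector of nominal data
col <- factor(colour)
# [2] Set the levels explicitly and give them readable labels
col1 <- factor(colour, levels = c('G', 'R', 'Y'), labels = c('Green', 'Red', 'Yellow'))
# [3] List only some of the values as levels
col3 <- factor(colour, levels = c('G', 'R'))
```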

Let's look at the results of [2] and [3] to get a feel for it:

```
> col1
[1] Green  Green  Red    Yellow Green  Yellow Yellow Red    Yellow
Levels: Green Red Yellow
> col3
[1] G    G    R    <NA> G    <NA> <NA> R    <NA>
Levels: G R
```
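To see what a factor actually stores, here is a minimal sketch that reuses the colour vector from this post. It only uses base R functions (levels, as.integer, table, is.na), and it assumes the default behaviour where levels are the sorted distinct values.

```r
colour <- c('G', 'G', 'R', 'Y', 'G', 'Y', 'Y', 'R', 'Y')
col <- factor(colour)

# The distinct categories; by default they are the sorted unique values
levels(col)

# Internally a factor is a vector of integer codes pointing into levels(col)
as.integer(col)

# Counts per level
table(col)

# Any value not listed in `levels` is turned into NA
col3 <- factor(colour, levels = c('G', 'R'))
sum(is.na(col3))   # number of 'Y' entries that were dropped to NA
```

This is why col3 shows `<NA>` wherever the original value was 'Y': that value is simply not one of the declared levels.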

Creating ordinal data:

```
score <- c('A', 'B', 'A', 'C', 'B')
score1 <- ordered(score, levels = c('C', 'B', 'A'))
```

Enough talk, here is the result:

```
> score1
[1] A B A C B
Levels: C < B < A
```
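The practical difference from a plain factor is that the levels now have an order, so comparisons make sense. A small sketch using the same score vector (base R only; nothing here goes beyond what ordered() already provides):

```r
score <- c('A', 'B', 'A', 'C', 'B')
score1 <- ordered(score, levels = c('C', 'B', 'A'))

# Comparisons follow the declared order C < B < A, not alphabetical order
better_than_b <- score1 > 'B'

# max() and sort() also respect the level order
best <- max(score1)
sorted <- sort(score1)

print(better_than_b)
print(best)
print(sorted)
```

With an unordered factor the same comparison is not meaningful, and R returns NA with a warning.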

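Once created, a factor is a data type in its own right. One of its most important uses is in regression analysis, which is what I cover next.

That's right: dummy variables. With factor, setting them up in R becomes much quicker and easier.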
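To make the dummy-variable idea concrete, the sketch below asks R for the design matrix it would build from a factor. The six-element race vector is made up for illustration (it is not the hsb2 data), and the sketch assumes R's default treatment contrasts.

```r
# Hypothetical toy vector with four categories coded 1 to 4
race <- factor(c(1, 2, 3, 4, 2, 1))

# model.matrix() shows the columns a model formula would actually use
mm <- model.matrix(~ race)
print(mm)

# One 0/1 column per level except the first, which acts as the baseline
print(colnames(mm))
```

Each row has at most one dummy column set to 1, and rows belonging to the first level have all dummy columns set to 0.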

An example:

First, import a dataset on US society

hsb2 <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

Let's see what the data looks like

```
> str(hsb2)
'data.frame':   200 obs. of  11 variables:
 $ id     : int  70 121 86 141 172 113 50 11 84 48 ...
 $ female : int  0 1 0 0 0 0 0 0 0 0 ...
 $ race   : int  4 4 4 4 4 4 3 1 4 3 ...
 $ ses    : int  1 2 3 3 2 2 2 2 2 2 ...
 $ schtyp : int  1 1 1 1 1 1 1 1 1 1 ...
 $ prog   : int  1 3 1 3 2 2 1 2 1 2 ...
 $ read   : int  57 68 44 63 47 44 50 34 63 57 ...
 $ write  : int  52 59 33 44 52 52 59 46 57 55 ...
 $ math   : int  41 53 54 47 57 51 42 45 54 52 ...
 $ science: int  47 63 58 53 53 63 53 39 58 50 ...
 $ socst  : int  57 61 31 56 61 61 61 36 51 51 ...
```

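Now we want to know whether writing ability differs between races, so we run a regression on it. (Regressing without controlling for other factors is hard to justify, but this is purely an illustrative example, so never mind those details.)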

```
# creating the factor variable
hsb2$race.f <- factor(hsb2$race)
is.factor(hsb2$race.f)
summary(lm(write ~ race.f, data = hsb2))
# Let's look at the result:
##
## Call:
## lm(formula = write ~ race.f, data = hsb2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.055  -5.458   0.972   7.000  18.800
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    46.46       1.84   25.22  < 2e-16 ***
## race.f2        11.54       3.29    3.51  0.00055 ***
## race.f3         1.74       2.73    0.64  0.52461
## race.f4         7.60       1.99    3.82  0.00018 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.03 on 196 degrees of freedom
## Multiple R-squared:  0.107,  Adjusted R-squared:  0.0934
## F-statistic: 7.83 on 3 and 196 DF,  p-value: 5.78e-05
```

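So R automatically separated the different levels: race.f2, race.f3 and race.f4 are dummy variables, and level 1 is the baseline absorbed into the intercept. Easy, isn't it?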
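To show why the factor() call matters, here is a rough sketch on a small made-up data frame (the toy values and the names toy, group and y are hypothetical, not taken from hsb2). It compares treating the code as a number with treating it as a factor, and then uses base R's relevel() to pick a different baseline.

```r
# Hypothetical toy data: a group code from 1 to 4 and an outcome y
toy <- data.frame(
  group = c(1, 1, 2, 2, 3, 3, 4, 4),
  y     = c(46, 47, 58, 57, 48, 49, 54, 53)
)

# As a number: one slope, which assumes the codes 1..4 are evenly spaced
fit_num <- lm(y ~ group, data = toy)

# As a factor: one coefficient per non-baseline level
fit_fac <- lm(y ~ factor(group), data = toy)

# Change the baseline from level "1" to level "4"
toy$group.f <- relevel(factor(toy$group), ref = "4")
fit_ref <- lm(y ~ group.f, data = toy)

print(coef(fit_num))
print(coef(fit_fac))
print(coef(fit_ref))
```

With the factor version the intercept is the mean of the baseline group, and every other coefficient is that level's difference from the baseline.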

References:

[1]http://statistics.ats.ucla.edu/stat/r/modules/dummy_vars.htm
[2]http://blog.sina.com.cn/s/blog_59f8748e01011in6.html

0 0