Notes on the apply Family of Functions in R

Source: Internet | Published by: 我怎样解题知乎 | Editor: 程序博客网 | Time: 2024/05/21 16:54

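Why use apply

I'm a programmer, so when I first started learning R I treated it as "yet another programming language", but no matter how I approached it, it felt awkward. My current view is that R is not a general-purpose programming language but a software tool for statistics, so R code should not be designed with a general-purpose programming mindset. In Andrew Lim's answer comparing R and Python, R is described as having an array-oriented syntax: it looks more like mathematics and makes it easy for scientists to turn formulas into R code, whereas Python is a general-purpose language with more of an engineering flavor. When using R, try to think in terms of arrays and avoid for loops. How do you iterate without loops? That is where the apply family comes in. It is not a single function but a family of functions with similar purposes.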

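Overview

The basic job of the apply functions is to iterate over an array (possibly multi-dimensional) or a list, either element by element or over subsets of elements, and to call a given function with the current element or subset as its argument. A vector can be treated as a one-dimensional array, and a data frame can be treated as a special kind of list.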

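Relationships among these functions

Target    Applied to each element    Applied to subsets
array     apply                      tapply
list      lapply(...)                by

Here lapply(...) stands for a family of functions:

lapply
   |
   |-> simplified version: sapply
   |             |-> with a return-value template: vapply
   |             |-> multivariate version: mapply
   |
   |-> recursive version: rapply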

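Vectors are a bit odd: conceptually a vector is a one-dimensional array, yet it does not always use the same functions as arrays. To iterate element by element, use lapply (or one of its variants), just as for a list; to iterate over subsets, both tapply and by work, differing only in the form of the return value.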
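As a rough illustration of the two ways of iterating over a vector, here is a minimal sketch. The vector `x` and the grouping vector `g` are made-up data, not taken from any real dataset.

```r
# Hypothetical data for illustration only
x <- c(10, 20, 30, 40)
g <- c("a", "a", "b", "b")

# Element by element: the vector is iterated like a list
squares <- sapply(x, function(e) e^2)

# Subset by subset: elements sharing a value of g are grouped first
group_sums <- tapply(x, g, sum)
```

Here `squares` has one entry per element of `x`, while `group_sums` has one entry per distinct value of `g`.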

Functions and syntax

apply

apply(array, margin, FUN, ...)

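Calls FUN on array, one slice at a time along margin. When FUN returns a single value, the result is a vector (or an array, if margin names several dimensions). margin says which subscript of the array reference to iterate over, that is, which index in array[index1, index2, ...]: for a matrix, 1 means rows, 2 means columns, and c(1,2) means rows and columns, i.e. every cell. With margin=1, apply(a, 1, sum) is equivalent to the following: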

a <- array(c(1:24), dim=c(2,3,4))
result <- c()
for (i in c(1:dim(a)[1])) {
    result <- c(result, sum(a[i,,]))
}

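In my tests, apply needs an object that has dimensions, such as a matrix or a higher-dimensional array; a plain vector has no dim attribute, so apply fails on it (for a vector, use lapply or sapply instead). Taking a matrix as an example: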

> m <- matrix(c(1:10), nrow=2)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> apply(m, 1, sum)
[1] 25 30
> apply(m, 2, sum)
[1]  3  7 11 15 19
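The sketch below checks the loop equivalence on the 3-d array from the loop example above. It also tries two cases the matrix example does not cover: a margin naming two dimensions, and a FUN that returns more than one value. The variable names are mine.

```r
# Same 3-d array as in the loop example
a <- array(1:24, dim = c(2, 3, 4))

# margin = 1: one result per value of the first index
by_first <- apply(a, 1, sum)

# The explicit loop that apply(a, 1, sum) replaces
by_loop <- c()
for (i in seq_len(dim(a)[1])) {
  by_loop <- c(by_loop, sum(a[i, , ]))
}

# margin = c(1, 2): one result per (row, column) pair, summed over the third index
by_cell <- apply(a, c(1, 2), sum)

# When FUN returns a vector, each call's result becomes a COLUMN of the output
m <- matrix(1:10, nrow = 2)
row_ranges <- apply(m, 1, range)
```

Note the last case: although the iteration is over rows, the range of row i ends up in column i of `row_ranges`, so a `t()` is often needed afterwards.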

tapply

tapply(array, indices, FUN=NULL, ..., simplify=TRUE)

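Groups the elements of array by the values in indices: elements whose positions share the same value in indices form one set, and FUN is applied to each set. This is similar to a GROUP BY indices operation. If FUN returns a single value, tapply returns a vector; if FUN returns several values, tapply returns a list. The length of that vector or list equals the number of distinct values in indices.

When FUN is NULL, tapply returns a vector as long as array that encodes the grouping: positions holding equal values belong to the same group. For example, a result of c(1, 2, 1, 3, 2) means that, according to indices, elements 1 and 3 form one group, elements 2 and 5 another, and element 4 a third.

Example with a one-dimensional array (i.e. a vector)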

> v <- c(1:5)
> ind <- c('a','a','a','b','b')
> tapply(v, ind)
[1] 1 1 1 2 2
> tapply(v, ind, sum)
a b 
6 9 
> tapply(v, ind, fivenum)
$a
[1] 1.0 1.5 2.0 2.5 3.0

$b
[1] 4.0 4.0 4.5 5.0 5.0

Example with a two-dimensional array (i.e. a matrix)

> m <- matrix(c(1:10), nrow=2)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> ind <- matrix(c(rep(1,5), rep(2,5)), nrow=2)
> ind
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    2    2
[2,]    1    1    2    2    2
> tapply(m, ind)
 [1] 1 1 1 1 1 2 2 2 2 2
> tapply(m, ind, mean)
1 2 
3 8 
> tapply(m, ind, fivenum)
$`1`
[1] 1 2 3 4 5

$`2`
[1]  6  7  8  9 10
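One way to picture the "group by" behavior is to split the data first and then apply FUN to each piece. This is only a sketch of the idea, not a description of how tapply is implemented internally; it reuses the small vector example from this section.

```r
v <- 1:5
ind <- c("a", "a", "a", "b", "b")

# Grouped sum in one step
via_tapply <- tapply(v, ind, sum)

# The same result in two steps: split into groups, then apply FUN to each group
groups <- split(v, ind)          # list(a = 1:3, b = 4:5)
via_split <- sapply(groups, sum)
```

Both give 6 for group "a" and 9 for group "b"; the difference is only in the container (tapply returns a one-dimensional array, sapply a named vector).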

by

by(dataframe, INDICES, FUN, ..., simplify=TRUE)

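by can be thought of as tapply for data frames. INDICES should have the same length as each column of the data frame, i.e. one value per row. The return value is an object of class by; with simplify=FALSE it is essentially a list.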

> df <- data.frame(a=c(1:5), b=c(6:10))
> ind <- c(1,1,1,2,2)
> res <- by(df, ind, colMeans)
> res
ind: 1
a b 
2 7 
------------------------------------------------------------ 
ind: 2
  a   b 
4.5 9.5 
> class(res)
[1] "by"
> names(res)
[1] "1" "2"
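A point that is easy to miss is that FUN receives a whole data frame (the rows of one group), not a vector. The sketch below, using the same data as the example above, shows this with `nrow` and then pulls a single group's result out of the by object by position.

```r
df <- data.frame(a = 1:5, b = 6:10)
ind <- c(1, 1, 1, 2, 2)

# FUN is called once per group with the corresponding rows as a data frame
counts <- by(df, ind, function(d) nrow(d))

# Individual group results can be extracted like list elements
means <- by(df, ind, colMeans)
first_group <- means[[1]]   # column means of the rows where ind == 1
```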

lapply

lapply(list, FUN, ...)

Calls FUN on each element of list. It also works on a data frame, since a data frame is a special kind of list. Example:

> lst <- list(a=c(1:5), b=c(6:10))
> lapply(lst, mean)
$a
[1] 3

$b
[1] 8

> lapply(lst, fivenum)
$a
[1] 1 2 3 4 5

$b
[1]  6  7  8  9 10
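To make the data frame case concrete, here is a small sketch in which lapply iterates over the columns of a made-up data frame. It also shows how extra arguments after FUN are forwarded through `...`; the parameter name `k` is my own.

```r
df <- data.frame(a = 1:5, b = 6:10)

# Each column is one list element, so this computes one mean per column
col_means <- lapply(df, mean)

# Arguments after FUN are passed on to every call (here k = 2)
scaled <- lapply(df, function(col, k) col * k, k = 2)
```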

sapply

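sapply(list, FUN, ..., simplify=TRUE, USE.NAMES=TRUE)

Compared with lapply it has an extra simplify parameter. With simplify=FALSE it is equivalent to lapply. Otherwise it takes the list that lapply would produce and simplifies it to a vector or matrix where possible. Example: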

> lst <- list(a=c(1:5), b=c(6:10))
> sapply(lst, mean)
a b 
3 8 
> sapply(lst, fivenum)
     a  b
[1,] 1  6
[2,] 2  7
[3,] 3  8
[4,] 4  9
[5,] 5 10
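The following sketch checks the simplify=FALSE claim directly and shows what happens when the results cannot be simplified because they have different lengths. The ragged input list is invented for the purpose.

```r
lst <- list(a = 1:5, b = 6:10)

# With simplify = FALSE the result matches lapply exactly
same_as_lapply <- identical(sapply(lst, mean, simplify = FALSE),
                            lapply(lst, mean))

# Results of unequal length cannot be combined, so sapply falls back to a list
ragged <- sapply(list(a = 1:3, b = 1:5), function(x) x[x > 2])
```

Because the shape of the result depends on what FUN happens to return, sapply is convenient interactively but less predictable inside functions.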

vapply

vapply(list, FUN, FUN.VALUE, ..., USE.NAMES=TRUE)

vapply is similar to sapply, but its third argument FUN.VALUE specifies the form of each return value and can be seen as a template for it. Example:

> lst <- list(a=c(1:5), b=c(6:10))
> res <- vapply(lst, function(x) c(min(x), max(x)), c(min.=0, max.=0))
> res
     a  b
min. 1  6
max. 5 10
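The template is also enforced: if FUN returns something that does not match FUN.VALUE, vapply stops with an error rather than silently changing the result's shape. A minimal sketch, with the error caught by tryCatch so the script keeps running:

```r
lst <- list(a = 1:5, b = 6:10)

# Template says "one number per element", and mean() delivers exactly that
ok <- vapply(lst, function(x) mean(x), numeric(1))

# range() returns two values, which violates the length-1 template
bad <- tryCatch(
  vapply(lst, function(x) range(x), numeric(1)),
  error = function(e) "error"
)
```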

mapply

mapply(FUN, ..., MoreArgs=NULL, SIMPLIFY=TRUE, USE.NAMES=TRUE)

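mapply is the multivariate version of sapply. The ... part accepts several data arguments; mapply applies FUN to the first elements of all of them, then to the second elements, and so on. The arguments should have the same length, or lengths that are integer multiples of one another (shorter ones are recycled). The result is a vector or a matrix, depending on whether FUN returns one value or several.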

> mapply(sum, list(a=1,b=2,c=3), list(a=10,b=20,d=30))
 a  b  c 
11 22 33 
> mapply(function(x,y) x^y, c(1:5), c(1:5))
[1]    1    4   27  256 3125
> mapply(function(x,y) c(x+y, x^y), c(1:5), c(1:5))
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    4    6    8   10
[2,]    1    4   27  256 3125
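The examples above do not use MoreArgs or SIMPLIFY, and they do not show recycling, so here is a small sketch of all three. The function bodies and the parameter name `k` are my own choices for illustration.

```r
# MoreArgs holds arguments that stay the same for every call (not iterated over)
scaled <- mapply(function(x, y, k) (x + y) * k, 1:3, 4:6,
                 MoreArgs = list(k = 10))

# A shorter argument is recycled: y is used as 1, 2, 1, 2
recycled <- mapply(function(x, y) x + y, 1:4, 1:2)

# SIMPLIFY = FALSE keeps the per-call results in a list
pairs <- mapply(function(x, y) c(x, y), 1:2, 3:4, SIMPLIFY = FALSE)
```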

rapply

rapply(list, FUN, classes="ANY", deflt=NULL, how=c("unlist", "replace", "list"), ...)

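rapply is the recursive version of lapply. It traverses the list; whenever an element is itself a list it descends into it, and for every non-list element whose class is one of those given in classes it calls FUN. classes="ANY" matches every class.

The how parameter selects the mode of operation. There are three:

"replace": replaces each matching element of the original list with the result of calling FUN on it
"list": builds a new list in which elements whose class is in classes are passed to FUN and all other elements are replaced by deflt; the structure of the original list is preserved
"unlist": equivalent to calling unlist(recursive=TRUE) on the result of "list" mode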

> lst <- list(a=list(aa=c(1:5), ab=c(6:10)), b=list(ba=c(1:10)))
> lst
$a
$a$aa
[1] 1 2 3 4 5

$a$ab
[1]  6  7  8  9 10


$b
$b$ba
 [1]  1  2  3  4  5  6  7  8  9 10


> rapply(lst, sum, how='list')
$a
$a$aa
[1] 15

$a$ab
[1] 40


$b
$b$ba
[1] 55


> rapply(lst, sum, how='unlist')
a.aa a.ab b.ba 
  15   40   55 

The second example shows how the classes and deflt parameters are used:

> lst2 <- list(a=list(aa=c(1:5), ab=c(6:10)), b=list(ba="I am a string"))
> lst2
$a
$a$aa
[1] 1 2 3 4 5

$a$ab
[1]  6  7  8  9 10


$b
$b$ba
[1] "I am a string"


> rapply(lst2, sum, how='unlist')
Error in .Primitive("sum")("I am a string", ...) : 
  invalid 'type' (character) of argument
> rapply(lst2, sum, classes=c('integer'), deflt=-1, how='unlist')
a.aa a.ab b.ba 
  15   40   -1 
> rapply(lst2, nchar, classes=c('character'), deflt=as.integer(NA), how='unlist')
a.aa a.ab b.ba 
  NA   NA   13 
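The examples above cover how='list' and how='unlist' but not how='replace'. A minimal sketch of that mode follows, with `lst2` rebuilt from the printed output above so the block runs on its own.

```r
lst2 <- list(a = list(aa = 1:5, ab = 6:10),
             b = list(ba = "I am a string"))

# Only integer leaves are passed to sum(); everything else is left untouched
replaced <- rapply(lst2, sum, classes = "integer", how = "replace")
```

Unlike "list" mode, non-matching elements keep their original value here, so deflt plays no role.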

eapply

apply over an environment. I have never used environments, so I am leaving this one aside for now.
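For completeness, a very small sketch of what a call might look like, assuming a throwaway environment `env` created just for this purpose: eapply calls FUN on every object stored in the environment and returns a named list.

```r
# A hypothetical environment holding two vectors
env <- new.env()
assign("x", 1:5, envir = env)
assign("y", 6:10, envir = env)

# One result per object in env; the order of the elements is not guaranteed
sums <- eapply(env, sum)
```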

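Applications

Building a crosstable with tapply

Here is a worked example. The raw data are sales figures broken down by year (year), region (loc) and product category (type). We want two crosstables of total sales: one with years as rows and regions as columns, the other with years as rows and categories as columns.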

> df <- data.frame(year=kronecker(2001:2003, rep(1,4)), loc=c('beijing','beijing','shanghai','shanghai'), type=rep(c('A','B'),6), sale=rep(1:12))
> df
   year      loc type sale
1  2001  beijing    A    1
2  2001  beijing    B    2
3  2001 shanghai    A    3
4  2001 shanghai    B    4
5  2002  beijing    A    5
6  2002  beijing    B    6
7  2002 shanghai    A    7
8  2002 shanghai    B    8
9  2003  beijing    A    9
10 2003  beijing    B   10
11 2003 shanghai    A   11
12 2003 shanghai    B   12
> tapply(df$sale, df[,c('year','loc')], sum)
      loc
year   beijing shanghai
  2001       3        7
  2002      11       15
  2003      19       23
> tapply(df$sale, df[,c('year','type')], sum)
      type
year    A  B
  2001  4  6
  2002 12 14
  2003 20 22
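As a cross-check, the same year-by-region table can probably also be produced with base R's xtabs, which sums the left-hand side of a formula over the factors on the right. This is an alternative I am adding for comparison, not part of the tapply approach itself; the data frame is rebuilt with rep() so the block stands alone.

```r
df <- data.frame(
  year = rep(2001:2003, each = 4),
  loc  = rep(c("beijing", "beijing", "shanghai", "shanghai"), 3),
  type = rep(c("A", "B"), 6),
  sale = 1:12
)

# tapply with a list of two grouping vectors gives a year x loc matrix of sums
by_loc <- tapply(df$sale, list(df$year, df$loc), sum)

# xtabs builds the same totals from a formula
alt <- xtabs(sale ~ year + loc, data = df)
```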

Reference

Andrew Lim

https://screamyao.wordpress.com/2011/05/03/various-apply-functions-in-r-explained/

https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/

http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#apply

Original: http://www.cnblogs.com/aquastone/p/r-apply.html

0 0