Purrr package for R is good for performance
Source: Internet | Editor: 程序博客网 | Time: 2024/05/22 02:10
Hadley’s project purrr
If you haven’t seen it yet, there is some good work over on GitHub, where Hadley Wickham has been
filling in more of the gaps in R for anyone who wants a fuller set of functional programming
constructs to work with.
In true Hadley style, the code is fast as well as syntactically pleasant.
To install the package, which is not on CRAN as of this post, run:
# install.packages("devtools")
devtools::install_github("hadley/purrr")
Here is an example using purrr. It splits a data frame into pieces, fits a model to each piece, summarises each fit and extracts R^2.
library(purrr)

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
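For comparison, the same four steps can be spelled out in base R. This is only a sketch to show what each map() call in the pipeline is doing; the intermediate variable names are made up for illustration.

```r
# Base R sketch of the split / fit / summarise / extract steps.
pieces    <- split(mtcars, mtcars$cyl)                           # one data frame per cyl value
models    <- lapply(pieces, function(df) lm(mpg ~ wt, data = df)) # fit a model to each piece
summaries <- lapply(models, summary)                              # summarise each fit
r_squared <- vapply(summaries, function(s) s$r.squared, numeric(1))
r_squared
```

The purrr version expresses the same thing without naming the intermediate lists, and map_dbl("r.squared") replaces the hand-written extractor function.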
Here is another, more complicated example. It generates 100 random test-training splits, fits a model to each training split then evaluates based on the test split:
library(dplyr)

random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length = n), c(0, cumsum(probs)),
    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}

partition <- function(df, n, probs) {
  # zip() is the purrr function from the development version used in this post;
  # later purrr releases call it transpose()
  replicate(n, split(df, random_group(nrow(df), probs)), FALSE) %>%
    zip() %>%
    as_data_frame()
}

msd <- function(x, y) sqrt(mean((x - y) ^ 2))

# Generate 100 random test-training splits
boot <- partition(mtcars, 100, c(test = 0.8, training = 0.2))
boot

boot <- boot %>% mutate(
  # Fit the models (on each training split, not on the full mtcars data)
  models = map(training, ~ lm(mpg ~ wt, data = .)),
  # Make predictions on test data
  preds = map2(models, test, predict),
  diffs = map2(preds, test %>% map("mpg"), msd)
)

# Evaluate mean-squared difference between predicted and actual
mean(unlist(boot$diffs))
As Hadley writes about the philosophy for purrr, the goal is not to try and simulate Haskell in R: purrr does not implement currying or destructuring binds or pattern matching. The goal is to give you similar expressiveness to an FP language, while allowing you to write code that looks and works like R.
Instead of point free style, use the pipe, %>%, to write code that can be read from left to right.
Instead of currying, we use ... to pass in extra arguments.
Anonymous functions are verbose in R, so we provide two convenient shorthands. For predicate functions, ~ .x + 1 is equivalent to function(.x) .x + 1. For chains of transformation functions, . %>% f() %>% g() is equivalent to function(.) . %>% f() %>% g().
R is weakly typed, so we can implement general zip(), rather than having to specialise on the number of arguments. (That said I still provide map2() and map3() since it’s useful to clearly separate which arguments are vectorised over).
R has named arguments, so instead of providing different functions for minor variations (e.g. detect() and detectLast()) I use a named argument, .first. Type-stable functions are easy to reason about so additional arguments will never change the type of the output.
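The shorthands and the use of ... described above can be tried out directly. A minimal sketch, assuming a purrr release that exports map_dbl() and re-exports the %>% pipe; the input data and the variable names are invented for illustration:

```r
library(purrr)

x <- list(a = c(1, 2, NA), b = c(4, 5, 6))

# Formula shorthand versus a full anonymous function: both add one.
plus_one_short <- map_dbl(1:3, ~ .x + 1)
plus_one_long  <- map_dbl(1:3, function(.x) .x + 1)

# Extra arguments are passed through `...` instead of currying:
# na.rm = TRUE is forwarded to mean() for every element.
means <- map_dbl(x, mean, na.rm = TRUE)

# A pipe chain that starts with `.` defines a function.
root_sum <- . %>% sqrt() %>% sum()
root_sum(c(4, 9))
```

Here plus_one_short and plus_one_long are identical, which is the equivalence the text describes.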
Timings
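OK, so how about some measurements of performance? The end goal is a 10 x 10,000 matrix with one row for each pairwise combination of the levels in f. The input is a 500 x 10,000 matrix of uniform random numbers, plus a grouping vector f with 5 levels.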
# Some data
nvars <- 10000
nsamples <- 500
sample_groups <- 5
MAT <- replicate(nvars, runif(n = nsamples))

# And a grouping vector:
f <- rep_len(1:sample_groups, nsamples)
f <- LETTERS[f]
In pursuit of this, the first task is to calculate the mean for each group for all columns. First, a base R approach using the higher-order function aggregate(), with a few helper definitions.
# Settings
aggr_FUN <- mean
combi_FUN <- function(x, y) "/"(x, y)

# helper function
pasteC <- function(x, y) paste(x, y, sep = " - ")

# aggregate
system.time({
  temp2 <- aggregate(. ~ class, data = cbind.data.frame(class = f, MAT), aggr_FUN)
})
which yields
user system elapsed
13.457 1.187 14.766
Here’s an approach with reshape2
# reshape2
library(reshape2)
system.time({
  temp3 <- recast(data.frame(class = f, MAT), class ~ variable, id.var = "class", aggr_FUN)
})
which has
user system elapsed
1.945 0.454 2.525
That is roughly 6x faster by elapsed time (about 7x by user time). Finally, here is a purrr approach. First look at the elegance of the representation, then at the timings.
# purrr
# (slice_rows() and by_slice() later moved from purrr to the purrrlyr package)
library(purrr)
system.time({
  tmp <- data.frame(class = f, MAT) %>%
    slice_rows("class") %>%
    by_slice(map, aggr_FUN)
})
user system elapsed
0.512 0.043 0.569
Another speedup of roughly 4x, or about 26x faster than the original approach with aggregate. Impressive. The purrr work deserves to be
looked at and picked up by R developers, as it is both elegant and performant.
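The speedup figures can be rechecked from the timings reported above. A small sketch, with the elapsed times copied from the three system.time() outputs; the vector names are only for illustration, and ratios based on user time come out slightly differently:

```r
# Elapsed seconds as reported in this post (single runs, not repeated benchmarks).
elapsed <- c(aggregate = 14.766, reshape2 = 2.525, purrr = 0.569)

# Speedup of each approach relative to aggregate().
speedup_vs_aggregate <- round(elapsed[["aggregate"]] / elapsed, 1)
speedup_vs_aggregate

# Speedup of purrr relative to reshape2.
purrr_vs_reshape2 <- round(elapsed[["reshape2"]] / elapsed[["purrr"]], 1)
purrr_vs_reshape2
```

Since each figure comes from a single run on one machine, the ratios are best read as rough orders of magnitude.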
All of this has resulted in
tmp[,1:10]
Source: local data frame [5 x 10]

  class        X1        X2        X3        X4        X5        X6        X7        X8        X9
1     A 0.5194124 0.5066943 0.5326734 0.5042122 0.4190162 0.4882796 0.4947138 0.4701085 0.4982535
2     B 0.5267829 0.4545410 0.4883640 0.4894278 0.4672661 0.4477106 0.4832262 0.4583598 0.4767773
3     C 0.4703151 0.4994032 0.4842406 0.4960585 0.5276044 0.4817216 0.4853307 0.5331066 0.4881527
4     D 0.5139762 0.5318747 0.5071466 0.4657025 0.4972884 0.4815889 0.5049296 0.4685044 0.5535197
5     E 0.5439962 0.4479991 0.4640088 0.4946168 0.4716724 0.5370196 0.5011706 0.5219855 0.5160875
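For readers whose purrr release does not provide slice_rows() and by_slice(), the same per-group column means can be sketched with split() and map(). This is an assumption-laden alternative, not the code that was timed above: it uses a smaller made-up matrix (MAT_small, f_small) so it runs quickly, and it returns a plain matrix rather than a data frame.

```r
library(purrr)

set.seed(1)
# Smaller stand-ins for MAT and f (hypothetical sizes, chosen for speed).
MAT_small <- replicate(20, runif(n = 50))
f_small   <- LETTERS[rep_len(1:5, 50)]

group_means <- data.frame(MAT_small) %>%
  split(f_small) %>%               # one data frame per group
  map(~ map_dbl(.x, mean)) %>%     # column means within each group
  do.call(rbind, .)                # stack the groups into a 5 x 20 matrix

dim(group_means)
```

The rows are named after the groups (A to E) and the columns after the variables (X1 to X20), which mirrors the layout of the output shown above.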