Using purrr with dplyr


purrr was finally released on CRAN last week. This package is focused
on working with lists (and data frames by the same token). However, it
is not a DSL for lists in the way dplyr is a DSL for data frames. It
aims at creating a "better standard lib" focused on functional
programming. purrr should feel like R programming and bring out the
elegance of the language. That said, purrr can be a nice companion to
your dplyr pipelines, especially when you need to apply a function to
many columns. In this post I show how purrr's functional tools can be
applied to a dplyr workflow.

dplyr provides mutate_each() and summarise_each() for the purpose
of mapping functions but I find that they are not as easy to use as
the rest of the interface. This is mostly because there is no easy way
to map a function to parts of your data frame. It's all columns or
nothing. Also, they introduce a custom notation for lambda functions that
can be a bit cumbersome. These are two areas where purrr shines in
comparison. And since the interface has been designed with pipes in
mind, purrr's functions integrate with dplyr pipelines quite well.
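To give a feel for the lambda notation, here is a minimal sketch on the built-in mtcars data. The formula shorthand `~ mean(.x)` is purrr's way of writing an anonymous function, with `.x` standing for the current column. The variable names `means_fun` and `means_lambda` are made up for this illustration.

```r
library(purrr)

# An ordinary anonymous function applied to every column
means_fun <- map_dbl(mtcars, function(x) mean(x))

# The same computation written with purrr's formula shorthand
means_lambda <- map_dbl(mtcars, ~ mean(.x))

# Both calls return a named numeric vector with one mean per column
identical(means_fun, means_lambda)
```

The shorthand pays off mostly in pipelines, where a full `function(x)` declaration tends to clutter a one-line step.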

Mapping to columns conditionally

One of my favourite functions in purrr is map_if(). It accepts a
predicate function or a logical vector that specifies which columns
should be mapped with a function. This makes it easy to apply a
function conditionally, as in the following snippet where we transform
all factors to a character vector:

    library("purrr")
    library("dplyr")
    data(diamonds, package = "ggplot2")

    diamonds %>% map_if(is.factor, as.character) %>% str()

    #> Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
    #>  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
    #>  $ cut    : chr  "Ideal" "Premium" "Good" "Premium" ...
    #>  $ color  : chr  "E" "E" "E" "I" ...
    #>  $ clarity: chr  "SI2" "SI1" "VS1" "VS2" ...
    #>  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
    #>  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
    #>  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
    #>  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
    #>  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
    #>  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
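map_if() also takes a logical vector in place of the predicate. The sketch below uses a small made-up data frame (`df` and its columns are hypothetical) and computes the selection up front. Depending on the purrr version installed, the result may come back as a plain list rather than a data frame, so the sketch coerces it explicitly.

```r
library(purrr)

# Toy data: two factor columns and one numeric column
df <- data.frame(
  grade = factor(c("low", "high", "low")),
  score = c(1.5, 2.5, 3.5),
  group = factor(c("a", "b", "a"))
)

# One TRUE/FALSE entry per column, computed before the mapping step
is_fct <- map_lgl(df, is.factor)

# Only the columns flagged TRUE are passed to as.character()
converted <- map_if(df, is_fct, as.character)

# Coerce back to a data frame in case a plain list was returned
converted <- as.data.frame(converted, stringsAsFactors = FALSE)
str(converted)
```

A precomputed logical vector is handy when the selection depends on something other than the column's type, such as a naming rule or a share of missing values.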

Mapping to specific columns

While cleaning a dataset, it is common to apply the same
transformation to many variables. For example, reversing a scale or
shifting it to zero. Instead of writing a long mutate() call with
those transformations, I prefer to do it in one go.

This can be done with map_at() which takes a vector of column
positions or column names. For example, let's assume you have written
two functions reverse_scale() and shift_to_zero() that should be
applied to specific variables. You record those variables in character
vectors just before starting the dplyr/purrr pipeline, and then add
the relevant map_at() calls.

    to_reverse_vars <- c(
      "cyl", "am", "vs",
      "gear", "carb"
    )
    to_zero_vars <- c(
      "cyl", "gear", "carb"
    )

    mtcars %>%
      select(-disp) %>%
      map_at(to_reverse_vars, reverse_scale) %>%
      map_at(to_zero_vars, shift_to_zero)
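The two helpers are left undefined above. As an illustration only, here is one plausible way to write them. These definitions are my assumption of what reverse_scale() and shift_to_zero() might do, not a fixed recipe.

```r
library(purrr)

# Hypothetical helper: mirror a scale so the largest value becomes the smallest
reverse_scale <- function(x) {
  max(x, na.rm = TRUE) + min(x, na.rm = TRUE) - x
}

# Hypothetical helper: shift a scale so that its minimum sits at zero
shift_to_zero <- function(x) {
  x - min(x, na.rm = TRUE)
}

reverse_scale(c(1, 2, 5))  # 5 4 1
shift_to_zero(c(3, 4, 8))  # 0 1 5

# Apply one of them to selected columns only; the others pass through untouched
shifted <- map_at(mtcars, c("cyl", "gear"), shift_to_zero)
range(shifted$cyl)
```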

Expanding one column to many with lmap()

lmap()'s story starts with the mysterious tweet and the gist that show
up when you google "hadley monads". While I'm not sure I really
understand how it is monadic, lmap() is quite useful for extending a
data frame without having to deal with binds, merges or having to
define new column names.

Let's say you have a numeric variable that you want to discretise for
data exploration or modelling (for example, to use as the faceting
variable in a ggplot). There are several ways to cut a vector into
pieces. Ideally, the cutpoints should be derived from theory, but it's
often not possible or too time consuming to do so. In this case, I
like to create different categorisations and check if the results are
consistent (and investigate when they are not). Let's define two
cutting functions, one that tries to create categories with equal
sample sizes while the other just uses equal ranges to determine
cutpoints.

    cut_equal_sizes <- function(x, n = 3, ...) {
      ggplot2::cut_number(x, n, ...)
    }
    cut_equal_ranges <- function(x, n = 3, ...) {
      cut(x, n, include.lowest = TRUE, ...)
    }
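To see how the two strategies can disagree, here is a small base-R sketch on a made-up skewed vector. It uses quantile-based cutpoints as a stand-in for cut_number(), so that it runs without ggplot2. The names `x`, `by_size` and `by_range` are only for illustration.

```r
# A skewed toy vector: eight small values and one large outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
n <- 3

# Equal sample sizes: cutpoints placed at the quantiles of x
size_breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1))
by_size <- cut(x, size_breaks, include.lowest = TRUE)

# Equal ranges: cutpoints evenly spaced between min(x) and max(x)
by_range <- cut(x, n, include.lowest = TRUE)

table(by_size)   # three observations in each category
table(by_range)  # 8, 0 and 1 observations
```

With a single outlier the equal-range categories become almost useless, which is the kind of inconsistency worth checking for.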

It'd be nice to "grow" the data frame at specific numeric columns in
such a way that two new discretised variables appear just next
to them with appropriate column names. lmap() is well suited to this
because instead of applying a function to the vectors contained in a
data frame, it applies it to subsets of size 1 of that data
frame. This has several advantages:

- You get the name of the vector as an attribute of the enclosing data
  frame.
- The usual mapping tools work on columns, so when you return a list
  or a data frame of vectors, they'll try to stick these inside a
  list-column, which is not what we want in this case. By comparison,
  lmap() gives a data frame to a function and expects a data frame
  in return and has no problem dealing with it when it has more than
  one column.
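The difference between a vector and a subset of size 1 can be sketched in base R. The loop below only mimics the idea behind lmap() and is not purrr's implementation. The data frame `df` and the helper `add_double()` are hypothetical.

```r
df <- data.frame(a = 1:3, b = 4:6)

# Single brackets keep the enclosing data frame, and with it the column name
piece <- df["a"]
is.data.frame(piece)  # TRUE
names(piece)          # "a"

# Double brackets extract the bare vector, which has lost its name
is.data.frame(df[["a"]])  # FALSE

# A function that takes a one-column data frame and returns a two-column one
add_double <- function(x) {
  x[[paste0(names(x), "_x2")]] <- x[[1]] * 2
  x
}

# Apply it to each size-1 subset, then bind the pieces side by side
pieces <- lapply(seq_along(df), function(i) add_double(df[i]))
out <- do.call(cbind, pieces)
names(out)
```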

Let's write a function to be mapped in such a way. This function
doesn't work with vectors but with vectors enclosed in a data
frame. It takes and returns a data frame.

    cut_categories <- function(x, n = 3) {
      # Record the name of the enclosed vector
      name <- names(x)

      # Create the new columns
      x$cat_n <- cut_equal_sizes(x[[1]], n)
      x$cat_r <- cut_equal_ranges(x[[1]], n)

      # Adjust the names of the new columns
      names(x)[2:3] <- paste0(name, "_", n, names(x)[2:3])

      x
    }

Then we just add an lmap_at() call to our data cleaning pipeline:

    to_discretise_vars <- c(
      "mpg", "disp", "drat",
      "wt", "qsec"
    )

    mtcars %>%
      lmap_at(to_discretise_vars, cut_categories) %>%
      str()

    #> Classes 'tbl_df', 'tbl' and 'data.frame':    32 obs. of  21 variables:
    #>  $ mpg        : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    #>  $ mpg_3cat_n : Factor w/ 3 levels "[10.4,16.7]",..: 2 2 3 2 2 2 1 3 3 2 ...
    #>  $ mpg_3cat_r : Factor w/ 3 levels "[10.4,18.2]",..: 2 2 2 2 2 1 1 2 2 2 ...
    #>  $ cyl        : num  6 6 4 6 8 6 8 4 4 6 ...
    #>  $ disp       : num  160 160 108 258 360 ...
    #>  $ disp_3cat_n: Factor w/ 3 levels "[71.1,146]","(146,293]",..: 2 2 1 2 3 2 3 2 1 2 ...
    #>  $ disp_3cat_r: Factor w/ 3 levels "[70.7,205]","(205,338]",..: 1 1 1 2 3 2 3 1 1 1 ...
    #>  $ hp         : num  110 110 93 110 175 105 245 62 95 123 ...
    #>  $ drat       : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    #>  $ drat_3cat_n: Factor w/ 3 levels "[2.76,3.17]",..: 2 2 2 1 1 1 2 2 3 3 ...
    #>  $ drat_3cat_r: Factor w/ 3 levels "[2.76,3.48]",..: 2 2 2 1 1 1 1 2 2 2 ...
    #>  $ wt         : num  2.62 2.88 2.32 3.21 3.44 ...
    #>  $ wt_3cat_n  : Factor w/ 3 levels "[1.51,2.81]",..: 1 2 1 2 2 2 3 2 2 2 ...
    #>  $ wt_3cat_r  : Factor w/ 3 levels "[1.51,2.82]",..: 1 2 1 2 2 2 2 2 2 2 ...
    #>  $ qsec       : num  16.5 17 18.6 19.4 17 ...
    #>  $ qsec_3cat_n: Factor w/ 3 levels "[14.5,17]","(17,18.6]",..: 1 1 3 3 1 3 1 3 3 2 ...
    #>  $ qsec_3cat_r: Factor w/ 3 levels "[14.5,17.3]",..: 1 1 2 2 1 3 1 2 3 2 ...
    #>  $ vs         : num  0 0 1 1 0 1 0 1 1 1 ...
    #>  $ am         : num  1 1 1 0 0 0 0 0 0 0 ...
    #>  $ gear       : num  4 4 4 3 3 3 3 4 4 4 ...
    #>  $ carb       : num  4 4 1 1 2 1 4 2 2 4 ...

The data frame comes out of the pipeline with the new discretised
variables nicely arranged and named.

Mapping a function within groups

purrr is also able to deal with dplyr groupings. The groups can be
defined with either dplyr::group_by() or purrr::slice_rows(). To
apply a function to all columns within groups, just combine a mapping
function with the by_slice() adverb:

    mtcars %>%
      slice_rows("cyl") %>%
      by_slice(map, ~ .x / sum(.x))
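In case slice_rows() and by_slice() are not available in your installed version of purrr, the same within-group computation can be sketched with base R's split() and purrr's map(). This is only an illustration of what the pipeline computes, not how by_slice() works internally. I restrict it to three strictly positive columns (my choice) so that no group total is zero.

```r
library(purrr)

# Keep a few strictly positive columns so that no group total is zero
vars <- c("mpg", "hp", "wt")

# Split the rows by cylinder count, mirroring the grouping step
groups <- split(mtcars[vars], mtcars$cyl)

# Within each group, divide every column by its group total
shares <- map(groups, function(g) as.data.frame(map(g, ~ .x / sum(.x))))

# Each column should now sum to 1 within its group
map(shares, colSums)
```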

0 0