One function to run them all… Or just eval

来源：互联网发布：c语言的广泛应用编辑：程序博客网时间：2024/05/16 07:14

When I was writing a sysBio package, I needed a function that can be customized based on the user specified model. Such function would include user specified variables (parameters, rates, etc.) with assigned values, functions/expression that use those variables, and the ordinary differential equations that describe the model. After doing some reading, the eval function seemed like a logical choice. It was pretty easy to figure out how to use it and to implement the custom function I needed (if interested, you can see it in makeODEs.R and simulateModel.R).

This made me think about how easy it would be to use eval to create a general purpose function. Before I go any further, let me first say that I am aware of issues of using eval (security, harder readability and debugging, extra complexity, etc). If the hypothetical situation described below is a real-life problem I am working on, I would weight pros and cons before calling the presented eval approach the chosen approach. But this is just me being curious and we can pretend it is ok.

So, let’s say that we are working with people (users) whose goals are to create reports from different real-life data. Let’s also assume that these data are represented in tabular format and that reports represent summaries based on mathematical functions applied on the data presented in the tables. Additionally, let’s assume that our users have phobia from writing or learning any type of code (I have to add this, because my significant other had a very strong stand against eval and advocated for teaching the users to learn or even copy/paste the code in instead).

Clearly, using R, we could address users’ needs pretty easy – we could load and store users’ data in data frames, (check the data for consistency), use some type of ply function to summarize the data, and print the reports in the output files. Ignoring the read/check/print parts, we could create a summary in a single line command.

Now, let’s assume that users often want to change their summaries, for example every few months they remove one of the old summary functions and add one new . In order to keep users’ demands current, we would have to go and change our summary commands quite often.

This is where eval approach comes to play. Can we create a single, general function that will be able to handle different, specific users’ data summary requests?

The answer is yes.
However, note that the presented approach does not do any data evaluation and checking, nor it performs security check on user passed strings (which, in real-life, would be pretty big security concern).

We will assume that user’s data is already in the form of data frame (with stringsAsFactors set to FALSE). Furthermore, we will assume that users provide a list of variables (that correspond to table column names) over which they want to split and summarize the data (character vector), as well as a two-column table (data frame) which will contain the user specified name of the function used for summary (first column) and the function used for summary (second column).

As data example, we can use mtcars data set from the datasetslibrary. We will assume that the user is interested in information about gears and for each gear wants to calculate some custom value (defined as mpg+2*carb, total hp (defined as sum(hp)), and average mpg (defined as mean(mpg)):

ex1_df <- datasets::mtcars
ex1_vars <- "gear"
ex1_formulas <- data.frame(labs=c("cust_val", "total_hp", "avg_mpg"), ress=c("mpg+2*carb", "sum(hp)", "mean(mpg)"), stringsAsFactors = FALSE)

If we were coding the user's request, we could use the ddply function from the plyr package as follow:
ex1_trueResult <- plyr::ddply(ex1_df, .variable=c("gear"), function(x) data.frame(cust_val = x$mpg+2*x$carb, total_hp = sum(x$hp), avg_mpg = mean(x$mpg)))

The first thing I tried to do was to create a function that will recreate the ddply call:

 oneForMany_first <- function(df_to_use= NA, cols_fix = NA, cols_ops = NA){   if ((is.null(df_to_use))| (is.null(cols_fix))| (is.null(cols_ops))) stop("You did not specify all arguments!")   if ((!is.data.frame(df_to_use))| (!is.character(cols_fix))| (!is.data.frame(cols_ops))) stop("Wrong argument type(s)! First and third arguments need to be a data frame; second argument needs to be a vector of characters!")   if (ncol(cols_ops)>2) stop("Data frame defined by third argument should have only two columns!")   # Transform assignments into appropriate format col_names.sorted <- sort(colnames(df_to_use)) col_names.length <- length(col_names.sorted)   # DF needs to be in stringsAsFactors=FALSE mode for (jin 1:nrow(cols_ops)){   for (iin col_names.length:1){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]),col_names.sorted[i], paste("hlp[",as.character(i),"]",sep="")) }   for (iin 1:col_names.length){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]), paste("hlp\\[",as.character(i),"\\]",sep=""), paste("x$",col_names.sorted[i], sep="")) } }   # Create a vector of functions func_list <- parse(text = paste(as.character(cols_ops[,1])," = ", cols_ops[,2],sep=""))   # Run plyer plyr::ddply(df_to_use,.variable=cols_fix,function(x)data.frame(eval(func_list))) }
view rawoneForMany_first.R hosted with ❤ by GitHub

Running this function will create a summary, but only for avg_mpg:
ex1_res01 <- oneForMany_first(ex1_df, ex1_vars, ex1_formulas)

This is because when an expression vector is passed to eval, the result of evaluating is the result of evaluating the last element. To get around this issue, we can use a sapply command:

 oneForMany_second <- function(df_to_use= NA, cols_fix = NA, cols_ops = NA){   if ((is.null(df_to_use))| (is.null(cols_fix))| (is.null(cols_ops))) stop("You did not specify all arguments!")   if ((!is.data.frame(df_to_use))| (!is.character(cols_fix))| (!is.data.frame(cols_ops))) stop("Wrong argument type(s)! First and third arguments need to be a data frame; second argument needs to be a vector of characters!")   if (ncol(cols_ops)>2) stop("Data frame defined by third argument should have only two columns!")   # Transform assignments into appropriate format col_names.sorted <- sort(colnames(df_to_use)) col_names.length <- length(col_names.sorted)   # DF needs to be in stringsAsFactors=FALSE mode for (jin 1:nrow(cols_ops)){   for (iin col_names.length:1){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]),col_names.sorted[i], paste("hlp[",as.character(i),"]",sep="")) }   for (iin 1:col_names.length){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]), paste("hlp\\[",as.character(i),"\\]",sep=""), paste("x$",col_names.sorted[i], sep="")) } }   # Create a vector of functions func_list <- parse(text = paste(as.character(cols_ops[,1])," = ", cols_ops[,2],sep=""))   # Run plyer plyr::ddply(df_to_use,.variable=cols_fix,function(x)data.frame(sapply(func_list,function(y) eval(y)))) }
view rawoneForMany_second.R hosted with ❤ by GitHub

This time, all expressions have been evaluated and all calculations have been performed. However, due to the column names mismatches (the function did not assigned the proper names to the columns), the data is not in desired format and one would need to parse it in order to use it (which can be done, but it is not what we want).
ex1_res02 <- oneForMany_second(ex1_df, ex1_vars, ex1_formulas)

Similarly, if we use a list instead of a vector, all expressions are evaluated and all values are be calculated.

 oneForMany_third <- function(df_to_use= NA, cols_fix = NA, cols_ops = NA){   if ((is.null(df_to_use))| (is.null(cols_fix))| (is.null(cols_ops))) stop("You did not specify all arguments!")   if ((!is.data.frame(df_to_use))| (!is.character(cols_fix))| (!is.data.frame(cols_ops))) stop("Wrong argument type(s)! First and third arguments need to be a data frame; second argument needs to be a vector of characters!")   if (ncol(cols_ops)>2) stop("Data frame defined by third argument should have only two columns!")   # Transform assignments into appropriate format col_names.sorted <- sort(colnames(df_to_use)) col_names.length <- length(col_names.sorted)   # DF needs to be in stringsAsFactors=FALSE mode for (jin 1:nrow(cols_ops)){   for (iin col_names.length:1){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]),col_names.sorted[i], paste("hlp[",as.character(i),"]",sep="")) }   for (iin 1:col_names.length){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]), paste("hlp\\[",as.character(i),"\\]",sep=""), paste("x$",col_names.sorted[i], sep="")) } }   # Create a vector of functions func_list <- parse(text = paste("as.list(c(", paste(as.character(cols_ops[,1])," = ", cols_ops[,2],sep="",collapse=","), "))", sep=""))   # Run plyer plyr::ddply(df_to_use,.variable=cols_fix,function(x)data.frame(eval(func_list))) }
view rawoneForMany_third.R hosted with ❤ by GitHub

But the resulting data is also not in desired format and should be parsed.
ex1_res03 <- oneForMany_third(ex1_df, ex1_vars, ex1_formulas)

Now, let's try something slightly different. Let's use eval to create a function that returns a data frame with desired columns (user specified functions):

 oneForMany_fourth <- function(df_to_use= NA, cols_fix = NA, cols_ops = NA){   if ((is.null(df_to_use))| (is.null(cols_fix))| (is.null(cols_ops))) stop("You did not specify all arguments!")   if ((!is.data.frame(df_to_use))| (!is.character(cols_fix))| (!is.data.frame(cols_ops))) stop("Wrong argument type(s)! First and third arguments need to be a data frame; second argument needs to be a vector of characters!")   if (ncol(cols_ops)>2) stop("Data frame defined by third argument should have only two columns!")   # Transform assignments into appropriate format col_names.sorted <- sort(colnames(df_to_use)) col_names.length <- length(col_names.sorted)   # DF needs to be in stringsAsFactors=FALSE mode for (jin 1:nrow(cols_ops)){   for (iin col_names.length:1){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]),col_names.sorted[i], paste("hlp[",as.character(i),"]",sep="")) }   for (iin 1:col_names.length){ cols_ops[j,2]<- stringr::str_replace_all(as.character(cols_ops[j,2]), paste("hlp\\[",as.character(i),"\\]",sep=""), paste("x$",col_names.sorted[i], sep="")) } }   func_list <- function(x){ eval(parse(text = paste(as.character(cols_ops[,1])," <- ", cols_ops[,2],sep=""))) eval(parse(text = paste("data.frame(", paste(as.character(cols_ops[,1]),sep="",collapse=","), ")", sep=""))) }     plyr::ddply(df_to_use,.variable=cols_fix,function(x) func_list(x)) }
view rawoneForMany_fourth.R hosted with ❤ by GitHub

ex1_res04 <- oneForMany_fourth(ex1_df, ex1_vars, ex1_formulas)

This seems to work. We can test to confirm that the obtain results are the same as the hard coded ones:
testthat::expect_that(ex1_trueResult, testthat::equals(ex1_res04))

Alternatively, instead of trying to replicate the form of the hard-coded function, we could use the do.call function. For do.call, we need to specify the function we use and provide a list of arguments.

 oneForMany_fifth <- function(df_to_use= NA, cols_fix = NA, cols_ops = NA){   if ((is.null(df_to_use))| (is.null(cols_fix))| (is.null(cols_ops))) stop("You did not specify all arguments!")   if ((!is.data.frame(df_to_use))| (!is.character(cols_fix))| (!is.data.frame(cols_ops))) stop("Wrong argument type(s)! First and third arguments need to be a data frame; second argument needs to be a vector of characters!")   if (ncol(cols_ops)>2) stop("Data frame defined by third argument should have only two columns!")   func_list <- eval(parse(text= paste("as.list(c(", paste("parse(text=\"", as.character(cols_ops[,2]),sep="",collapse = "\"),"), "\")))",sep="")))   func_list <- setNames(func_list, as.character(cols_ops[,1]))   result_mutate <- do.call(plyr::ddply, c(list(.data= df_to_use,.variables = cols_fix, .fun = plyr::mutate),func_list))   result_mutate[, c(cols_fix, as.character(cols_ops[,1]))] }
view rawoneForMany_fifth.R hosted with ❤ by GitHub

Note that we used the mutate instead of summarise function even thought we want data summary and this choice required additional filtering step. While summarise function perform group-wise summaries and creates a new data frame based on specified variables and functions, it fails when functions return data of different length (in our case, sum(hp) and mean(mpg) return single values, while mpg+2*carb result in multiple values).The mutatefunction adds columns to the existing data frame, so unequal lengths are not the issue.Another important thing to mention is that the arguments list has to be a named list.

We can run it and test the results
ex1_res05 <- oneForMany_fifth(ex1_df, ex1_vars, ex1_formulas) testthat::expect_that(ex1_trueResult, testthat::equals(ex1_res05))

This and another example are also available at:

 # Using eval to create a function that will be able to produce a data summary (using the plyr::ddply function) # based on the user specified data frame, variables (that correspond to table column names) over which they # want to split and summarize the data (character vector), and a two-column table (data frame) which contains # the user specified name of the function used for summary (first column) and the function used for summary (second column) # # For more info, see http://www.vesnam.com/Rblog/one-function-to-run-them-all/ #   # Data, variable, and formulas used for data summary ex1_df <- datasets::mtcars ex1_vars <- "gear" ex1_formulas <- data.frame(labs=c("cust_val","total_hp","avg_mpg"),ress=c("mpg+2*carb","sum(hp)","mean(mpg)"),stringsAsFactors = FALSE)   # This is the expected result ex1_trueResult <- plyr::ddply(ex1_df,.variable=c("gear"),function(x) data.frame(cust_val= x$mpg+2*x$carb, total_hp = sum(x$hp), avg_mpg = mean(x$mpg)))   # Testing version #! # It passes an expression vector to eval and only the last element is evaluated. ex1_res01 <- oneForMany_first(ex1_df, ex1_vars, ex1_formulas)   # Testing version #2 # Using sapply to overcome the eval expression vector issue # Everything is calculated, but data is not in desired format (due to column names mismatches) # and one would need to parse it ex1_res02 <- oneForMany_secpmd(ex1_df, ex1_vars, ex1_formulas)   # Testing version #3 # Using a list instead of vector; everything is calculated, but data is not in desired format # and one would need to parse it ex1_res03 <- oneForMany_third(ex1_df, ex1_vars, ex1_formulas)   # Testing version #4 # Using eval to create a function that returns a data frame with desired columns (user specified functions) ex1_res04 <- oneForMany_fourth(ex1_df, ex1_vars, ex1_formulas) # We can test to confirm that the results are OK testthat::expect_that(ex1_trueResult,testthat::equals(ex1_res04))   # Testing version #5 # Using do.call to call plyer ex1_res05 <- oneForMany_fifth(ex1_df, ex1_vars, ex1_formulas) # We can test to confirm that the results are OK testthat::expect_that(ex1_trueResult,testthat::equals(ex1_res05))   # Different data set ex2_df <- datasets::quakes ex2_vars <- c("lat","long") ex2_formulas <- data.frame(what=c("l1","l2","l3","l4"), how=c("lat+2*long","depth^2/log(mag+1)","(lat+long)/(stations*0.5)","mean(mag)"),stringsAsFactors = FALSE)   ex2_trueResult <- plyr::ddply(ex2_df,.variable=c("lat","long"),function(x) data.frame(l1= x$lat+2*x$long, l2 = x$depth^2/log(x$mag+1), l3 = (x$lat+x$long)/(x$stations*0.5), l4 = mean(x$mag)))   ex2_res04 <- oneForMany_fourth(ex2_df, ex2_vars, ex2_formulas) testthat::expect_that(ex2_trueResult,testthat::equals(ex2_res04))   ex2_res05 <- oneForMany_fifth(ex2_df, ex2_vars, ex2_formulas) testthat::expect_that(ex2_trueResult,testthat::equals(ex2_res05))
view rawoneForMany.R hosted with ❤ by GitHub

0 0