Lexical scope and function closures in R


Introduction

R is different to many “easy to use” statistical software packages – it expects to be given commands at the R command prompt. This can be intimidating for new users, but is at the heart of its power. Most powerful software tools have an underlying scripting language, because scriptable tools are typically more flexible and easier to automate. In fact, even software packages like Excel or Minitab have a macro programming language behind the scenes available for “power users” to exploit.

Programming from the ground up

It is natural to want to automate (repetitive) tasks on a computer, to automate a “work flow”. This is especially natural for computational tasks, as all software tools are built from programming language components, anyway. In R, you do stuff by executing a sequence of commands. By putting a bunch of commands one after another into a text file, we can source the file, and script R. Scripting is the simplest form of programming – automating a sequence of tasks. Indeed, in Unix (including Linux and macOS), we can put a bunch of Unix shell commands together in a shell script. In Windows, you can put a bunch of terminal commands together in a batch file.
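
As a minimal sketch of scripting R, the following writes a couple of commands to a temporary file and then sources it. The file contents and the names heights and mean_height are made up for illustration; in practice the script would be a file you edit by hand.

```r
# Write a small script to a temporary file (normally you would edit this by hand)
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "heights <- c(1.62, 1.75, 1.80)",
  "mean_height <- mean(heights)"
), script_file)

# Run every command in the file, in order, as if typed at the prompt
source(script_file)
print(mean_height)
```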

Next, one can add in simple control structures, to support looping, branching and conditional execution. Looping allows repetition of very similar tasks. Branching and conditional execution allow decisions to be made depending on what has already happened. Most scripting languages support simple control structures – this allows carrying out of tasks which we could do in principle, but perhaps not in practice, due to the laborious and repetitive nature of some work-flows. We can go a long way with this, but…
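
For illustration, here is a small sketch of looping and conditional execution in R. The data and the names measurements and labels are hypothetical.

```r
# Loop over a vector of (made-up) measurements and branch on each value
measurements <- c(3.1, -0.4, 7.8, 0.0)
labels <- character(length(measurements))

for (i in seq_along(measurements)) {
  if (measurements[i] > 0) {
    labels[i] <- "positive"
  } else if (measurements[i] < 0) {
    labels[i] <- "negative"
  } else {
    labels[i] <- "zero"
  }
}

print(labels)
```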

Although scripting is a simple form of programming, it isn’t “real” programming, or software engineering. Software engineering is about developing flexible, modular, robust, re-usable, generic program components, and using them to build large, complex software systems – modularity is absolutely key here. Functions and procedures are a first step towards introducing modularity, allowing the development of “real” software. Proper support for these tends to distinguish “real” programming languages from scripting languages (though many modern “scripting” languages have at least some limited support, and the distinction between scripting languages and “real” languages is now very blurred).

Functions and procedures

Procedures (or subroutines) are re-usable pieces of code which can be called from other pieces of code when needed. They may be provided with inputs, but do not have to be. They are usually called for their “side-effects”, such as doing plots, changing global variables, or reading/writing data to/from disk.

Functions are also re-usable pieces of code, but are mainly used to obtain a return-value that is computed on the basis of the given inputs. “Pure” functions do not have any side-effects. Functions and procedures may be combined in a hierarchical way to build large, complex algorithms from much simpler modular components. Note that many languages (including R), do not make a distinction between functions and procedures in the syntax of the language, but conceptually the distinction is really quite important.
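
The conceptual distinction can be sketched in R as follows, even though both are written with the same function syntax. The names rescale and report_range are invented for this example.

```r
# A "pure" function: the result depends only on the inputs, no side-effects
rescale <- function(x, lower = 0, upper = 1) {
  (x - min(x)) / (max(x) - min(x)) * (upper - lower) + lower
}

# A "procedure": called for its side-effect (printing a summary line)
report_range <- function(x) {
  cat("range:", min(x), "to", max(x), "\n")
  invisible(NULL)
}

scores <- c(10, 15, 20)
scaled <- rescale(scores)   # 0.0 0.5 1.0
report_range(scaled)
```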

Variable scope

Almost all programming languages allow the definition of variables which are labels or tags representing or pointing at some value that may be defined and re-defined at run-time. In most modern programming languages, functions can define local variables which can be used in addition to any inputs (formal parameters) of the function – these are very important for the development of modular, re-usable code components. In particular, they help to avoid unanticipated name clashes in the global name-space. If a function refers to a variable which is neither a formal parameter nor a local variable, then a rule is needed to find which (if any) variable with that label is in scope for the function, so that the program can know what value to use.
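
A short sketch of how a local variable avoids a name clash with the global name-space (the names total and add_up are hypothetical):

```r
total <- 100   # a global variable

add_up <- function(x) {
  total <- sum(x)   # a local variable: it masks, but does not change, the global
  total
}

result <- add_up(c(1, 2, 3))
print(result)   # 6
print(total)    # still 100
```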

Dynamic scope

Under dynamic scope, if an “unknown” variable is referred to in a function, the idea is to use the version of the variable that is in scope at the time that the function was called (and apply this rule recursively) – this is close to the scoping rule used by the S-PLUS implementation of the S language (strictly, S resolves free variables as global variables rather than through the chain of callers). Dynamic scope was common among early dynamic programming languages – including early implementations of LISP (and is still used in Emacs LISP), as it was quite intuitive and natural to implement using a stack-based approach similar to the stack-based approach to passing variables in and out of subroutines commonly used by machine code and assembly programmers.

Despite being intuitively appealing, at least initially, there are a number of problems with dynamic scope in practice. In particular, we can’t really know by code inspection whether or not a given section of code will run in all situations without actually running the code, as we can’t know whether all variable bindings will resolve correctly. This is an issue even for dynamic languages, but is particularly problematic for statically typed compiled languages, as it becomes difficult for the compiler to figure out the types of all variables correctly and therefore generate the appropriate code. It is also very difficult for a function to have associated state – to do this, you must somehow get state variables into the global name-space where they then become vulnerable to masking and name clashes. See the Wikipedia page on scope for further details.
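
R itself is not dynamically scoped, but the idea can be emulated for illustration by explicitly looking a variable up in the caller's frame with parent.frame(). The sketch below is only an approximation (the lookup starts in the immediate caller and then follows that frame's enclosing environments, not the full call stack), and all of the names in it are invented.

```r
rate <- 10   # global binding

# Emulates dynamic scope: look up 'rate' starting from the caller's frame
show_dynamic <- function() {
  caller_env <- parent.frame()
  get("rate", envir = caller_env)
}

# Ordinary R function: 'rate' is found where the function was defined
show_lexical <- function() {
  rate
}

caller <- function() {
  rate <- 99   # local to caller
  dyn <- show_dynamic()
  lex <- show_lexical()
  c(dynamic = dyn, lexical = lex)
}

out <- caller()
print(out)
```

Whether show_dynamic() works at all depends on what happens to be defined in whichever function calls it, which is exactly the code-inspection problem described above.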

Lexical scope

Under lexical scoping rules, if an “unknown” variable is referred to in a function, the idea is to use the version that is “in scope” in the enclosing piece of code (and apply this rule recursively) – this is the scoping rule used by R (as R’s semantics are derived from Scheme, a LISP derivative which emphasises lexical scope). Variable bindings can all be resolved, checked and verified at compile-time – this is safer, and in many other ways better. Most modern languages adopt lexical scoping, including most functional languages, such as LISPs (including LISP-STAT) and derivatives. In fact, I first read about lexical scope, function closures and their use in statistical computing in Luke Tierney’s LISP-STAT book (Tierney, 1990) in the early 1990s. That book was published over 20 years ago, so it just goes to show that there is nothing new about these functional programming approaches. In fact, although Tierney’s book describes a now obsolete system, I would nevertheless recommend reading it if you can find a copy, as I think it is still one of the best books on statistical computing ever written. It really puts the recent glut of horrible R-themed books to shame!
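
A minimal sketch of a function closure in R, using made-up names (make_scaler, multiplier): the inner function refers to a variable that is neither an argument nor a local, and R resolves it in the environment where the inner function was created.

```r
make_scaler <- function(multiplier) {
  # 'multiplier' is a free variable in the inner function; it is resolved in
  # the environment where the inner function was created
  function(x) x * multiplier
}

double <- make_scaler(2)
triple <- make_scaler(3)

multiplier <- 100   # a global 'multiplier' does not affect the closures
print(double(5))    # 10
print(triple(5))    # 15
```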

Given that R has been lexically scoped and has supported function closures since day one, it is reasonable to wonder why this programming style is not used more widely in R code. I think it is the difference in scoping rules between S-PLUS and R that has led to a fear of developing any R code which relies on non-local scoping rules. Certainly, in the early days of R, I would use S-PLUS at work and R at home, and I would want my code to work in exactly the same way in both places! This is a shame, as lexical scoping is very powerful, and exploited widely in functional programming styles. The use of lexical scope and function closures in R is described quite nicely in Gentleman (2008), along with many other things.
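
One thing closures make easy is giving a function its own private state without touching the global name-space. The counter below is a common illustrative sketch rather than anything taken from the references above; make_counter and count are hypothetical names.

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates 'count' in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

counter_a()
counter_a()
print(counter_a())   # 3
print(counter_b())   # 1
```

Each call to make_counter() creates a fresh environment, so the two counters cannot interfere with each other or with any global variable called count.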

To make sure that the concepts are clear, inspect the following piece of code and figure out what the result of the final function call will be. The answer is given below the code, so try not to peek before reading on…

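# Global (top-level) bindings
a <- 1
b <- 2
# f refers to a and b, which are neither arguments nor local variables
f <- function(x) {
  a * x + b
}
# g defines its own local a and b, then calls f
g <- function(x) {
  a <- 2
  b <- 1
  f(x)
}
g(2)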

No, really, try and figure it out before reading on for the answer! Understanding this example is key to understanding the difference between lexical and dynamic scope. Clearly the obvious answers are 4 and 5. If you didn’t get one of those, go back and try again!

0 0