Examples of the perils of globals in R and Stata

In recent conversations with fellow students, I have been advocating for avoiding globals except to store constants. This is a sort of typical applied statistics-type program where everyone writes their own code and project sizes are on the small side, so it can be hard for people to see the trouble caused by sloppy habits.

In talking about avoidance of globals, I'm focusing on the following reasons why globals might cause trouble, but I'd like to have some examples in R and/or Stata to go with the principles (and any other principles you might find important), and I'm having a hard time coming up with believable ones.

  • Non-locality: Globals make debugging harder because they make understanding the flow of code harder
  • Implicit coupling: Globals break the simplicity of functional programming by allowing complex interactions between distant segments of code
  • Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions

A useful answer to this question would be a reproducible and self-contained code snippet in which globals cause a specific type of trouble, ideally with another code snippet in which the problem is corrected. I can generate the corrected solutions if necessary, so the example of the problem is more important.

Relevant links:

Global Variables are Bad

Are global variables bad?

Livorno answered 2/4, 2011 at 22:29 Comment(0)

I also have the pleasure of teaching R to undergraduate students who have no experience with programming. The problem I found was that most examples of why globals are bad are rather simplistic and don't really get the point across.

Instead, I try to illustrate the principle of least astonishment. I use examples where it is tricky to figure out what is going on. Here are some examples:

  1. I ask the class to write down what they think the final value of i will be:

    i = 10
    for(i in 1:5)
        i = i + 1
    i
    

    Some of the class guess correctly. Then I ask: should you ever write code like this?

    In some sense i is a global variable that is being changed.

  2. What does the following piece of code return:

    x = 5:10
    x[x=1]
    

    The problem is: what exactly do we mean by x?

  3. Does the following function return a global or local variable:

     z = 0
     f = function() {
         if(runif(1) < 0.5)
              z = 1
         return(z)
      }
    

    Answer: both. Again discuss why this is bad.
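
    One way to remove the ambiguity, sketched here with a made-up name (f_explicit is not part of the original example): hand the starting value in as an argument, so the function only ever returns its own local z.

```r
# Hypothetical rewrite of f(): the starting value is an argument,
# so the return value is always the function's local z.
f_explicit <- function(z = 0) {
    if (runif(1) < 0.5) {
        z <- 1   # changes the local copy only
    }
    z
}

z <- 0
set.seed(42)
results <- replicate(20, f_explicit(z))
z   # the global z is unchanged
```

    The function can still return 0 or 1, but which z it returns is no longer in question.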

Lettuce answered 3/4, 2011 at 8:41 Comment(5)
These are all clever, but I really like #3 here, as it provides a nice entrée into broader discussions of scope and return values. (Livorno)
Is the x=1 in #2 a mistake or a deliberate assignment to daze and confuse poor unsuspecting undergrads? (Golgi)
@RomanLuštrik I came upon this example when a poor, confused undergraduate submitted their homework. I just use it to make people think. What do they think is happening? (Lettuce)
A fun modification of #2: x<-5:10;x[x<-1] returns 5, but x is now 1 in the global environment. (Livorno)
x[x=x] could be a fun modification. As would x[x=x<-1] and x[x<-x<-1]. The latter gives different results the first and second time it is run. But I guess it might be a bit unnecessary, and you would have bored some of your students by then... (Hebraism)

Oh, the wonderful smell of globals...

All of the answers in this post gave R examples, and the OP wanted some Stata examples, as well. So let me chime in with these.

Unlike R, Stata does take care of locality of its local macros (the ones that you create with the local command), so the issue of "Is this a global z or a local z that is being returned?" never comes up. (Gosh... how can you R guys write any code at all if locality is not enforced???) Stata has a different quirk, though, namely that a non-existent local or global macro is evaluated as an empty string, which may or may not be desirable.

I have seen globals used for several main reasons:

  1. Globals are often used as shortcuts for variable lists, as in

    sysuse auto, clear
    regress price $myvars
    

I suspect that the main use of such a construct is for someone who switches between interactive typing and storing the code in a do-file as they try multiple specifications. Say they try regression with homoskedastic standard errors, heteroskedastic standard errors, and median regression:

    regress price mpg foreign
    regress price mpg foreign, robust
    qreg    price mpg foreign

And then they run these regressions with another set of variables, then with yet another one, and finally they give up and set this up as a do-file myreg.do with

    regress price $myvars
    regress price $myvars, robust
    qreg    price $myvars
    exit

to be accompanied by an appropriate setting of the global macro. So far so good; the snippet

    global myvars mpg foreign
    do myreg

produces the desired results. Now let's say they email their famous do-file that claims to produce very good regression results to collaborators, and instruct them to type

    do myreg

What will their collaborators see? In the best case, if they started a new instance of Stata, the mean and the median of price, because an undefined $myvars expands to nothing and each command becomes a constant-only model (failed coupling: myreg.do did not really know you meant to run this with a non-empty variable list). But if the collaborators had something in the works, and also had a global myvars defined (name collision)... man, would that be a disaster.

You can take it a half step further in obscurity. Let's say that the global macro myvars is defined as global myvars mpg foreign, robust (nobody enforces what goes into the macro, right?). Then the first reg $myvars will produce the regression with HCE standard errors; the second reg $myvars, robust is going to complain that the variable robust isn't found, and qreg $myvars will complain about option robust not being supported.

  2. Globals are used for directory or file names, as in:

    use $mydir\data1, clear


    God only knows what will be loaded. In large projects, though, it does come in handy. You would want to define global mydir somewhere in your master do-file, maybe even as

    global mydir `c(pwd)'

  3. Globals can be used to store unpredictable crap, like a whole command:

    capture $RunThis


    God only knows what will be executed; let's just hope it is not ! format c:\. This is the worst case of implicit strong coupling, but since I am not even sure that RunThis will contain anything meaningful, I put a capture in front of it, and will be prepared to treat the non-zero return code _rc. (See, however, my example below.)

  4. Globals as behavior switches (page 16 of https://hwpi.harvard.edu/files/sdp/files/sdp-toolkit-coding-style-guide.pdf). Don't. This just means you need to break your code into separate do-files and run each as needed. Even if the switch is preceded by extensive data manipulation that takes computing time, that only means the manipulation step should write its results to disk, and the next step (the one they show as // STUFF) should begin with use that_data, clear.

  5. Stata's own use of globals is for God settings, like the type I error probability/confidence level: the global $S_level is always defined (and you must be a total idiot to redefine this global, although of course it is technically doable). This is, however, mostly a legacy issue with code of version 5 and below (roughly), as the same information can be obtained from a less fragile system constant:

    set level 90
    display $S_level
    display c(level)
    

Thankfully, globals are quite explicit in Stata, and hence are easy to debug and remove. In some of the above situations, and certainly in the first one, you'd want to pass parameters to do-files which are seen as the local `0' inside the do-file. Instead of using globals in the myreg.do file, I would probably code it as

    unab varlist : `0'
    regress price `varlist'
    regress price `varlist', robust
    qreg    price `varlist'
    exit

The unab thing will serve as an element of protection: if the input is not a legal varlist, the program will stop with an error message.

In the worst cases I've seen, the global was used only once after having been defined.

There are occasions when you do want to use globals, because otherwise you'd have to pass the bloody thing to every other do-file or a program. One example where I found the globals pretty much unavoidable was coding a maximum likelihood estimator where I did not know in advance how many equations and parameters I would have. Stata insists that the (user-supplied) likelihood evaluator will have specific equations. So I had to accumulate my equations in the globals:

global my_parameters

forvalues k=1/`number_of_equations' {
   local this_equation: piece `k' of syntax
   // maybe do more parsing of the equation as needed
   global my_parameters ${my_parameters} (eq`k': parsed_specification) 
}

... and then call my evaluator with the globals in the descriptions of the syntax that Stata would need to parse:

args lf ${my_parameters}

where lf was the objective function (the log-likelihood). I encountered this at least twice, in the normal mixture package (denormix) and confirmatory factor analysis package (confa); you can findit both of them, of course.

Educator answered 2/8, 2011 at 19:2 Comment(1)
Beautiful answer StasK. Thank you. I recognize a lot of Stata pain of my own in your explanations. Personally, especially in Stata where macros can contain half-commands and other insanity, I try to stick to the Code Complete recommendations and use them only in lieu of constants. (Livorno)

One R example of a global variable that divides opinion is the stringsAsFactors issue on reading data into R or creating a data frame.

set.seed(1)
str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
               DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
options("stringsAsFactors" = FALSE)
set.seed(1)
str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
               DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
options("stringsAsFactors" = TRUE) ## reset

This can't really be corrected because of the way options are implemented in R - anything could change them without you knowing it and thus the same chunk of code is not guaranteed to return exactly the same object. John Chambers bemoans this feature in his recent book.
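
For this particular option, code you write yourself can at least avoid depending on the global: data.frame() accepts stringsAsFactors as an argument, and an explicit value is used regardless of what options() currently says. A minimal sketch (the data frame names are made up for illustration):

```r
# Passing the argument explicitly makes the result independent
# of the global option, whatever it happens to be set to.
chr_df <- data.frame(A = c("a", "b", "a"), stringsAsFactors = FALSE)
fct_df <- data.frame(A = c("a", "b", "a"), stringsAsFactors = TRUE)

class(chr_df$A)   # "character"
class(fct_df$A)   # "factor"
```

That does nothing for functions you call that rely on the option internally, which is the real complaint here.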

Thorbert answered 2/4, 2011 at 22:45 Comment(4)
This is a nifty example, and actually eye-opening for me since I probably rely on function defaults more than I should. However, it's an example of how someone else (the S/R designers) used globals to create trouble, rather than how a basic applied statistical programmer might run into trouble with them. So it's helpful and much appreciated, but not precisely what I can use as an example. (Livorno)
"Chambers bemoans... " What page? The index has no entries for "factors", "globals" or "stringsAsFactors", and I just went through section 6.5, which covers data frames, without finding it. (Beanie)
@DWin the general idea is that a function isn't in general guaranteed to return the same output when provided with the same inputs. IIRC it was in the opening chapter(s) of the book. My copy is at work so I can't look for it now. (Thorbert)
@DWin, see section 3.2 on Functions & Functional Programming: '"More insidious are functions, such as options(), that create a hidden side effect, usually in C code."' (Senhor)

A pathological example in R is the use of one of the globals available in R, pi, to compute the area of a circle.

> r <- 3
> pi * r^2
[1] 28.27433
> 
> pi <- 2
> pi * r^2
[1] 18
> 
> foo <- function(r) {
+     pi * r^2
+ }
> foo(r)
[1] 18
> 
> rm(pi)
> foo(r)
[1] 28.27433
> pi * r^2
[1] 28.27433

Of course, one can write the function foo() defensively by forcing the use of base::pi but such recourse may not be available in normal user code unless packaged up and using a NAMESPACE:

> foo <- function(r) {
+     base::pi * r^2
+ }
> foo(r = 3)
[1] 28.27433
> pi <- 2
> foo(r = 3)
[1] 28.27433
> rm(pi)

This highlights the mess you can get into by relying on anything that is not solely in the scope of your function or passed in explicitly as an argument.

Thorbert answered 4/4, 2011 at 8:35 Comment(0)

Here's an interesting pathological example involving replacement functions, the global assign, and x defined both globally and locally...

x <- c(1,NA,NA,NA,1,NA,1,NA)

local({

    #some other code involving some other x begin
    x <- c(NA,2,3,4)
    #some other code involving some other x end

    #now you want to replace NAs in the global/parent frame x with 0s
    x[is.na(x)] <<- 0
})
x
[1]  0 NA NA NA  0 NA  1 NA

Instead of returning [1] 1 0 0 0 1 0 1 0, the replacement function uses the index returned by the local value of is.na(x), even though you're assigning to the global value of x. This behavior is documented in the R Language Definition.
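
A sketch of one way around it, assuming a hypothetical helper replace_na_with() (not from the original answer): the vector is passed in, modified as a local copy, and returned, so the index and the assignment target are guaranteed to be the same object.

```r
# Hypothetical helper: works on its argument only and returns the result.
replace_na_with <- function(v, value = 0) {
    v[is.na(v)] <- value
    v
}

x <- c(1, NA, NA, NA, 1, NA, 1, NA)
x <- replace_na_with(x)
x
# [1] 1 0 0 0 1 0 1 0
```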

Muller answered 1/8, 2012 at 16:28 Comment(0)

One quick but convincing example in R is to run a line like:

.Random.seed <- 'normal'

I chose 'normal' as something someone might choose, but you could use anything there.

Now run any code that uses generated random numbers, for example:

rnorm(10)

Then you can point out that the same thing could happen for any global variable.

I also use the example of:

x <- 27
z <- somefunctionthatusesglobals(5)

Then ask the students what the value of x is; the answer is that we don't know.
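
To make that concrete in class, here is one possible body for somefunctionthatusesglobals(). The function is only a placeholder in the example above, so everything inside it is an assumption made up for illustration:

```r
# Hypothetical implementation: returns a value, but also
# overwrites x in the calling environment as a side effect.
somefunctionthatusesglobals <- function(n) {
    x <<- x + n
    n * 2
}

x <- 27
z <- somefunctionthatusesglobals(5)
x   # 32, not 27
z   # 10
```

Nothing in the call itself hints that x is involved, which is the point.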

Autochthonous answered 3/4, 2011 at 2:39 Comment(0)

Through trial and error I've learned that I need to be very explicit in naming my function arguments (and ensure enough checks at the start and along the function) to make everything as robust as possible. This is especially true if you have variables stored in the global environment, but then you try to debug a function with custom values, and something doesn't add up! This is a simple example that combines bad checks and calling a global variable.

glob.arg <- "snake"
customFunction <- function(arg1) {
    if (is.numeric(arg1)) {
        glob.arg <- "elephant"
    }

    return(strsplit(glob.arg, "n"))
}

customFunction(arg1 = 1) #argument correct, expected results
customFunction(arg1 = "rubble") #works, but may have unexpected results
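
A possible repair, sketched with made-up names (customFunctionSafe and its animal argument are not from the original): the value that used to come from the global becomes an argument with a default, and it is checked up front.

```r
# Hypothetical corrected version: no global lookup, input checked early.
customFunctionSafe <- function(arg1, animal = "snake") {
    stopifnot(is.character(animal), length(animal) == 1)
    if (is.numeric(arg1)) {
        animal <- "elephant"
    }
    strsplit(animal, "n")
}

customFunctionSafe(arg1 = 1)          # splits "elephant"
customFunctionSafe(arg1 = "rubble")   # splits the default "snake", never a global
```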
Pomona answered 3/4, 2011 at 5:3 Comment(0)

An example sketch that came up while trying to teach this today. Specifically, this focuses on trying to give intuition as to why globals can cause problems, so it abstracts away as much as possible in an attempt to state what can and cannot be concluded just from the code (leaving the function as a black box).

The set up

Here is some code. Decide whether it will return an error or not based on only the criteria given.

The code

stopifnot( all( x!=0 ) )
y <- f(x)
5/x

The criteria

Case 1: f() is a properly-behaved function, which uses only local variables.

Case 2: f() is not necessarily a properly-behaved function, which could potentially use global assignment.

The answer

Case 1: The code will not return an error, since line one checks that there are no x's equal to zero and line three divides by x.

Case 2: The code could potentially go wrong, since f() could e.g. subtract 1 from x and assign it back to the x in the parent environment, where any x element equal to 1 would then be set to zero and the third line would divide by zero (which R reports as Inf rather than as an error).
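
A concrete Case 2, using a hypothetical f_bad() written to misbehave in exactly the way described (the name and body are assumptions for illustration):

```r
# Hypothetical badly-behaved f(): global assignment changes the caller's x.
f_bad <- function(v) {
    x <<- v - 1
    sum(v)
}

x <- c(1, 2, 3)
stopifnot(all(x != 0))   # passes: no zeros yet
y <- f_bad(x)
x        # 0 1 2
5 / x    # Inf 5.0 2.5
```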

Livorno answered 26/10, 2011 at 18:5 Comment(0)

Here's one attempt at an answer that would make sense to statisticsy types.

  • Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions

First we define a log likelihood function,

logLik <- function(x) {
   y <<- x^2+2
   return(sum(sqrt(y+7)))
}

Now we write an unrelated function to return the sum of squares of an input. Because we're lazy we'll do this passing it y as a global variable,

sumSq <- function() {
   return(sum(y^2))
}

y <<- seq(5)
sumSq()
[1] 55

Our log likelihood function seems to behave exactly as we'd expect, taking an argument and returning a value,

> logLik(seq(12))
[1] 88.40761

But what's up with our other function?

> sumSq()
[1] 63358

Of course, this is a trivial example, as will be any example that doesn't exist in a complex program. But hopefully it'll spark a discussion about how much harder it is to keep track of globals than locals.
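
For comparison, a sketch of the same two functions without the shared global. The names logLikLocal and sumSqArg are made up here so they don't clash with the versions above:

```r
# y is local to the likelihood function...
logLikLocal <- function(x) {
    y <- x^2 + 2
    sum(sqrt(y + 7))
}

# ...and the sum of squares takes its input as an argument.
sumSqArg <- function(y) {
    sum(y^2)
}

y <- seq(5)
sumSqArg(y)            # 55
logLikLocal(seq(12))   # same value as before
sumSqArg(y)            # still 55
```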

Livorno answered 4/4, 2011 at 16:36 Comment(0)

In R you may also try to show them that there is often no need to use globals, as you may access the variables defined in the function scope from within the function itself just by changing the environment. For example, the code below:

zz="aaa"
x = function(y) { 
     zz="bbb"
     cat("value of zz from within the function: \n")
     cat(zz , "\n")
     cat("value of zz from the function scope: \n")
     with(environment(x),cat(zz,"\n"))
}
Adp answered 2/4, 2011 at 22:54 Comment(4)
Hmm interesting. I tend to think that writing a function that uses something outside its scope is a bad idea. If you want to use something in a function it should be passed in as an argument. If done that way, the function is self-contained: the same function call will always give the same result. (Thorbert)
I am not really much into the details of how R works, but are objects passed by value or by reference? Are both ways possible or not? (Rapturous)
I think you can achieve both, but in general objects are passed by value. So when you pass in objects to functions they are copied to supply their value to the argument. (Thorbert)
To follow up on Gavin's comment, there's a discussion of pass-by-reference here: https://mcmap.net/q/268459/-can-you-pass-by-reference-in-r/… . Thanks Gavin. Always learn a lot from your posts. (Livorno)
