How to rewrite this Stata code in R?

Asked 17/2, 2011 at 2:25 Answered 20/2, 2011 at 7:19

One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?

foreach i in A B C D {
    forval n = 1990/2000 {
        local m = `n' - 1
        // create new columns from existing ones on-the-fly
        generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
    }
}

Andriaandriana answered 17/2, 2011 at 2:25 Comment(5)

For those who don't speak Stata, maybe add what the final output should look like? And the input data, for that matter... – Hargrove 17/2, 2011 at 2:29

I'm wondering what idiot designer of a statistical package decided that 1990/2000 was a range rather than a division facepalm – Pammy 17/2, 2011 at 15:25

@Spacedman: You don't know the half of it. I used Stata for 3 years. Worst. Programming. Language. Ever. – Fixation 17/2, 2011 at 15:29

@Joshua : May I kindly agree :-) But it has to be said, it is quite a powerful statistical package. You just shouldn't be dreaming about anything else but scripting your analysis. – Levulose 17/2, 2011 at 15:40

@Joris: Though I didn't explicitly say so, I agree that Stata has a lot of statistical capability. That's why I was careful to specifically say programming in Stata is terrible. ;-) – Fixation 17/2, 2011 at 15:45

DON'T do it in R. The reason it's messy is that it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this; rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.

For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.

Use a data structure that the language gives you. In this case probably a list.
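A minimal sketch of what that could look like, assuming four populations A to D and one trend value per year. The names pop1989, trend and pops and all the numbers are made up for illustration; they are not from the question.

```r
# Hypothetical inputs: 1989 populations and one trend value per year.
pop1989 <- c(A = 100, B = 200, C = 300, D = 400)
years   <- 1990:2000
trend   <- setNames(rep(0.1, length(years)), years)

# One list, keyed by year, instead of 44 pasted-together variable names.
pops <- list("1989" = pop1989)
for (y in years) {
  prev <- pops[[as.character(y - 1)]]
  pops[[as.character(y)]] <- prev * (1 + trend[[as.character(y)]])
}

length(pops)                                    # how many years are stored
all(as.character(1989:2000) %in% names(pops))   # is the sequence complete?
```

With everything in one object, the questions above have direct answers, and the whole thing can be handed to someone else in one go, for instance with saveRDS().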

Pammy answered 17/2, 2011 at 8:5 Comment(0)

Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest adding the variables to a data frame (which is also a kind of list) instead of to the global environment (see below).

But honestly, the more R-ish way to do so is to keep your factors factors instead of variable names.

I make some data as I believe it is in your R version now (at least, I hope so...)

Data <- data.frame(
    popA1989 = 1:10,
    popB1989 = 10:1,
    popC1989 = 11:20,
    popD1989 = 20:11
)

Trend <- replicate(11,runif(10,-0.1,0.1))

You can then use the stack() function to obtain a data frame in long format, and split the old column names into a factor pop and a numeric variable year:

newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL

Filling up the dataframe is then quite easy :

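for(i in 1:11){
  # rows of the most recent year computed so far (1989, 1990, ...)
  tmp <- newData[newData$year==(1988+i),]
  newData <- rbind(newData,
      data.frame( values = tmp$values*(1+Trend[,i]),  # grow by (1 + trend), as in the Stata code
                  pop = tmp$pop,
                  year = tmp$year+1
      )
  )
}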

In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
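To give an idea of what "easier" means here, a small sketch on a hypothetical long-format data frame (long, with made-up values) that has the same three columns as newData:

```r
# Hypothetical long-format data with the same columns as newData.
# expand.grid() varies the first argument fastest: A 1989-1991, then B 1989-1991.
long <- expand.grid(year = 1989:1991, pop = c("A", "B"),
                    stringsAsFactors = FALSE)
long$values <- c(100, 110, 121, 200, 220, 242)

# All populations in a single year
subset(long, year == 1990)

# One population across all years
subset(long, pop == "A")

# Mean value per population
aggregate(values ~ pop, data = long, FUN = mean)
```

None of these need any string pasting to find the right column; year and pop are ordinary variables you can filter and group on.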

And if you insist, you can still create a wide format with unstack()

unstack(newData,values~paste("pop",pop,year,sep=""))

Adaptation of Joshua's answer to add the columns to the dataframe :

for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop",L,i,sep="")  # create name for new variable
    old <- get(paste("pop",L,i-1,sep=""),Data)  # get old variable
    trend <- Trend[,i-1989]  # get trend variable
    Data <- within(Data,assign(new, old*(1+trend)))
  }
}
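The same loop can probably be written without get(), assign() and within() by indexing the data frame with [[ ]]. A sketch, assuming Data is set up as above; a constant Trend matrix stands in for the random one only so the result is easy to check:

```r
Data <- data.frame(popA1989 = 1:10, popB1989 = 10:1,
                   popC1989 = 11:20, popD1989 = 20:11)
Trend <- matrix(0.1, nrow = 10, ncol = 11)  # stand-in for the random Trend

for (L in LETTERS[1:4]) {
  for (i in 1990:2000) {
    new <- paste0("pop", L, i)      # name of the column to create
    old <- paste0("pop", L, i - 1)  # name of the previous year's column
    Data[[new]] <- Data[[old]] * (1 + Trend[, i - 1989])
  }
}
```

New columns are appended in the order they are created, so popA1990 to popA2000 come first, followed by B, C and D.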

Levulose answered 17/2, 2011 at 15:39 Comment(2)

Can you explain what you mean by "keep your factors factors instead of variable names"? – Hobbs 30/4, 2015 at 19:23

@KevinM That's the difference between "long format" and "wide format". You put all data in a single column, and use a factor or categorical variable to describe which data is from which population and year. If you use your variable names to indicate which year and population we're talking about, you'll have more difficulty using that information. Both population and year are categorical variables in terms of statistical analysis. So I keep them as a categorical variable (factor) instead of combining them to construct variable names. – Levulose 6/5, 2015 at 12:15

Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.

for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop",L,i,sep="")  # create name for new variable
    old <- get(paste("pop",L,i-1,sep=""))  # get old variable
    trend <- get(paste("trend",i,sep=""))  # get trend variable
    assign(new, old*(1+trend))
  }
}

Fixation answered 17/2, 2011 at 3:21 Comment(0)

Assuming you have population data in vector pop1989 and data for trend in trend.

library(stringr)  # because str_c has a better default for the sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)
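A base-R variant of the same idea, in case stringr isn't available: outer() gives a year-by-population matrix rather than a flat named vector. Like the code above, this assumes a single trend shared by all populations; the values of pop1989 and trend are made up for illustration.

```r
pop1989 <- c(A = 100, B = 200, C = 300, D = 400)  # hypothetical 1989 values
trend   <- rep(0.1, 11)                           # hypothetical trend, 1990 to 2000

growth <- cumprod(1 + trend)       # cumulative growth factor per year
pops   <- outer(growth, pop1989)   # rows: years, columns: populations
dimnames(pops) <- list(as.character(1990:2000), names(pop1989))

pops["1995", "C"]                  # population C in 1995
```

Indexing by year and population then replaces looking up a name like popC1995.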

Noway answered 20/2, 2011 at 7:19 Comment(0)
