Translating Stata to R: collapse

A

3

6

Just came across a .do file that I need to translate into R because I don't have a Stata license; my Stata is rusty, so can someone confirm that the code is doing what I think it is?

For reproducibility, I'm going to translate it into a data set I found online, specifically the Milk Production dataset (p004) that's part of a textbook by Chatterjee, Hadi and Price.

Here's the Stata code:

collapse (min) min_protein = protein /// 
         (mean) avg_protein = protein /// 
         (median) median_protein = protein /// 
         (sd) sd_protein = protein /// 
         if protein > 2.8, by(lactatio)

Here's what I think it's doing in data.table syntax:

library(data.table)
library(foreign)
DT = read.dta("p004.dta")
setDT(DT)

DT[protein > 2.8,
   .(min_protein = min(protein),
     avg_protein = mean(protein),
     median_protein = median(protein),
     sd_protein = sd(protein)),
   keyby = lactatio]

#    lactatio min_protein avg_protein median_protein sd_protein
# 1:        1         2.9    3.162632           3.10  0.2180803
# 2:        2         2.9    3.304688           3.25  0.2858736
# 3:        3         2.9    3.371429           3.35  0.4547672
# 4:        4         2.9    3.231250           3.20  0.3419917
# 5:        5         2.9    3.855556           3.20  1.9086061
# 6:        6         3.0    3.200000           3.10  0.2645751
# 7:        7         3.3    3.650000           3.65  0.4949748
# 8:        8         3.2    3.300000           3.30  0.1414214

Is that correct?

This would be easy to confirm if I had used Stata in the past 18 months or if I had a copy installed--hoping I can bend the ear of someone for whom either of these is true. Thanks.

Archenemy answered 22/6, 2015 at 19:13 Comment(8)

Note that you can always access Stata's help files (e.g., collapse) online, in case that helps clarify the code. – Spragens 22/6, 2015 at 19:34

yes, I read that, which is why I think the R code I supplied is right. Just hoping for some confirmation from anyone that knows both languages fairly well. – Archenemy 22/6, 2015 at 19:36

100 % correct. But, if you provide the sample data, I can reproduce the output. – Parmesan 22/6, 2015 at 19:38

Your translation is fine, but repeats the varname an awful lot. You could do something with a named list of functions instead, like

DT <- data.table(id=1:3)[,.(x=sample(10,5)),by=id]; DT[,lapply(list(minx = min, meanx = mean, medx = median, sdx = sd),function(f)f(x)),by=id]

– Cynthia 22/6, 2015 at 20:5

@Cynthia I was just thinking that, couldn't think of the lapply way to do it. to be fair this isn't much better! – Archenemy 22/6, 2015 at 20:12

@Archenemy FYI there is a statar package which may helps with translating stata to R – Shamrao 23/6, 2015 at 21:54

@Frank, I'm having a problem with translating stata collapseinto R. Would you help me out? – Saddlebow 20/1, 2020 at 10:19

@Saddlebow If you post a new question, someone can probably answer it (I'm not so frequently on SO and my Stata is a bit rusty). After you create the question, you can post the link in chat.stackoverflow.com/rooms/25312/r-public too, btw. – Cynthia 22/1, 2020 at 3:30

E

1

Your intuition is correct. collapse is the Stata equivalent of R's aggregate function, which produces a new dataset from an input dataset by applying an aggregating function (or multiple aggregating functions, one per variable) to every variable in a dataset.

Here's the output for that Stata command on the example dataset:

. list

     +------------------------------------------------------+
     | lactatio   min_pr~n   avg_pr~n   median~n   sd_pro~n |
     |------------------------------------------------------|
  1. |        1        2.9   3.162632        3.1   .2180803 |
  2. |        2        2.9   3.304688       3.25   .2858736 |
  3. |        3        2.9   3.371429       3.35   .4547672 |
  4. |        4        2.9    3.23125        3.2   .3419917 |
  5. |        5        2.9   3.855556        3.2   1.908606 |
     |------------------------------------------------------|
  6. |        6          3        3.2        3.1   .2645752 |
  7. |        7        3.3       3.65       3.65   .4949748 |
  8. |        8        3.2        3.3        3.3   .1414214 |
     +------------------------------------------------------+

Entomostracan answered 22/6, 2015 at 19:37 Comment(3)

Stata's collapse lets you create multiple aggregates at once, while R's aggregate gives only one (as far as I know). Seems the OP's data.table operation (or similar in dplyr) is a much better analogue. – Cynthia 22/6, 2015 at 20:1

@Cynthia Yes and no. Data.table does many thinks that are each covered by different Stata commands (e.g., drop, keep, collapse, merge, expand, reshape, etc.). The most direct analogue is aggregate, even though aggregate and collapse work slightly differently. The FUN argument to aggregate means you can specify class-specific aggregation functions, so it would be feasible (if verbose) to produce identical functionality using aggregate. – Entomostracan 22/6, 2015 at 20:22

Hm, well, I disagree, as I've never seen aggregate used in that way; and by (and a half-dozen other functions) could probably do the same thing with enough contortion. I wasn't referring to what data.table does broadly, but the syntax above, where it takes the form .(newx = f(x), newy = g(y), newz = h(z)), by=id which is as close to the Stata syntax as one can get in R. – Cynthia 22/6, 2015 at 20:33

P

4

Here is the Stata output for your sample data which is identical to the data.table output:

collapse (min) min_protein = protein /// 
         (mean) avg_protein = protein /// 
         (median) median_protein = protein /// 
         (sd) sd_protein = protein /// 
         if protein > 2.8, by(lactatio)

   lactatio min_protein avg_protein median_protein  sd_protein
     1  2.9 3.162632    3.1 0.2180803
     2  2.9 3.304688    3.25 0.2858736
     3  2.9 3.371429    3.35 0.4547672
     4  2.9 3.23125     3.2 0.3419917
     5  2.9 3.855556    3.2 1.908606
     6  3   3.2         3.1 0.2645752
     7  3.3 3.65        3.65 0.4949748
     8  3.2 3.3         3.3 0.1414214

and here is the data.table output (just to make you sure that I am using the right data)

    library(foreign) #reading Stata data
    data<-read.dta("p004.dta")
    setkey(setDT(data),lactatio)
    setDT(data)[protein>2.8,
                   .(min_protein=min(protein),
                     avg_protein=mean(protein),
                     median_protein=median(protein),
                     sd_protein=sd(protein)),
                   by=lactatio]

   lactatio min_protein avg_protein median_protein sd_protein
1:        1         2.9    3.162632           3.10  0.2180803
2:        2         2.9    3.304688           3.25  0.2858736
3:        3         2.9    3.371429           3.35  0.4547672
4:        4         2.9    3.231250           3.20  0.3419917
5:        5         2.9    3.855556           3.20  1.9086061
6:        6         3.0    3.200000           3.10  0.2645751
7:        7         3.3    3.650000           3.65  0.4949748
8:        8         3.2    3.300000           3.30  0.1414214
>

Parmesan answered 24/6, 2015 at 0:33 Comment(0)

E

1

Your intuition is correct. collapse is the Stata equivalent of R's aggregate function, which produces a new dataset from an input dataset by applying an aggregating function (or multiple aggregating functions, one per variable) to every variable in a dataset.

Here's the output for that Stata command on the example dataset:

. list

     +------------------------------------------------------+
     | lactatio   min_pr~n   avg_pr~n   median~n   sd_pro~n |
     |------------------------------------------------------|
  1. |        1        2.9   3.162632        3.1   .2180803 |
  2. |        2        2.9   3.304688       3.25   .2858736 |
  3. |        3        2.9   3.371429       3.35   .4547672 |
  4. |        4        2.9    3.23125        3.2   .3419917 |
  5. |        5        2.9   3.855556        3.2   1.908606 |
     |------------------------------------------------------|
  6. |        6          3        3.2        3.1   .2645752 |
  7. |        7        3.3       3.65       3.65   .4949748 |
  8. |        8        3.2        3.3        3.3   .1414214 |
     +------------------------------------------------------+

Entomostracan answered 22/6, 2015 at 19:37 Comment(3)

Stata's collapse lets you create multiple aggregates at once, while R's aggregate gives only one (as far as I know). Seems the OP's data.table operation (or similar in dplyr) is a much better analogue. – Cynthia 22/6, 2015 at 20:1

@Cynthia Yes and no. Data.table does many thinks that are each covered by different Stata commands (e.g., drop, keep, collapse, merge, expand, reshape, etc.). The most direct analogue is aggregate, even though aggregate and collapse work slightly differently. The FUN argument to aggregate means you can specify class-specific aggregation functions, so it would be feasible (if verbose) to produce identical functionality using aggregate. – Entomostracan 22/6, 2015 at 20:22

Hm, well, I disagree, as I've never seen aggregate used in that way; and by (and a half-dozen other functions) could probably do the same thing with enough contortion. I wasn't referring to what data.table does broadly, but the syntax above, where it takes the form .(newx = f(x), newy = g(y), newz = h(z)), by=id which is as close to the Stata syntax as one can get in R. – Cynthia 22/6, 2015 at 20:33

D

1

stata.collapse<-function(data,vars,newnames,stat,by) {
m=match(by,names(data))
data1=data[m]
x=length(by)
l=length(stat)

for (i in 1:l){
   nn=aggregate(data[vars[i]],by=data1,stat[i],na.rm=TRUE) 
   d=names(nn)
   d[ncol(data1)+1]<-newnames[i]
    names(nn)<-d
     xx1=nn[1:x]
     xx=nn[-(1:x)]
         if (i>1) {
         x2=cbind(x2,xx)
         }else {
        x2=nn
}
}

return(x2)
}

To run call this function like this

  h=stata.collapse(roster,c("idcode1","age","age") , 
    c("hhsize","meanage","maxage"),c("max","mean","max"),c("psu","hhno"))

Dunt answered 8/4, 2019 at 11:2 Comment(0)

Recommended topics

Hot tags