Searching for a straightforward way to do Stata's bysort tasks in R

R

2

13

I'm very new to R, and have been struggling for a couple of days to do something that Stata makes quite straightforward. A friend has given me a relatively complicated answer to this question, but I was wondering if there was a simple way to do the following.

Suppose I have a two variable dataframe, organized as below:

category    var1 
a            1     
a            2    
a            3     
b            4     
b            6  
b            8   
b           10    
c           11     
c           14      
c           17

I want to generate five additional variables, each of which should be inserted into this same dataframe: var2, var3, var4, var5, and var6

(1) var2 is a dummy variable, which takes the value 1 for the first observation in every category (i.e., each of the three groups defined by category), and 0 otherwise.

(2) var3 is a dummy variable, which takes the value 1 for the last observation in every category, 0 otherwise.

(3) var4 counts how many observations are in each of the group to which any particular observation belongs (i.e., so 3 for category a, 4 for category b, and 3 for category c)

(4) var5 records the difference between each observation in var1 and the observation above it

(5) var6 records the difference between each observation in var1 and the observation above it, but only within the groups defined by category.

I am reasonably well acquainted with Stata, and I find all of the above are not difficult to do, using the bysort prefix command. For example, var1 is easily generated by bysort category: gen var1=1 if _n==1. But I've been tearing my hair out for the last day trying to figure out how to do them using R. I'm sure there are several solutions (my friend's involved the ddplyr package, which seemed a step above my paygrade). Is there nothing as easy as bysort?

The final dataset should look something like this:

category    var1     var2     var3     var4     var5    var6   
a            1        1        0        3       n/a      n/a
a            2        0        0        3        1        1
a            3        0        1        3        1        1
b            4        1        0        4        1       n/a
b            6        0        0        4        2        2
b            8        0        0        4        2        2
b           10        0        1        4        2        2
c           11        1        0        3        1       n/a
c           14        0        0        3        3        3
c           17        0        1        3        3        3

Thanks very much for any suggestions, in advance. Sorry for the rookie question; I'm sure this is answered somewhere else, but I wasn't able to find it despite hours of searching.

Riddick answered 13/8, 2014 at 2:18 Comment(2)

Meta-comment as an experienced Stata user, and a very occasional R user: I assume that there is always a smart way to do something in R, but there is no rule that it will look anything like the Stata solution. – Knobkerrie 13/8, 2014 at 7:33

In Stata bysort category: gen var1 = 1 if _n==1 does not produce a 0 or 1 indicator variable (dummy in your terminology), but a missing or 1 indicator. For that you need bysort category: gen var1 = _n==1. – Knobkerrie 13/8, 2014 at 7:37

S

12

dat <- read.table(header = TRUE, 
           text = 
'category    var1 
a            1     
a            2    
a            3     
b            4     
b            6  
b            8   
b           10    
c           11     
c           14      
c           17')


(dat <- within(dat, {
  var6 <- ave(var1, category, FUN = function(x) c(NA, diff(x)))
  var5 <- c(NA, diff(var1))
  var4 <- ave(var1, category, FUN = length)
  var3 <- rev(!duplicated(rev(category))) * 1
  var2 <- (!duplicated(category)) * 1
}))

#    category var1 var2 var3 var4 var5 var6
# 1         a    1    1    0    3   NA   NA
# 2         a    2    0    0    3    1    1
# 3         a    3    0    1    3    1    1
# 4         b    4    1    0    4    1   NA
# 5         b    6    0    0    4    2    2
# 6         b    8    0    0    4    2    2
# 7         b   10    0    1    4    2    2
# 8         c   11    1    0    3    1   NA
# 9         c   14    0    0    3    3    3
# 10        c   17    0    1    3    3    3

Sinasinai answered 13/8, 2014 at 2:38 Comment(0)

B

9

An answer using dplyr

library(dplyr)
dat <- dat %>%
 group_by(category) %>%
 mutate(var2 = ifelse(row_number() == 1, 1, 0))%>%
 mutate(var3 = ifelse(row_number() == n(), 1, 0)) %>%
 mutate(var4 = n()) %>%
 mutate(var6 = lag(var1, 1)) %>%
 ungroup() %>%
 mutate(var5 = lag(var1, 1))

Breskin answered 6/6, 2015 at 16:15 Comment(0)

Recommended topics

Hot tags