Reshape data frame from wide to panel with multiple variables and some time invariant
Asked Answered
S

2

8

This is a basic problem in data analysis which Stata deals with in one step.

Create a wide data frame with time invariant data (x0) and time varying data for years 2000 and 2005 (x1,x2):

d1 <- data.frame(subject = c("id1", "id2"),  
x0 = c("male", "female"),  
x1_2000 = 1:2,   
x1_2005 = 5:6,  
x2_2000 = 1:2,  
x2_2005 = 5:6    
) 

s.t.

subject x0 x1_2000 x1_2005 x2_2000 x2_2005  
1     id1 male         1       5       1       5  
2     id2 female       2       6       2       6  

I want to shape it like a panel so data looks like this:

        subject     x0 time x1 x2
1     id1   male 2000  1  1
2     id2 female 2000  2  2
3     id1   male 2005  5  5
4     id2 female 2005  6  6

I can do this with reshape s.t.

d2 <-reshape(d1, 
idvar="subject",
varying=list(c("x1_2000","x1_2005"),
    c("x2_2000","x2_2005")),
    v.names=c("x1","x2"),
    times = c(2000,2005),
    direction = "long",
    sep= "_")

My main concern is that when you have dozens of variables the above command gets very long. In stata one would simply type:

reshape long x1 x2, i(subject) j(year)

Is there such a simple solution in R?

Siebert answered 13/3, 2011 at 2:46 Comment(0)
C
12

reshape can guess many of its arguments. In this case it's sufficient to specify the following. No packages are used.

 reshape(d1, dir = "long", varying = 3:6, sep = "_")

giving:

       subject     x0 time x1 x2 id
1.2000     id1   male 2000  1  1  1
2.2000     id2 female 2000  2  2  2
1.2005     id1   male 2005  5  5  1
2.2005     id2 female 2005  6  6  2
Cyclist answered 13/3, 2011 at 5:7 Comment(2)
Nice! But what happens when variables are named sample_1_2000 etc... Can one be more spjhisticated with the sep= option?Siebert
@Fred, Use the split argument in place of sep, i.e. reshape(d1, dir = "long", varying = 3:6, split = list(regexp = "_2", include = TRUE)) or reduce this case to the one in the question, i.e. reshape(setNames(d1, sub("sample_", "", names(d1))), dir = "long", varying = 3:6, sep = "_")Cyclist
A
4

here is a brief example using reshape2 package:

library(reshape2)
library(stringr)

# it is always useful to start with melt
d2 <- melt(d1, id=c("subject", "x0"))

# redefine the time and x1, x2, ... separately
d2 <- transform(d2, time = str_replace(variable, "^.*_", ""),
                    variable = str_replace(variable, "_.*$", ""))

# finally, cast as you want
d3 <- dcast(d2, subject+x0+time~variable)

now you don't need even specifying x1 and x2.
This code works if variables increase:

> d1 <- data.frame(subject = c("id1", "id2"), x0 = c("male", "female"),
+ x1_2000 = 1:2,
+ x1_2005 = 5:6,
+ x2_2000 = 1:2,
+ x2_2005 = 5:6,
+ x3_2000 = 1:2,
+ x3_2005 = 5:6,
+ x4_2000 = 1:2,
+ x4_2005 = 5:6
+ ) 
> 
> d2 <- melt(d1, id=c("subject", "x0"))
> d2 <- transform(d2, time = str_replace(variable, "^.*_", ""),
+                     variable = str_replace(variable, "_.*$", ""))
> 
> d3 <- dcast(d2, subject+x0+time~variable)
> 
> d3
  subject     x0 time x1 x2 x3 x4
1     id1   male 2000  1  1  1  1
2     id1   male 2005  5  5  5  5
3     id2 female 2000  2  2  2  2
4     id2 female 2005  6  6  6  6
Autoroute answered 13/3, 2011 at 3:6 Comment(3)
Thanks, this is useful. Not very clear what transform is doing (help file not very helpful) nor how to interpret "^.*_" and "_.*$". I ask because some vars are actually named "sample_1_2000" etc...Siebert
Maybe this is more intuitive with year data named "sample_1_2000" and so on: temp1 <- transform(temp, time = str_sub(variable, -4), variable = str_sub(variable, 1,str_length(variable)-5))Siebert
@Siebert it depends on the format of variable name. if the length of (at least some part of) character is fixed, your way is easier. otherwise, regular expression is more flexible.Autoroute

© 2022 - 2024 — McMap. All rights reserved.