How spread() in tidyr handles factor levels

I was manipulating my data and found that I did something wrong at some point in the process. When I explored the issue, the problem came down to the following behavior of spread() in the tidyr package.

Here's a demonstrative example. Let us say we have a data frame like the following.

> d <- data.frame(factor1 = rep(LETTERS[1:3], each = 3),
+   factor2 = rep(paste0("level", c(1, 2, 10)), 3),
+   num = 1:9
+ )  
> d
  factor1 factor2 num
1       A  level1   1
2       A  level2   2
3       A level10   3
4       B  level1   4
5       B  level2   5
6       B level10   6
7       C  level1   7
8       C  level2   8
9       C level10   9

What I wanted to do was convert this long-format data frame into wide format, and I thought spread() was the way to go. The result, however, was not what I expected.

> spread(d, factor2, num)
  factor1 level1 level2 level10
1       A      1      3       2
2       B      4      6       5
3       C      7      9       8

If factor1 is "A" and factor2 is "level2", the value should be 2, but the wide-format result shows 3. Apparently, the num values are arranged by the alphabetical order of factor2 (level1, level10, level2) when they are placed into the wide format, while the column labels keep the order in which they appear in the original data frame (level1, level2, level10).

Could anyone explain why this happens (and/or where I can find relevant information)?

Northerner asked 6/10, 2014 at 17:42 Comment(3)
Using the devel version of tidyr, the colnames match the numbers, but the order of columns is level1, level10, level2. That also seems to be solved by d$factor2 <- factor(d$factor2, levels=c('level1', 'level2', 'level10')); spread(d, factor2, num) - Doble
I have tidyr version 0.1 and I got the correct result using your code. Maybe you should restart R and see if that changes things? - Mnemosyne
It seems I was using the development version. When I installed the current one from CRAN, it worked fine. Thank you @Doble for pointing it out. - Northerner

Using the data provided, I got a different result:

> packageVersion("tidyr")
[1] ‘0.1’
> spread(d, factor2, num)
  factor1 level1 level10 level2
1       A      1       3      2
2       B      4       6      5
3       C      7       9      8
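
If you want the columns in the natural order (level1, level2, level10) rather than the alphabetical one above, the workaround suggested in the comments is to set the factor levels explicitly before spreading. Below is a minimal sketch of that idea, not a tested fix for any particular tidyr version. It assumes factor2 is a factor (hence stringsAsFactors = TRUE, which newer versions of R no longer apply by default), and the names default_levels, explicit_levels and wide are only illustrative. The spread() call is guarded so the script still runs if tidyr is not installed.

```r
# Rebuild the example data; stringsAsFactors = TRUE makes factor1/factor2 factors
d <- data.frame(
  factor1 = rep(LETTERS[1:3], each = 3),
  factor2 = rep(paste0("level", c(1, 2, 10)), 3),
  num = 1:9,
  stringsAsFactors = TRUE
)

# Default factor levels are sorted as strings, so "level10" comes before "level2"
default_levels <- levels(d$factor2)
print(default_levels)

# Set the levels explicitly to get the natural order
d$factor2 <- factor(d$factor2, levels = c("level1", "level2", "level10"))
explicit_levels <- levels(d$factor2)
print(explicit_levels)

# Only call spread() if tidyr is available
if (requireNamespace("tidyr", quietly = TRUE)) {
  wide <- tidyr::spread(d, factor2, num)
  print(wide)
}
```

Whatever the column order turns out to be, the thing to check is that each value sits under its own label, e.g. that row A has 2 under level2.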
Contingence answered 6/10, 2014 at 17:59 Comment(1)
I checked the version of the package before I posted, and since it said 0.1 I thought it was the latest version. But as @Doble mentioned, I was using the development version that I had downloaded from GitHub. When I installed the package from CRAN, it worked correctly. Thanks! - Northerner
