How to use the spread function properly in tidyr

Asked 8/1, 2016 at 19:47 Answered 14/1, 2016 at 23:12

How do I change the following table from:

Type    Name    Answer     n
TypeA   Apple   Yes        5
TypeA   Apple   No        10
TypeA   Apple   DK         8
TypeA   Apple   NA        20
TypeA   Orange  Yes        6
TypeA   Orange  No        11
TypeA   Orange  DK         8
TypeA   Orange  NA        23

Change to:

Type    Name    Yes   No   DK   NA  
TypeA   Apple   5     10   8    20
TypeA   Orange  6     11   8    23

I used the following codes to get the first table.

df_1 <- 
  df %>% 
  group_by(Type, Name, Answer) %>% 
  tally()

Then I tried to use the spread command to get to the 2nd table, but I got the following error message:

"Error: All columns must be named"

df_2 <- spread(df_1, Answer)

Trochophore answered 8/1, 2016 at 19:47 Comment(0)

Following on the comment from ayk, I'm providing an example. It looks to me like when you have a data_frame with a column of either a factor or character class that has values of NA, this cannot be spread without either removing them or re-classifying the data. This is specific to a data_frame (note the dplyr class with the underscore in the name), as this works in my example when you have values of NA in a data.frame. For example, a slightly modified version of the example above:

Here is the dataframe

library(dplyr)
library(tidyr)
df_1 <- data_frame(Type = c("TypeA", "TypeA", "TypeB", "TypeB"),
                   Answer = c("Yes", "No", NA, "No"),
                   n = 1:4)
df_1

Which gives a data_frame that looks like this

Source: local data frame [4 x 3]

   Type Answer     n
  (chr)  (chr) (int)
1 TypeA    Yes     1
2 TypeA     No     2
3 TypeB     NA     3
4 TypeB     No     4

Then, when we try to tidy it, we get an error message:

df_1 %>% spread(key=Answer, value=n)
Error: All columns must be named

But if we remove the NA's then it 'works':

df_1 %>%
    filter(!is.na(Answer)) %>%
    spread(key=Answer, value=n)
Source: local data frame [2 x 3]

   Type    No   Yes
  (chr) (int) (int)
1 TypeA     2     1
2 TypeB     4    NA

However, removing the NAs may not give you the desired result: i.e. you might want those to be included in your tidied table. You could modify the data directly to change the NAs to a more descriptive value. Alternatively, you could change your data to a data.frame and then it spreads just fine:

as.data.frame(df_1) %>% spread(key=Answer, value=n)
   Type No Yes NA
1 TypeA  2   1 NA
2 TypeB  4  NA  3

Sazerac answered 14/1, 2016 at 23:12 Comment(1)

This is such good information to know. Thanks for sharing that knowledge. 8-) – Trochophore 18/1, 2016 at 16:38

I think only tidyr is needed to get from df_1 to df_2.

library(magrittr)
df_1 <- read.csv(text="Type,Name,Answer,n\nTypeA,Apple,Yes,5\nTypeA,Apple,No,10\nTypeA,Apple,DK,8\nTypeA,Apple,NA,20\nTypeA,Orange,Yes,6\nTypeA,Orange,No,11\nTypeA,Orange,DK,8\nTypeA,Orange,NA,23", stringsAsFactors=F)

df_2 <- df_1 %>% 
  tidyr::spread(key=Answer, value=n)

Output:

   Type   Name DK No Yes NA
1 TypeA  Apple  8 10   5 20
2 TypeA Orange  8 11   6 23

Kneedeep answered 9/1, 2016 at 19:32 Comment(4)

This method works in the code shown above. However, when the n column on the first table (i.e, df_1) is system generated using the tally function. This method does not work. It still gives me the Error: All columns must be named. The code I used to generate the n column using tally is listed above. Looks like the spread does not recognize the n column in this case. Any other suggestions? – Trochophore 11/1, 2016 at 19:24

Idea 1: use dplyr::ungroup() at the end of the creation of df_1. – Kneedeep 11/1, 2016 at 20:1

Idea 2: cast with as.data.frame(). Idea 3: output with dput() so we can see the structure better. #5963769. Otherwise, it's difficult to see how you got df_1. – Kneedeep 11/1, 2016 at 20:2

Thanks wibeasley. I finally figured out why I got the message "All columns must be named." This is because in my actual dataset, I have NA inside and when I do the spread, it does not have a name for the column. I did not include an NA in my example dataset, so that's why you don't see my problem. But anyways, it's great that the problem is now resolved. Thanks again for your help with this. 8-) – Trochophore 12/1, 2016 at 17:34

Recommended topics

Hot tags