Spread with duplicate identifiers for rows [duplicate]
Asked Answered
M

1

7

There has been questions on this topic before here, but I am still struggling with spreading this. I would like so each state to have its own column of temperatures values.

Here is a dput() of my data. I'll call it df

structure(list(date = c("2018-01-21", "2018-01-21", "2018-01-20", 
"2018-01-20", "2018-01-19", "2018-01-19", "2018-01-18", "2018-01-18", 
"2018-01-17", "2018-01-17", "2018-01-16", "2018-01-16", "2018-01-15", 
"2018-01-15", "2018-01-14", "2018-01-14", "2018-01-12", "2018-01-12", 
"2018-01-11", "2018-01-11", "2018-01-10", "2018-01-10", "2018-01-09", 
"2018-01-09", "2018-01-08", "2018-01-08", "2018-01-07", "2018-01-07", 
"2018-01-06", "2018-01-06", "2018-01-05", "2018-01-05", "2018-01-04", 
"2018-01-04", "2018-01-03", "2018-01-03", "2018-01-03", "2018-01-03", 
"2018-01-02", "2018-01-02"), tmin = c(24, 31, 31, 29, 44, 17, 
32, 7, 31, 7, 31, 6, 30, 13, 30, 1, 43, 20, 33, 52, 42, 29, 30, 
29, 26, 32, 33, -2, 29, 0, 23, 3, 19, 11, NA, -3, 22, -3, 24, 
-4), state = c("UT", "OH", "UT", "OH", "UT", "OH", "UT", "OH", 
"UT", "OH", "UT", "OH", "UT", "OH", "UT", "OH", "UT", "OH", "UT", 
"OH", "UT", "OH", "UT", "OH", "UT", "OH", "UT", "OH", "UT", "OH", 
"UT", "OH", "UT", "OH", "UT", "OH", "UT", "OH", "UT", "OH")), class = "data.frame", row.names = c(NA, 
-40L), .Names = c("date", "tmin", "state"))

The code I run is

df %>% spread(state,tmin)

which I expected to give me the following format

date UT  OH
... ... ...

but I get the error message

Error: Duplicate identifiers for rows (36, 38), (35, 37)

I have tried a few different things. One thing I tried was grouping by date. I was thinking that rows of the same date were causing a problem for spread. I also tried making a new row with add_rownames() then using spread(state,tmin), but that also failed to solve the issue.

Milda answered 22/1, 2018 at 14:17 Comment(2)
Create a sequence column by group and then spread i.e. df %>% group_by(state) %>% mutate(ind = row_number()) %>% spread(state, tmin)Drolet
Adding a sequence column makes spreading possible but doesn't resolve the underlying data quality issue.Dipody
D
9

In order for spread to work as intended, the resulting data frame must have uniquely identified rows and columns. In the case of your data, the "date" column is the only unique identifier after spreading. However, rows 36 and 38 are identical:

         date tmin state
36 2018-01-03   -3    OH
38 2018-01-03   -3    OH

This puts tidyr in the impossible position of trying to resolve two data points to the same row and column. In addition, rows 35 and 37 both have the same date and state, once again creating the impossible situation of placing two different values into the same position in the new data frame:

         date tmin state
35 2018-01-03   NA    UT
37 2018-01-03   22    UT

The following data cleanup will make spreading possible:

df %>% 
  filter(!is.na(tmin)) %>% # remove NA values
  unique %>% # remove duplicated rows
  spread(state, tmin)

         date OH UT
1  2018-01-02 -4 24
2  2018-01-03 -3 22
3  2018-01-04 11 19
4  2018-01-05  3 23
5  2018-01-06  0 29
...
Dipody answered 22/1, 2018 at 14:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.