R - Copy values within a group

Asked 13/3, 2018 at 12:20 Answered 13/3, 2018 at 12:41

I have a dataframe where I have the total number of points someone scored in the past 3 years (2016, 2017, 2018), but also columns with their number of points per year.

My dataframe looks like this:

myDF <- data.frame(ID =c(1,1,1,2,2,3,4),
 Dates= c("2016", "2017", "2018", "2016", "2017", "2018", "2016"),
 Total_Points = c(5, 5, 5, 4, 4, 2, 3),
 Points2016 = c(3, NA, NA, 2, NA, NA, 3),
 Points2017 = c(NA,1,NA,NA,2,NA,NA),
 Points2018= c(NA,NA,1, NA, NA, 2, NA))

The problem is that I would like to copy the values of columns "Points2016", "Points2017" and "Points2017" for every group so that their entries look the same.

I'm not sure the explanation was clear so this would be my expected output:

myDF_final <- data.frame(ID =c(1,1,1,2,2,3,4),
               Dates= c("2016", "2017", "2018", "2016", "2017", "2018", "2016"),
               Total_Points = c(5, 5, 5, 4, 4, 2, 3),
               Points2016 = c(3, 3, 3, 2, 2, NA, 3),
               Points2017 = c(1,1,1,2,2,NA,NA),
               Points2018= c(1,1,1, NA, NA, 2, NA))

Basically, I would like to have the same values for the columns "Points201X" for every ID.

Hawkbill answered 13/3, 2018 at 12:20 Comment(0)

I think you could just fill by the ID group in both directions. With dplyr and tidyr we could do:

library(dplyr)
library(tidyr)

myDF %>% 
  group_by(ID) %>% 
  fill(Points2016, Points2017, Points2018) %>% 
  fill(Points2016, Points2017, Points2018, .direction = "up")

Returns:

  ID Dates Total_Points Points2016 Points2017 Points2018
1  1  2016            5          3          1          1
2  1  2017            5          3          1          1
3  1  2018            5          3          1          1
4  2  2016            4          2          2         NA
5  2  2017            4          2          2         NA
6  3  2018            2         NA         NA          2
7  4  2016            3          3         NA         NA

Also, if you have a bunch of years say 1970 - 2018, you could do something like:

myDF %>% 
  gather(points_year, points, -c(ID, Dates, Total_Points)) %>% 
  group_by(ID, points_year) %>% 
  fill(points) %>% 
  fill(points, .direction = "up") %>% 
  spread(points_year, points)

So as to avoid typing out every year. However, this involves gathering and spreading the data which might be unnecessary assuming the variables we need to fill follow a consistent naming convention. In this case, there is a consistent naming convention and we could use the tidyselect backend of dplyr to fill all variables that start with the word "Points":

myDF %>% 
  group_by(ID) %>% 
  fill(starts_with("Points"), .direction = "down") %>% 
  fill(starts_with("Points"), .direction = "up")

Alternatively, this seems to work with data.table and zoo:

library(data.table)
library(zoo)

dt <- as.data.table(myDF)

dt <- dt[, names(dt)[4:6] := lapply(.SD, function(x) na.locf0(x)), by = ID, .SDcols = 4:6]
dt <- dt[, names(dt)[4:6] := lapply(.SD, function(x) na.locf0(x, fromLast = TRUE)), by = ID, .SDcols = 4:6]

This one liner seems to do it all in one go as well:

dt[, names(dt)[4:6] := lapply(.SD, function(x) na.locf(x)), by = ID, .SDcols = 4:6]

   ID Dates Total_Points Points2016 Points2017 Points2018
1:  1  2016            5          3          1          1
2:  1  2017            5          3          1          1
3:  1  2018            5          3          1          1
4:  2  2016            4          2          2         NA
5:  2  2017            4          2          2         NA
6:  3  2018            2         NA         NA          2
7:  4  2016            3          3         NA         NA

Arse answered 13/3, 2018 at 12:28 Comment(6)

What is "gather" doing? What are the columns "points_year" and "points"? Thanks a lot – Hawkbill 13/3, 2018 at 12:43

gather is taking a wide dataset and making it long by collapsing it into key-value pairs. points_year is the name I'm assigning to the key variable and points is the name I'm assigning to the value variable. See tidyr.tidyverse.org/reference/gather.html – Arse 13/3, 2018 at 12:46

Thanks! I will try that tonight, my dataset is quite big and it says that it will take 4h to compute :) – Hawkbill 13/3, 2018 at 12:59

Happy to help. You might want to try the data.table solution, should be faster. I’m not a data.table wizard though :) – Arse 13/3, 2018 at 13:58

I find it strange that there is not a more "elegant" solution to this in dplyr like .direction = "any". – Bodleian 13/2, 2019 at 14:23

I'm not sure how "any" would work. It would still have to pick its initial direction and specifying makes it explicit. – Arse 13/2, 2019 at 16:27

You could also use zoo::na.locf0 to fill NAs from the top & bottom.

library(tidyverse);
library(zoo);
myDF %>%
    group_by(ID) %>%
    mutate_at(vars(contains("Points20")), funs(na.locf0(., fromLast = F))) %>%
    mutate_at(vars(contains("Points20")), funs(na.locf0(., fromLast = T)))
## A tibble: 7 x 6
## Groups:   ID [4]
#     ID Dates Total_Points Points2016 Points2017 Points2018
#  <dbl> <fct>        <dbl>      <dbl>      <dbl>      <dbl>
#1    1. 2016            5.         3.         1.         1.
#2    1. 2017            5.         3.         1.         1.
#3    1. 2018            5.         3.         1.         1.
#4    2. 2016            4.         2.         2.        NA
#5    2. 2017            4.         2.         2.        NA
#6    3. 2018            2.        NA         NA          2.
#7    4. 2016            3.         3.        NA         NA

Jehial answered 13/3, 2018 at 12:41 Comment(0)

Recommended topics

Hot tags