R - Copy values within a group
Asked Answered
H

2

6

I have a dataframe where I have the total number of points someone scored in the past 3 years (2016, 2017, 2018), but also columns with their number of points per year.

My dataframe looks like this:

myDF <- data.frame(ID =c(1,1,1,2,2,3,4),
 Dates= c("2016", "2017", "2018", "2016", "2017", "2018", "2016"),
 Total_Points = c(5, 5, 5, 4, 4, 2, 3),
 Points2016 = c(3, NA, NA, 2, NA, NA, 3),
 Points2017 = c(NA,1,NA,NA,2,NA,NA),
 Points2018= c(NA,NA,1, NA, NA, 2, NA))

The problem is that I would like to copy the values of columns "Points2016", "Points2017" and "Points2017" for every group so that their entries look the same.

I'm not sure the explanation was clear so this would be my expected output:

myDF_final <- data.frame(ID =c(1,1,1,2,2,3,4),
               Dates= c("2016", "2017", "2018", "2016", "2017", "2018", "2016"),
               Total_Points = c(5, 5, 5, 4, 4, 2, 3),
               Points2016 = c(3, 3, 3, 2, 2, NA, 3),
               Points2017 = c(1,1,1,2,2,NA,NA),
               Points2018= c(1,1,1, NA, NA, 2, NA))

Basically, I would like to have the same values for the columns "Points201X" for every ID.

Hawkbill answered 13/3, 2018 at 12:20 Comment(0)
A
15

I think you could just fill by the ID group in both directions. With dplyr and tidyr we could do:

library(dplyr)
library(tidyr)

myDF %>% 
  group_by(ID) %>% 
  fill(Points2016, Points2017, Points2018) %>% 
  fill(Points2016, Points2017, Points2018, .direction = "up")

Returns:

  ID Dates Total_Points Points2016 Points2017 Points2018
1  1  2016            5          3          1          1
2  1  2017            5          3          1          1
3  1  2018            5          3          1          1
4  2  2016            4          2          2         NA
5  2  2017            4          2          2         NA
6  3  2018            2         NA         NA          2
7  4  2016            3          3         NA         NA

Also, if you have a bunch of years say 1970 - 2018, you could do something like:

myDF %>% 
  gather(points_year, points, -c(ID, Dates, Total_Points)) %>% 
  group_by(ID, points_year) %>% 
  fill(points) %>% 
  fill(points, .direction = "up") %>% 
  spread(points_year, points)

So as to avoid typing out every year. However, this involves gathering and spreading the data which might be unnecessary assuming the variables we need to fill follow a consistent naming convention. In this case, there is a consistent naming convention and we could use the tidyselect backend of dplyr to fill all variables that start with the word "Points":

myDF %>% 
  group_by(ID) %>% 
  fill(starts_with("Points"), .direction = "down") %>% 
  fill(starts_with("Points"), .direction = "up")

Alternatively, this seems to work with data.table and zoo:

library(data.table)
library(zoo)

dt <- as.data.table(myDF)

dt <- dt[, names(dt)[4:6] := lapply(.SD, function(x) na.locf0(x)), by = ID, .SDcols = 4:6]
dt <- dt[, names(dt)[4:6] := lapply(.SD, function(x) na.locf0(x, fromLast = TRUE)), by = ID, .SDcols = 4:6]

This one liner seems to do it all in one go as well:

dt[, names(dt)[4:6] := lapply(.SD, function(x) na.locf(x)), by = ID, .SDcols = 4:6]
   ID Dates Total_Points Points2016 Points2017 Points2018
1:  1  2016            5          3          1          1
2:  1  2017            5          3          1          1
3:  1  2018            5          3          1          1
4:  2  2016            4          2          2         NA
5:  2  2017            4          2          2         NA
6:  3  2018            2         NA         NA          2
7:  4  2016            3          3         NA         NA
Arse answered 13/3, 2018 at 12:28 Comment(6)
What is "gather" doing? What are the columns "points_year" and "points"? Thanks a lotHawkbill
gather is taking a wide dataset and making it long by collapsing it into key-value pairs. points_year is the name I'm assigning to the key variable and points is the name I'm assigning to the value variable. See tidyr.tidyverse.org/reference/gather.htmlArse
Thanks! I will try that tonight, my dataset is quite big and it says that it will take 4h to compute :)Hawkbill
Happy to help. You might want to try the data.table solution, should be faster. I’m not a data.table wizard though :)Arse
I find it strange that there is not a more "elegant" solution to this in dplyr like .direction = "any".Bodleian
I'm not sure how "any" would work. It would still have to pick its initial direction and specifying makes it explicit.Arse
J
2

You could also use zoo::na.locf0 to fill NAs from the top & bottom.

library(tidyverse);
library(zoo);
myDF %>%
    group_by(ID) %>%
    mutate_at(vars(contains("Points20")), funs(na.locf0(., fromLast = F))) %>%
    mutate_at(vars(contains("Points20")), funs(na.locf0(., fromLast = T)))
## A tibble: 7 x 6
## Groups:   ID [4]
#     ID Dates Total_Points Points2016 Points2017 Points2018
#  <dbl> <fct>        <dbl>      <dbl>      <dbl>      <dbl>
#1    1. 2016            5.         3.         1.         1.
#2    1. 2017            5.         3.         1.         1.
#3    1. 2018            5.         3.         1.         1.
#4    2. 2016            4.         2.         2.        NA
#5    2. 2017            4.         2.         2.        NA
#6    3. 2018            2.        NA         NA          2.
#7    4. 2016            3.         3.        NA         NA
Jehial answered 13/3, 2018 at 12:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.