Interpolating NA's by group using dplyr on multiple columns
Asked Answered
V

4

5

I have a data frame like this:

> head(df1)
  iso year var1 var2 var3
1 XXX 2005  165   29 2151
2 XXX 2006  160   21 2139
3 XXX 2007   NA   NA   NA
4 XXX 2008  184    9 3640
5 XXX 2009   NA   NA   NA
6 YYY 2005  206  461 8049 

I want to replace the NA's of intermittent years based on the years around it and the NA's in years at the beginning and end of the range by carrying backward and forward the outer most non-NA observation.

My code to do this for one column is:

df1 %>% 
 group_by(iso) %>%
 mutate(var1 = na.approx(var1, na.rm = FALSE, rule = 1)) %>%
 mutate(var1 = na.locf(var1, na.rm = FALSE)) %>%
 mutate(var1 = na.locf(var1, na.rm = FALSE, fromLast = TRUE))

This works, so now I want to do this for all columns in one go (there are more than 3 and they are not numbered like in my example). This I pieced together from the answers to this question. I omitted the two calls to na.locf.

columnnames <- c("var1, "var2", "var3")
df1 %>%
 group_by(iso) %>%
 mutate_at(.vars = vars(columnnames), .funs = funs(na.approx(., na.rm = FALSE, rule = 1)))

This throws me an error and a warning:

Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate In addition: Warning message: In xy.coords(x, y, setLab = FALSE) : NAs introduced by coercion

I think I understand the error, but I did not get it when I used the first piece of code on var1. The warning I don't follow. How cal I apply my code to all columns in my data frame? I also tried putting evertything in a loop, looping over columnnames but that didn't work either (and it it probably not the best way to go about this).

Vinita answered 13/7, 2018 at 18:46 Comment(0)
B
2

You can re-write your code using mutate_at so that conversion can be done in one go as:

library(dplyr)
library(zoo)


df %>% 
  group_by(iso) %>%
  mutate_at(vars(starts_with("var")), 
            funs(na.locf(na.locf(na.approx(., na.rm = FALSE, rule = 1),na.rm=FALSE),
                                                              fromLast=TRUE)))


# # A tibble: 6 x 5
# # Groups: iso [2]
# iso    year  var1   var2  var3
# <chr> <int> <dbl>  <dbl> <dbl>
# 1 XXX    2005   165  29.0   2151
# 2 XXX    2006   160  21.0   2139
# 3 XXX    2007   172  15.0   2890
# 4 XXX    2008   184   9.00  3640
# 5 XXX    2009   184   9.00  3640
# 6 YYY    2005   206 461     8049
# 

Data:

df <- read.table(text=
"iso year var1 var2 var3
1 XXX 2005  165   29 2151
2 XXX 2006  160   21 2139
3 XXX 2007   NA   NA   NA
4 XXX 2008  184    9 3640
5 XXX 2009   NA   NA   NA
6 YYY 2005  206  461 8049",
header = TRUE, stringsAsFactors = FALSE)
Bacillary answered 13/7, 2018 at 19:1 Comment(2)
@Moody_Mudskipper No issue. Actually, that give me a chance to correct mutate_at:-)Bacillary
I am sorry if I was not clear, but my actual column names are not var1, var2 etc. so I cannot reference them like you suggested. Using @www 's var referencing and your nesting of functions made it work. Thanks!Vinita
P
3

Use na.approx with method = "constant" (same as na.locf) and rule = 2 (means extend nearest value to leading and trailing NAs). If you want the NAs to be linearly interpolated instead remove the method="constant" argument.

df1 %>%
  group_by(iso) %>%
  mutate_at(vars(-iso), funs(na.approx(., method = "constant", rule = 2))) %>%
  ungroup

giving:

# A tibble: 6 x 5
  iso    year  var1  var2  var3
  <fct> <dbl> <dbl> <dbl> <dbl>
1 XXX    2005   165    29  2151
2 XXX    2006   160    21  2139
3 XXX    2007   160    21  2139
4 XXX    2008   184     9  3640
5 XXX    2009   184     9  3640
6 YYY    2005   206   461  8049

Note

df1 in reproducible form is:

df1 <- 
structure(list(iso = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("XXX", 
"YYY"), class = "factor"), year = c(2005L, 2006L, 2007L, 2008L, 
2009L, 2005L), var1 = c(165L, 160L, NA, 184L, NA, 206L), var2 = c(29L, 
21L, NA, 9L, NA, 461L), var3 = c(2151L, 2139L, NA, 3640L, NA, 
8049L)), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6"))
Pact answered 13/7, 2018 at 22:2 Comment(4)
So if I understand correctly this interpolates the data for 2007 for XXX as the average of 2006 and 2008, but applies the 2008 data to 2009?Vinita
No. As explained in the answer wth method="constant" it works like na.locf. It fills in the NAs with the most recent prior non-NA like na.locf does but also it fills in leading NAs with the first non-NA. I have added the output to the answer. Look at ?na.approx and for the method and rule arguments look at ?approx.Pact
Okay, thanks. That is not what I was looking for here, but it is good to know.Vinita
The question uses na.locf suggesting that that functionality was what you wanted to use. but if you want linear interplation of surrounding points remove the method = "constant" argument as the default for na.approx is to employ linear interpolation.Pact
E
2

We can use mutate_at. The key is to specify the right columns in the vars argument, which uses the same rule as the select function. Therefore, in this case, vars(starts_with("var")) will also work.

library(dplyr)
library(zoo)

df1 %>% 
  group_by(iso) %>%
  mutate_at(vars(-iso, -year), funs(na.approx(., na.rm = FALSE, rule = 1))) %>%
  mutate_at(vars(-iso, -year), funs(na.locf(., na.rm = FALSE))) %>%
  mutate_at(vars(-iso, -year), funs(na.locf(., na.rm = FALSE, fromLast = TRUE)))
# # A tibble: 6 x 5
# # Groups:   iso [2]
#   iso    year  var1  var2  var3
#   <chr> <int> <dbl> <dbl> <dbl>
# 1 XXX    2005   165    29 2151 
# 2 XXX    2006   160    21 2139 
# 3 XXX    2007   172    15 2890.
# 4 XXX    2008   184     9 3640 
# 5 XXX    2009   184     9 3640 
# 6 YYY    2005   206   461 8049 

DATA

df1 <- read.table(text = "  iso year var1 var2 var3
1 XXX 2005  165   29 2151
2 XXX 2006  160   21 2139
3 XXX 2007   NA   NA   NA
4 XXX 2008  184    9 3640
5 XXX 2009   NA   NA   NA
6 YYY 2005  206  461 8049 ",
                 header = TRUE, stringsAsFactors = FALSE)
Exteriorize answered 13/7, 2018 at 18:52 Comment(1)
why you have used three different mutate_at? Can you add some comments for better understanding?Bifocals
B
2

You can re-write your code using mutate_at so that conversion can be done in one go as:

library(dplyr)
library(zoo)


df %>% 
  group_by(iso) %>%
  mutate_at(vars(starts_with("var")), 
            funs(na.locf(na.locf(na.approx(., na.rm = FALSE, rule = 1),na.rm=FALSE),
                                                              fromLast=TRUE)))


# # A tibble: 6 x 5
# # Groups: iso [2]
# iso    year  var1   var2  var3
# <chr> <int> <dbl>  <dbl> <dbl>
# 1 XXX    2005   165  29.0   2151
# 2 XXX    2006   160  21.0   2139
# 3 XXX    2007   172  15.0   2890
# 4 XXX    2008   184   9.00  3640
# 5 XXX    2009   184   9.00  3640
# 6 YYY    2005   206 461     8049
# 

Data:

df <- read.table(text=
"iso year var1 var2 var3
1 XXX 2005  165   29 2151
2 XXX 2006  160   21 2139
3 XXX 2007   NA   NA   NA
4 XXX 2008  184    9 3640
5 XXX 2009   NA   NA   NA
6 YYY 2005  206  461 8049",
header = TRUE, stringsAsFactors = FALSE)
Bacillary answered 13/7, 2018 at 19:1 Comment(2)
@Moody_Mudskipper No issue. Actually, that give me a chance to correct mutate_at:-)Bacillary
I am sorry if I was not clear, but my actual column names are not var1, var2 etc. so I cannot reference them like you suggested. Using @www 's var referencing and your nesting of functions made it work. Thanks!Vinita
C
2

Here is a base solution:

ave(df,df$iso, FUN =function(y){
  if(nrow(y) > 1) y[3:5] <- lapply(y[3:5], function(x) approx(y$year,x,y$year,rule=2)$y)
  y
})

#   iso year var1 var2   var3
# 1 XXX 2005  165   29 2151.0
# 2 XXX 2006  160   21 2139.0
# 3 XXX 2007  172   15 2889.5
# 4 XXX 2008  184    9 3640.0
# 5 XXX 2009  184    9 3640.0
# 6 YYY 2005  206  461 8049.0
Calorifacient answered 13/7, 2018 at 19:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.