Replace missing values (NA) with most recent non-NA by group
Asked Answered
A

7

54

I would like to solve the following problem with dplyr. Preferable with one of the window-functions. I have a data frame with houses and buying prices. The following is an example:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    NA
1            1998    120
1            1999    NA
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    NA
3            1995    NA
3            1996    44
3            1997    NA
3            1998    NA
3            1999    NA

I would like to make a data frame like this:

houseID      year    price 
1            1995    NA
1            1996    100
1            1997    100
1            1998    120
1            1999    120
2            1995    NA
2            1996    NA
2            1997    NA
2            1998    30
2            1999    30
3            1995    NA
3            1996    44
3            1997    44
3            1998    44
3            1999    44

Here are some data in the right format:

# Number of houses
N = 15

# Data frame
df = data.frame(houseID = rep(1:N,each=10), year=1995:2004, price =ifelse(runif(10*N)>0.15, NA,exp(rnorm(10*N))))

Is there a dplyr-way to do that?

Aleron answered 28/4, 2014 at 11:42 Comment(0)
P
51

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
  1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
  1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
  30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
  "price"), class = "data.frame", row.names = c(NA, -15L))

REVISED Re-arranged and added more solutions. Revised dplyr/zoo solution to conform to latest changes dplyr. Applied fixed and factored out na.locf2 from all solutions.

Playpen answered 28/4, 2014 at 12:43 Comment(11)
I have shown the output I get near the botom. Try grabbing the latest version of dplyr in case its your version of dplyr that is causing problems: devtools::install_github("haldey/dplyr")Playpen
I tried your code and got the error: Error in na.locf(.) : object '.' not found. I am working on a server and can not change dplyr-version on the fly (It is 2 month old).Aleron
Here is the by solution reworked to use dplyr: df %.% by(df$houseID, na.locf) %.% do.call(what = rbind) . If none of these dplyr solutions work for you, likely because of your old version, then use one of the other solutions I have provided.Playpen
This on works: df %.% group_by(houseID) %.% mutate(p2 = na.locf(price,na.rm=F))Aleron
I have added a second dplyr solution which combines the by and the dplyr solutions and may (or may not) work with your older version of dplyr.Playpen
Thanks a lot. Very impressive!Aleron
Since there are many way to do this, one criterion to choose the best approach is speed. I have tested zoo, dplyr and data.table solutions; in my experience the fastest is data.table.Camelback
Other ways are readability, conciseness, simplicity and lack of dependencies.Playpen
how to only apply {na.locf} to selected column,say price (assuming year has missing values too).This your example: df %>% group_by(houseID) %>% na.locf %>% ungroupDifficile
Use mutate as in the comment by @Peter S.Playpen
The first solution has the problem of filling in from the previous group if the first value in a group is NA. For example, data.frame(x=c(1,1,1,2,2), y=c(1:3, 2,3), z=c(4:6, 1,2)) %>% group_by(x) %>% complete(y=1:3) %>% na.locf(na.rm=FALSE) fills in the z column with 6 when it should be undefined (i.e., NA)Lilongwe
I
84

tidyr::fill now makes this stupidly easy:

library(dplyr)
library(tidyr)
# or library(tidyverse)

df %>% group_by(houseID) %>% fill(price)
# Source: local data frame [15 x 3]
# Groups: houseID [3]
# 
#    houseID  year price
#      (int) (int) (int)
# 1        1  1995    NA
# 2        1  1996   100
# 3        1  1997   100
# 4        1  1998   120
# 5        1  1999   120
# 6        2  1995    NA
# 7        2  1996    NA
# 8        2  1997    NA
# 9        2  1998    30
# 10       2  1999    30
# 11       3  1995    NA
# 12       3  1996    44
# 13       3  1997    44
# 14       3  1998    44
# 15       3  1999    44
Inherited answered 12/5, 2016 at 6:50 Comment(0)
P
51

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
  1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
  1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
  30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
  "price"), class = "data.frame", row.names = c(NA, -15L))

REVISED Re-arranged and added more solutions. Revised dplyr/zoo solution to conform to latest changes dplyr. Applied fixed and factored out na.locf2 from all solutions.

Playpen answered 28/4, 2014 at 12:43 Comment(11)
I have shown the output I get near the botom. Try grabbing the latest version of dplyr in case its your version of dplyr that is causing problems: devtools::install_github("haldey/dplyr")Playpen
I tried your code and got the error: Error in na.locf(.) : object '.' not found. I am working on a server and can not change dplyr-version on the fly (It is 2 month old).Aleron
Here is the by solution reworked to use dplyr: df %.% by(df$houseID, na.locf) %.% do.call(what = rbind) . If none of these dplyr solutions work for you, likely because of your old version, then use one of the other solutions I have provided.Playpen
This on works: df %.% group_by(houseID) %.% mutate(p2 = na.locf(price,na.rm=F))Aleron
I have added a second dplyr solution which combines the by and the dplyr solutions and may (or may not) work with your older version of dplyr.Playpen
Thanks a lot. Very impressive!Aleron
Since there are many way to do this, one criterion to choose the best approach is speed. I have tested zoo, dplyr and data.table solutions; in my experience the fastest is data.table.Camelback
Other ways are readability, conciseness, simplicity and lack of dependencies.Playpen
how to only apply {na.locf} to selected column,say price (assuming year has missing values too).This your example: df %>% group_by(houseID) %>% na.locf %>% ungroupDifficile
Use mutate as in the comment by @Peter S.Playpen
The first solution has the problem of filling in from the previous group if the first value in a group is NA. For example, data.frame(x=c(1,1,1,2,2), y=c(1:3, 2,3), z=c(4:6, 1,2)) %>% group_by(x) %>% complete(y=1:3) %>% na.locf(na.rm=FALSE) fills in the z column with 6 when it should be undefined (i.e., NA)Lilongwe
E
13

You can do a rolling self-join, supported by data.table:

require(data.table)
setDT(df)   ## change it to data.table in place
setkey(df, houseID, year)     ## needed for fast join
df.woNA <- df[!is.na(price)]  ## version without the NA rows

# rolling self-join will return what you want
df.woNA[df, roll=TRUE]  ## will match previous year if year not found
Egad answered 28/4, 2014 at 12:34 Comment(0)
L
8

Pure dplyr solution (no zoo).

df %>% 
 group_by(houseID) %>%
 mutate(price_change = cumsum(0 + !is.na(price))) %>%
 group_by(price_change, add = TRUE) %>%
 mutate(price_filled = nth(price, 1)) %>%
 ungroup() %>%
 select(-price_change) -> df2

Interesting part of example solution is at the end of df2.

> tail(df2, 20)
Source: local data frame [20 x 4]

    houseID year     price price_filled
 1       14 1995        NA           NA
 2       14 1996        NA           NA
 3       14 1997        NA           NA
 4       14 1998        NA           NA
 5       14 1999 0.8374778    0.8374778
 6       14 2000        NA    0.8374778
 7       14 2001        NA    0.8374778
 8       14 2002        NA    0.8374778
 9       14 2003 2.1918880    2.1918880
10       14 2004        NA    2.1918880
11       15 1995        NA           NA
12       15 1996 0.3982450    0.3982450
13       15 1997        NA    0.3982450
14       15 1998 1.7727000    1.7727000
15       15 1999        NA    1.7727000
16       15 2000        NA    1.7727000
17       15 2001        NA    1.7727000
18       15 2002 7.8636329    7.8636329
19       15 2003        NA    7.8636329
20       15 2004        NA    7.8636329
Lithium answered 13/12, 2014 at 17:29 Comment(0)
D
3

A dplyr and imputeTS combination.

library(dplyr)
library(imputeTS)
df %>% group_by(houseID) %>% 
mutate(price = na.locf(price, na.remaining="keep"))  

You could also replace na.locf with more advanced missing data replacement (imputation) functions from imputeTS. For example na.interpolation or na.kalman. For this just replace na.locf with the name of the function you like.

Declination answered 8/10, 2018 at 3:56 Comment(0)
A
2

Without dplyr:

  prices$price <-unlist(lapply(split(prices$price,prices$houseID),
function(x) zoo::na.locf(x,na.rm=FALSE)))

prices
   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44
Abject answered 28/4, 2014 at 12:55 Comment(1)
You should ad the library(zoo) in your code since you are using that library.Assistant
C
2

Since data.table v1.12.4, the package has a nafill() function, similar to tidyr::fill() or zoo::na.locf() and you can do:

require(data.table)
setDT(df)

df[ , price := nafill(price, type = 'locf'), houseID ]

There is also setnafill(), though not allowing for a group by, but multiple columns.

setnafill(df, type = 'locf', cols = 'price')

Data taken from @G. Grothendieck's answer:

df = data.frame(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                            2L, 3L, 3L, 3L, 3L, 3L),
                year = c(1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L,
                         1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 1998L, 1999L),
                price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L,
                          NA, NA, NA))
Crucify answered 10/10, 2020 at 17:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.