Fastest way to add rows for missing time steps?

I have a column in my datasets where time periods (Time) are integers ranging from a to b. Any given group can be missing some of its time periods, and I'd like to fill in those rows with NA values. Below is example data for one group (out of several thousand).

structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894, 
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id", 
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")


  Id Time      Value
1  1    1  0.5687805
2  1    2 -0.7207750
3  1    4  1.2425819
4  1    5  0.6821231

As you can see, Time 3 is missing; often one or more periods are. I can solve this on my own, but I'm afraid my way wouldn't be the most efficient. My approach would be to create a function that would:

1. Generate a sequence of time periods from min(Time) to max(Time).
2. Use setdiff() to find the missing Time values.
3. Convert that vector to a data.frame.
4. Pull the unique identifier variables (Id and others not listed above) and add them to this data.frame.
5. Merge the two.
6. Return the result from the function.

So the entire process would then get executed as below:

library(plyr)

# Split the data into individual data.frames by Id
temp_list <- dlply(original_data, .(Id))
# Pad each data.frame
tlist2 <- llply(temp_list, my_pad_function)
# Collapse the list back into a single data.frame
filled_in_data <- ldply(tlist2)
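
For concreteness, a minimal sketch of what such a padding function could look like (hypothetical; it assumes a single Id column and that Value is the only measurement column):

my_pad_function <- function(df) {
  # Time values that should be present for this group but aren't
  missing_times <- setdiff(seq(min(df$Time), max(df$Time)), df$Time)
  if (length(missing_times) == 0L) return(df)
  # Pad with this group's Id and NA for the measurement
  pad <- data.frame(Id = df$Id[1], Time = missing_times, Value = NA_real_)
  out <- rbind(df, pad)
  out[order(out$Time), ]
}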

Is there a better way to achieve this?

Pothead answered 3/5, 2012 at 20:37 Comment(4)
I would do basically what you describe, only using expand.grid and then merge with all = TRUE. Not sure splitting by Id first is necessary, really.Fuller
Here is an added complication: there are numerous id variables. I only need to add Time and set Value to NA, padding the rest. So then it becomes data_to_merge <- data.frame(id=unique(data$id),...) (which is a really long line and not portable if the structure of the data changes). I wish I could just merge the missing Time values, add NA, and grab all the rest from the original data efficiently.Pothead
I've got it working now, but could still use a general solution, since this would be in a package and I don't know what a user might submit as original data.Pothead
For ungrouped data, see also several nice answers in How to Add zero to missing value in r.Languish
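
A sketch of the expand.grid/merge approach suggested in the comments (assuming the Id/Time/Value columns from the question):

# All Id x Time combinations over the full time range
all_rows <- expand.grid(Id = unique(original_data$Id),
                        Time = seq(min(original_data$Time), max(original_data$Time)))
# An outer merge pads the missing combinations with NA
merge(original_data, all_rows, all = TRUE)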

Following up on the comments with Ben Barnes, and starting from his mydf3 (created in his answer below):

library(data.table)

DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7           NA  NA
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5         NA
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7         NA
 [8,]  1   1    8         NA
 [9,]  1   1    9         NA
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

CJ stands for Cross Join; see ?CJ. The padding with NAs happens because nomatch defaults to NA. Set nomatch to 0 instead to drop the non-matching rows. If, instead of padding with NAs, the prevailing row is required, just add roll=TRUE. This can be more efficient than padding with NAs and then filling the NAs in afterwards. See the description of roll in ?data.table.
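
For instance, a small illustration of nomatch (reusing DT from above; recent versions of data.table prefer nomatch = NULL over nomatch = 0):

setkey(DT,Id,Time)
# Drop the unmatched (Id, Time) combinations instead of padding them with NA
DT[CJ(unique(Id),seq(min(Time),max(Time))), nomatch = 0]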

setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7 -0.691736451   1
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5 -1.9126874
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7 -0.6917365
 [8,]  1   1    8 -0.6917365
 [9,]  1   1    9 -0.6917365
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

Instead of setting keys, you may use on. CJ also takes a unique argument. A small example with two Id values:

d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)

d[CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE), on = .(Id, Time)]
#     Id Time val
# 1:   1    1   1
# 2:   1    2   2
# 3:   1    3  NA
# 4:   1    4   3
# 5:   1    5   4
# 6:   2    1  NA
# 7:   2    2   5
# 8:   2    3   6
# 9:   2    4   7
# 10:  2    5  NA

In this particular case, where one of the vectors in CJ was generated with seq, the result needs to be named explicitly in order to match the names specified in on. When using bare variables in CJ (like Id here), they are auto-named, just as in data.table() (since data.table 1.12.2).

Evanish answered 6/5, 2012 at 20:18 Comment(9)
I never fully understood the use of roll; how can it help me get the NAs?Macymad
@Macymad The DataCamp data.table course explains it best, with Arun and me on screen with animated graphics alongside. Chapter 3 covers joins and rolling joins. It's here.Evanish
How would you pad with seq(min(Time), max(Time)) by group?Cancel
@Cancel For padding with seq() please ask a new question. Too long for comment.Evanish
@Cancel See e.g. Efficient way to Fill Time-Series per group and Filling missing dates by groupLanguish
@MattDowle Perhaps edit your very nice answer to also include more recent features such as on as an alternative to setkey, and the use of unique argument in CJ. CheersLanguish
@Languish good idea. If anybody else watching who knows about that could make the edit, would really help so I can keep going on the issue/pr list.Evanish
@MattDowle I made an edit. Hope it looks ok. Cheers.Languish
@Languish Looks great. Thanks.Evanish

You can use tidyr for this.

Use tidyr::complete to fill in rows for Time, and by default the values are filled in with NA.

Create Data

I extended the sample data to show that this works for multiple Ids, even when the full range of Time is not present within an Id.

library(dplyr)
library(tidyr)


df <- tibble(
  Id = c(1, 1, 1, 1, 2, 2, 2),
  Time = c(1, 2, 4, 5, 2, 3, 5),
  Value = c(0.56, -0.72, 1.24, 0.68, 1.46, 0.74, 0.99)
)

df
#> # A tibble: 7 x 3
#>      Id  Time Value
#>   <dbl> <dbl> <dbl>
#> 1     1     1  0.56
#> 2     1     2 -0.72
#> 3     1     4  1.24
#> 4     1     5  0.68
#> 5     2     2  1.46
#> 6     2     3  0.74
#> 7     2     5  0.99

Fill in the missing rows

df %>% complete(nesting(Id), Time = seq(min(Time), max(Time), 1L))

#> # A tibble: 10 x 3
#>       Id  Time Value
#>    <dbl> <dbl> <dbl>
#> 1      1     1  0.56
#> 2      1     2 -0.72
#> 3      1     3    NA
#> 4      1     4  1.24
#> 5      1     5  0.68
#> 6      2     1    NA
#> 7      2     2  1.46
#> 8      2     3  0.74
#> 9      2     4    NA
#> 10     2     5  0.99
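
If something other than NA is wanted in the padded rows, complete() also takes a fill argument (a small sketch, reusing df from above):

df %>%
  complete(nesting(Id), Time = seq(min(Time), max(Time), 1L),
           fill = list(Value = 0))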
Mattingly answered 30/5, 2017 at 21:19 Comment(1)
I found this solution the easiest to work with for this problem, especially when already using the tidyverse for data manipulation.Teetotaler

Please see Matthew Dowle's answer (by now, hopefully above).

Here's something that uses the data.table package, and it may help when there is more than one ID variable. It may also be faster than merge, depending on how you want your results. I'd be interested in benchmarking and/or suggested improvements.

First, create some more demanding data with two ID variables

library(data.table)

set.seed(1)

mydf3<-data.frame(Id=sample(1:100,10000,replace=TRUE),
  Value=rnorm(10000))
mydf3<-mydf3[order(mydf3$Id),]

mydf3$Time<-unlist(by(mydf3,mydf3$Id,
  function(x)sample(1:(nrow(x)+3),nrow(x)),simplify=TRUE))

mydf3$Id2<-sample(1:2,nrow(mydf3),replace=TRUE)

Create a function (this has been edited; see the revision history)

padFun <- function(data, idvars, timevar) {
  # Coerce the ID variables to character
  data[, idvars] <- lapply(data[, idvars, drop = FALSE], as.character)
  # Create a global ID variable of all individual ID vars pasted together
  globalID <- Reduce(function(...) paste(..., sep = "SOMETHINGWACKY"),
                     data[, idvars, drop = FALSE])
  # Create a data.frame of all possible combinations of globalIDs and times
  allTimes <- expand.grid(globalID = unique(globalID),
                          allTime = min(data[, timevar]):max(data[, timevar]),
                          stringsAsFactors = FALSE)
  # Split the global ID back into the original ID variables
  allTimes2 <- data.frame(allTimes$allTime,
                          do.call(rbind, strsplit(allTimes$globalID, "SOMETHINGWACKY")),
                          stringsAsFactors = FALSE)
  # Convert the combinations data.frame to a data.table keyed on idvars and timevar
  allTimesDT <- data.table(allTimes2)
  setnames(allTimesDT, 1:ncol(allTimesDT), c(timevar, idvars))
  setkeyv(allTimesDT, c(idvars, timevar))
  # Convert the original data to a data.table with the same key
  dataDT <- data.table(data, key = c(idvars, timevar))
  # Join the two data.tables to create the padding
  res <- dataDT[allTimesDT]
  return(res)
}

Use the function

(padded2<-padFun(data=mydf3,idvars=c("Id"),timevar="Time"))

#       Id Time        Value Id2
#  [1,]  1    1 -0.262482283   2
#  [2,]  1    2 -1.423935165   2
#  [3,]  1    3  0.500523295   1
#  [4,]  1    4 -1.912687398   1
#  [5,]  1    5 -1.459766444   2
#  [6,]  1    6 -0.691736451   1
#  [7,]  1    7           NA  NA
#  [8,]  1    8  0.001041489   2
#  [9,]  1    9  0.495820559   2
# [10,]  1   10 -0.673167744   1
# First 10 rows of 12800 printed.

(padded<-padFun(data=mydf3,idvars=c("Id","Id2"),timevar="Time"))

#      Id Id2 Time      Value
#  [1,]  1   1    1         NA
#  [2,]  1   1    2         NA
#  [3,]  1   1    3  0.5005233
#  [4,]  1   1    4 -1.9126874
#  [5,]  1   1    5         NA
#  [6,]  1   1    6 -0.6917365
#  [7,]  1   1    7         NA
#  [8,]  1   1    8         NA
#  [9,]  1   1    9         NA
# [10,]  1   1   10 -0.6731677
# First 10 rows of 25600 printed.

The edited function splits the globalID back into its component parts in the combinations data.frame before merging with the original data. This should (I think) be better.

Slicer answered 4/5, 2012 at 0:11 Comment(2)
Right package, but far too complex. Did you somehow miss roll=TRUE which is specifically for this? Keep the data irregular in the table, then join the regular time series to it. See the 3rd section of the "intro to data.table" vignette and the example in ?data.table using roll=TRUE. It's one of the main features of the package.Evanish
@MatthewDowle, You know your own functions much better than I, and any suggestions for improvement are very welcome. Since the OP wanted to "pad" non-ID and non-time variables with NA for each missing Time, I thought using roll=TRUE would be the wrong approach, since this would "pad" with the previous value, right? It seems that most of the complexity in the function has to do with considering multiple ID variables, not with the join part, which is very nicely achievable with data.table.Slicer

My general approach is to use freqTable <- as.data.frame(table(idvar1, idvar2, idvarN)), then pull out the rows where Freq == 0, pad them as necessary, and stack them back onto the original data.
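
A minimal sketch of that approach, assuming the Id/Time/Value columns from the question (note that table() only crosses the Time values observed somewhere in the data, not the full min-to-max range):

# Cross-tabulate the observed Id/Time combinations
freqTable <- as.data.frame(table(Id = original_data$Id, Time = original_data$Time))
# Combinations that never occur are the missing rows
missing <- freqTable[freqTable$Freq == 0, c("Id", "Time")]
# table() returns factors, so convert back to numeric
missing$Id <- as.numeric(as.character(missing$Id))
missing$Time <- as.numeric(as.character(missing$Time))
missing$Value <- NA_real_
# Stack the padding back onto the original data
padded <- rbind(original_data, missing)
padded <- padded[order(padded$Id, padded$Time), ]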

Hernia answered 3/5, 2012 at 21:25 Comment(0)
