Calculate timestamps for data with known frequency and missing data
I have data as follows, where lines of type "S" contain a timestamp, and I need to assign timestamps to the "D" lines.

   type  timestamp               count
   <chr> <dttm>                  <int>
 1 $     NA                         NA
 2 D     NA                        229
 3 M     NA                         NA
 4 D     NA                        230
 5 D     NA                        231
 6 D     NA                        232
 7 D     NA                        233
 8 D     NA                        234
 9 D     NA                        235
10 D     NA                        236
11 D     NA                        237
12 D     NA                        238
13 D     NA                        239
14 S     2024-01-24 16:11:11.000    NA
15 D     NA                        241
16 D     NA                        242
17 D     NA                        243
18 D     NA                        126
19 D     NA                        127
20 S     2024-01-24 16:13:29.000    NA
21 D     NA                        128

"Count" is a 1 byte iterator that goes from 0-255 and repeats. Missing counts indicate missing data lines. Data lines are sent at 16Hz, so each count iteration represents 1/16 sec. I'm trying to assign the correct timestamps using the counts of the D lines to get the nearest S line timestamp and calculate the timestamp by the difference in count between the current D line and the D line immediately following an S line. Typically the S lines are every second, but I picked this subset to show some of the issues with the data, mainly the gap of 2:18 at line 17.

I've figured out a way that works, but it is incredibly slow (4 ms/row; I need to process ~1M lines of data per day for files that span many days). The real data are in files with lines in several formats (ick), and the times and counts in this example are parsed out of that. This is beginning to sound like an Advent of Code problem, but sadly, this system is real.

If you'd like to check out my slow solution or see more complete data, it's in this file in the repo: https://github.com/blongworth/mlabtools/blob/main/R/time_alignment.R The data above are simplified, so the method in the repo won't work on the reprex data without modification. There are tests, but not yet a set for what the result of this reprex should look like.

Any ideas for how to do this efficiently? I'll likely have to go to data.table eventually, but as long as I have a start on more efficient logic, I think I can get there.

Here's dput output for the test df above (assign it to df to run the code below):

structure(list(type = c("$", "D", "M", "D", "D", "D", "D", "D", 
"D", "D", "D", "D", "D", "S", "D", "D", "D", "D", "D", "S", "D"
), timestamp = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 1706130671, NA, NA, NA, NA, NA, 1706130809, NA
), tzone = "America/New_York", class = c("POSIXct", "POSIXt")), 
    count = c(NA, 229L, NA, 230L, 231L, 232L, 233L, 234L, 235L, 
    236L, 237L, 238L, 239L, NA, 241L, 242L, 243L, 126L, 127L, 
    NA, 128L)), row.names = c(NA, -21L), class = c("tbl_df", 
"tbl", "data.frame"))

Here's the example data with the expected output:

   type  timestamp               count
   <chr> <dttm>                  <int>
 1 $     NA                         NA
 2 D     2024-01-24 16:11:10.250   229
 3 M     NA                         NA
 4 D     2024-01-24 16:11:10.312   230
 5 D     2024-01-24 16:11:10.375   231
 6 D     2024-01-24 16:11:10.437   232
 7 D     2024-01-24 16:11:10.500   233
 8 D     2024-01-24 16:11:10.562   234
 9 D     2024-01-24 16:11:10.625   235
10 D     2024-01-24 16:11:10.687   236
11 D     2024-01-24 16:11:10.750   237
12 D     2024-01-24 16:11:10.812   238
13 D     2024-01-24 16:11:10.875   239
14 S     2024-01-24 16:11:11.000    NA
15 D     2024-01-24 16:11:11.000   241
16 D     2024-01-24 16:11:11.062   242
17 D     2024-01-24 16:11:11.125   243
18 D     2024-01-24 16:13:28.875   126
19 D     2024-01-24 16:13:28.937   127
20 S     2024-01-24 16:13:29.000    NA
21 D     2024-01-24 16:13:29.000   128
Suber answered 26/1 at 12:49

Comments:
You appear to have lines of type $, D, M, S and possibly others. If the tick interval is exactly 1/16 s then you can predict what count to expect at any given timestamp, or vice versa. You run into trouble if there is more than one solution between a pair of timestamps. According to your description, the byte counter will have overflowed to zero something like 10 times between line 14 and line 20. There is no possibility of giving the intervening values unambiguous timestamps. You need more frequent timestamps or a longer counter. – Monograph
I need to make assumptions to cover most of the missing data cases. In this case, I am assuming "D" lines in rows 2-17 correspond to the "S" line in row 14, and "D" lines in rows 18-21 correspond to the "S" line in row 20. The real solution is to fix the data source, but I can't do that for old data I've been given. – Suber
Please post the expected result for the example data frame. – Christogram
Added; that should make what I'm after clearer. Thanks. – Suber
Your slow solution doesn't actually run on the example dataframe (Column row_num doesn't exist.) – Donniedonnish
@mark, the data here are a reprex dataframe that provides only the fields needed to figure out an algorithm. I'll make that clear in the question. – Suber
@Suber can you create a realistic reproducible example? – Donniedonnish
I ask because, otherwise, your example code is a bit useless. – Donniedonnish

Here's a shot that goes through some timestamp-gymnastics.

library(dplyr)
# library(tidyr) # fill
df |>
  mutate(count2 = count, nexttime = timestamp, prevtime = timestamp) |>
  tidyr::fill(count2, .direction = "updown") |>
  mutate(
    count2 = count2 + 256*cumsum(c(FALSE, diff(count2) < 0)),
    nextind = if_else(is.na(timestamp), count2[NA], count2),
    prevind = nextind
  ) |>
  tidyr::fill(prevtime, prevind, .direction = "down") |>
  tidyr::fill(nexttime, nextind, .direction = "up") |>
  mutate(
    newtimestamp = case_when(
      !is.na(timestamp) ~ timestamp,
      is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind) ~
        nexttime + (count2 - nextind)/16,
      TRUE ~
        prevtime + (count2 - prevind)/16
    )
  ) |>
  select(names(df), newtimestamp)
# # A tibble: 21 × 4
#    type  timestamp               count newtimestamp           
#    <chr> <dttm>                  <int> <dttm>                 
#  1 $     NA                         NA 2024-01-24 16:11:10.250
#  2 D     NA                        229 2024-01-24 16:11:10.250
#  3 M     NA                         NA 2024-01-24 16:11:10.312
#  4 D     NA                        230 2024-01-24 16:11:10.312
#  5 D     NA                        231 2024-01-24 16:11:10.375
#  6 D     NA                        232 2024-01-24 16:11:10.437
#  7 D     NA                        233 2024-01-24 16:11:10.500
#  8 D     NA                        234 2024-01-24 16:11:10.562
#  9 D     NA                        235 2024-01-24 16:11:10.625
# 10 D     NA                        236 2024-01-24 16:11:10.687
# 11 D     NA                        237 2024-01-24 16:11:10.750
# 12 D     NA                        238 2024-01-24 16:11:10.812
# 13 D     NA                        239 2024-01-24 16:11:10.875
# 14 S     2024-01-24 16:11:11.000    NA 2024-01-24 16:11:11.000
# 15 D     NA                        241 2024-01-24 16:11:11.000
# 16 D     NA                        242 2024-01-24 16:11:11.062
# 17 D     NA                        243 2024-01-24 16:11:11.125
# 18 D     NA                        126 2024-01-24 16:13:28.875
# 19 D     NA                        127 2024-01-24 16:13:28.937
# 20 S     2024-01-24 16:13:29.000    NA 2024-01-24 16:13:29.000
# 21 D     NA                        128 2024-01-24 16:13:29.000

Notes:

  • count2 is count with its NAs filled and the 0-255 rollovers unwrapped into a monotonically increasing index (see the small illustration after these notes).
  • nexttime/prevtime carry each timestamp forward and backward until the next non-NA timestamp; the case_when chooses which one to use.
  • nextind/prevind hold the unwrapped count at those anchor rows; subtracting one of them from count2 gives the offset in 1/16ths of a second.
  • The case_when is where most of the logic lives: keep the original timestamp if present; otherwise add (count2 - nextind)/16 seconds to nexttime, or (count2 - prevind)/16 seconds to prevtime, whichever anchor is closer in count (falling back to nexttime when there is no previous timestamp).
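As a standalone illustration of the wraparound handling, the cumsum line turns the rolling 0-255 counter into a monotone index:

count2 <- c(254L, 255L, 0L, 1L, 2L)
count2 + 256 * cumsum(c(FALSE, diff(count2) < 0))
# [1] 254 255 256 257 258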

A data.table solution looks fairly similar. Using R 4.3 or newer, we can use |> _[] piping:

library(data.table)
out <- as.data.table(df) |>
  _[, count2 := nafill(nafill(count, type = "nocb"), type = "locf") ] |>
  _[, count2 := count2 + 256*cumsum(c(FALSE, diff(count2) < 0)) ] |>
  _[, nextind := fifelse(is.na(timestamp), count2[NA], count2) ] |>
  _[, prevind := nextind ] |>
  _[, c("prevtime", "prevind") := lapply(.SD, nafill, type = "locf"), .SDcols = c("timestamp", "prevind")] |>
  _[, c("nexttime", "nextind") := lapply(.SD, nafill, type = "nocb"), .SDcols = c("timestamp", "nextind")] |>
  _[, newtimestamp := fcase(
    !is.na(timestamp), timestamp,
    is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind), nexttime + (count2 - nextind)/16,
    rep(TRUE, .N), prevtime + (count2 - prevind)/16) ] |>
  _[, .SD, .SDcols = c(names(df), "newtimestamp")]

If on R before 4.3, we can use data.table's ][-piping.

DT <- as.data.table(df) # setDT(df) is canonical, avoiding that here for side-effect
DT[, count2 := nafill(nafill(count, type = "nocb"), type = "locf")
   ][, count2 := count2 + 256*cumsum(c(FALSE, diff(count2) < 0))
   ][, nextind := fifelse(is.na(timestamp), count2[NA], count2)
   ][, prevind := nextind
   ][, c("prevtime", "prevind") := lapply(.SD, nafill, type = "locf"), .SDcols = c("timestamp", "prevind")
   ][, c("nexttime", "nextind") := lapply(.SD, nafill, type = "nocb"), .SDcols = c("timestamp", "nextind")
   ][, newtimestamp := fcase(
     !is.na(timestamp), timestamp,
     is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind), nexttime + (count2 - nextind)/16,
     rep(TRUE, .N), prevtime + (count2 - prevind)/16)
   ][, .SD, .SDcols = c(names(df), "newtimestamp")]

I prefer tidyr::fill's .direction = "updown" over the nested nafill calls; it reduces the nesting and is easier to read in pipes like this.
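For reference, .direction = "updown" fills up first and then down, which matches the nested nafill(nafill(x, type = "nocb"), type = "locf") used above:

x <- c(NA, 1L, NA, 3L, NA)
tidyr::fill(data.frame(x), x, .direction = "updown")$x
# [1] 1 1 3 3 3
data.table::nafill(data.table::nafill(x, type = "nocb"), type = "locf")
# [1] 1 1 3 3 3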

Ablebodied answered 30/1 at 1:37

Comments:
On my fairly ordinary laptop this chews through 100M rows of mock data in 12 seconds, which is roughly 30,000x faster than the 4 ms/row quoted in the OP (if my sums are right). It's also faster than a data.table/collapse solution I'd been working on, which ran in 20 s. – Confectionary
@Tobo, I think you should keep your data.table+collapse answer undeleted. Even if it takes slightly longer, we are not assured at this point that (1) my answer is correct with the real data, and (2) they may be on a package diet where tidyverse doesn't fit. – Ablebodied
I'd love to see a data.table solution. I've been meaning to learn it in general. – Suber
Thanks! Still working through the dplyr solution with a larger, messier dataset, but it's looking great so far! The two edge cases I've seen so far (both out of scope of the original post) are matching "D" lines with timestamps across group boundaries (short subsets of data from the field are sent via cell every 4 h) and an issue with duplicate timestamps (rare). Grouping by subset/send before adding timestamps solves the first case. I think the second is caused by missing the first "D" line after an "S" line, which offsets the time of that "D" line back by 1/16 s. – Suber
I ended up using this solution, but modified it to use the previous timestamp unless there was a gap in count, in order to take care of the problem of duplicate timestamps when timestamping line n from the previous timestamp and line n+1 from the next timestamp produces the same time. This wasn't seen in the example data, so better to accept this as is. Thank you all again for your help! – Suber
