I have data as follows, where data of type "S" contains a timestamp, and I need to assign timestamps to "D" lines.
type timestamp count
<chr> <dttm> <int>
1 $ NA NA
2 D NA 229
3 M NA NA
4 D NA 230
5 D NA 231
6 D NA 232
7 D NA 233
8 D NA 234
9 D NA 235
10 D NA 236
11 D NA 237
12 D NA 238
13 D NA 239
14 S 2024-01-24 16:11:11.000 NA
15 D NA 241
16 D NA 242
17 D NA 243
18 D NA 126
19 D NA 127
20 S 2024-01-24 16:13:29.000 NA
21 D NA 128
"Count" is a 1 byte iterator that goes from 0-255 and repeats. Missing counts indicate missing data lines. Data lines are sent at 16Hz, so each count iteration represents 1/16 sec. I'm trying to assign the correct timestamps using the counts of the D lines to get the nearest S line timestamp and calculate the timestamp by the difference in count between the current D line and the D line immediately following an S line. Typically the S lines are every second, but I picked this subset to show some of the issues with the data, mainly the gap of 2:18 at line 17.
I've figured out in a way that works, but is incredibly slow (4ms/row, need to process ~1M lines of data per day for files that span many days). The real data is in files with lines in several formats (ick), and the times and counts in this example are parsed out of that. This is beginning to sound like a adventofcode problem, but sadly, this system is real.
If you'd like to check out my slow solution or see more complete data, it's in this file in the repo: https://github.com/blongworth/mlabtools/blob/main/R/time_alignment.R The data above are simplified, so the method in the repo won't work on the reprex data without modification. There are tests, but not a set for how the result from this Reprex should look yet.
Any ideas for how to do this efficiently? I'll likely have to go to data.tables eventually, but as long as I have a start on more efficient logic, I think I can get there.
Here's dput output for the test df above:
structure(list(type = c("$", "D", "M", "D", "D", "D", "D", "D",
"D", "D", "D", "D", "D", "S", "D", "D", "D", "D", "D", "S", "D"
), timestamp = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 1706130671, NA, NA, NA, NA, NA, 1706130809, NA
), tzone = "America/New_York", class = c("POSIXct", "POSIXt")),
count = c(NA, 229L, NA, 230L, 231L, 232L, 233L, 234L, 235L,
236L, 237L, 238L, 239L, NA, 241L, 242L, 243L, 126L, 127L,
NA, 128L)), row.names = c(NA, -21L), class = c("tbl_df",
"tbl", "data.frame"))
Here's the example data with the expected output:
type timestamp count
<chr> <dttm> <int>
1 $ NA NA
2 D 2024-01-24 16:11:10.250 229
3 M NA NA
4 D 2024-01-24 16:11:10.312 230
5 D 2024-01-24 16:11:10.375 231
6 D 2024-01-24 16:11:10.437 232
7 D 2024-01-24 16:11:10.500 233
8 D 2024-01-24 16:11:10.562 234
9 D 2024-01-24 16:11:10.625 235
10 D 2024-01-24 16:11:10.687 236
11 D 2024-01-24 16:11:10.750 237
12 D 2024-01-24 16:11:10.812 238
13 D 2024-01-24 16:11:10.875 239
14 S 2024-01-24 16:11:11.000 NA
15 D 2024-01-24 16:11:11.000 241
16 D 2024-01-24 16:11:11.062 242
17 D 2024-01-24 16:11:11.125 243
18 D 2024-01-24 16:13:28.875 126
19 D 2024-01-24 16:13:28.937 127
20 S 2024-01-24 16:13:29.000 NA
21 D 2024-01-24 16:13:29.000 128
row_num
doesn't exist.) – Donniedonnish