Consider the following data.table of events:
library(data.table)
breaks <- data.table(id = 1:8,
                     Channel = c("NP1", "NP1", "NP2", "NP2", "NP3", "NP3", "AT4", "AT4"),
                     Time = c(1000, 1100, 975, 1075, 1010, 1080, 1000, 1050),
                     Day = c(1, 1, 1, 1, 1, 1, 1, 1),
                     ZA = c(15, 12, 4, 2, 1, 2, 23, 18),
                     stringsAsFactors = FALSE)
breaks
   id Channel Time Day ZA
1:  1     NP1 1000   1 15
2:  2     NP1 1100   1 12
3:  3     NP2  975   1  4
4:  4     NP2 1075   1  2
5:  5     NP3 1010   1  1
6:  6     NP3 1080   1  2
7:  7     AT4 1000   1 23
8:  8     AT4 1050   1 18
For each event in breaks I want to find the nearest event (by Time) in each of the other channels on the same Day, and then sum the ZA values of those events.
This is the result I want to achieve:
   id Channel Time Day ZA Sum
1:  1     NP1 1000   1 15  28
2:  2     NP1 1100   1 12  22
3:  3     NP2  975   1  4  39
4:  4     NP2 1075   1  2  32
5:  5     NP3 1010   1  1  42
6:  6     NP3 1080   1  2  32
7:  7     AT4 1000   1 23  20
8:  8     AT4 1050   1 18  19
So for the first row the channel is NP1. The closest events to Time = 1000 in the other channels are rows 3, 5 and 7, so the sum is 4 + 1 + 23 = 28.
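A quick brute-force check of that logic for row 1 (just to illustrate what "nearest" means here, not a solution):
## nearest event to Time = 1000 in each channel other than NP1 on Day 1
breaks[Day == 1 & Channel != "NP1", .SD[which.min(abs(Time - 1000))], by = Channel]
## -> ids 3, 5 and 7, so Sum = 4 + 1 + 23 = 28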
I got this to work using data.table with the following code:
breaks[breaks[, c("Day", "Time", "Channel", "ZA")], on = "Day", allow.cartesian = TRUE][
Channel != i.Channel][
order(id)][
, delta := abs(Time - i.Time)][
, .SD[delta == min(delta)], by = .(Channel, Time, Day, i.Channel)][
, unique(.SD, by = c("id", "i.Channel"))][
, .(Sum = sum(i.ZA)), by = .(id, Channel, Time, Day, ZA)]
However, the first step joins every event to every other event on the same Day, so even this 8-row example produces a 64-row intermediate table, and the intermediate result grows quadratically within each Day. I need to do this on a dataset of more than a million rows.
Can anyone help me find a more efficient way of doing this?
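For what it's worth, one direction that might avoid the per-Day cartesian blow-up is a roll = "nearest" join against one lookup row per event per other channel, so the intermediate table has (events) x (channels - 1) rows instead. This is only a sketch (the chans/lookup/orig_id names are just illustrative, ties at equal distance may be resolved differently than in my code above, and I have not benchmarked it):
library(data.table)

chans <- unique(breaks$Channel)

## one lookup row per event per *other* channel
lookup <- breaks[, .(Channel = setdiff(chans, Channel)),
                 by = .(orig_id = id, Day, Time)]

## for each lookup row, take the event in that Channel/Day whose Time is nearest;
## id and ZA in `nearest` come from the matched event
nearest <- breaks[lookup, on = .(Channel, Day, Time), roll = "nearest"]

## sum ZA of the matched events per original event and join back onto breaks
sums <- nearest[, .(Sum = sum(ZA, na.rm = TRUE)), by = .(id = orig_id)]  # na.rm in case a channel has no event on a Day
result <- breaks[sums, on = "id"]
setorder(result, id)
result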
Edit:
I tried out the solutions from G. Grothendieck (sqldf), eddi (data.table) and MarkusN (dplyr) once each on the full in-memory dataset of 1.4 million rows with 39 different channels:
sqldf: 54 minutes
data.table: 11 hours
dplyr: 29 hours