I am working with two datasets. One describes time windows by their start and stop times; the other contains a large list of events with their timestamps.
I want to combine these into a single dataframe that contains the start and stop time of each window, together with how many events of each name fall within that window.
I have managed to "solve" my problem with:
import polars as pl

actions = {
    "id": ["a", "a", "a", "a", "b", "b", "a", "a"],
    "action": ["start", "stop", "start", "stop", "start", "stop", "start", "stop"],
    "time": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
}
events = {
    "name": ["x", "x", "x", "y", "y", "z", "w", "w", "w"],
    "time": [0.0, 0.1, 0.5, 1.1, 2.5, 3.0, 4.5, 4.9, 5.5],
}

# One row per (id, start, stop) window.
actions_df = (
    pl.DataFrame(actions)
    .group_by("id")
    .agg(
        start=pl.col("time").filter(pl.col("action") == "start"),
        stop=pl.col("time").filter(pl.col("action") == "stop"),
    )
    .explode(["start", "stop"])
)

# Pair every window with every event, keep the events inside the window,
# then count them per window and per event name.
df = (
    actions_df.join(pl.DataFrame(events), how="cross")
    .filter((pl.col("time") >= pl.col("start")) & (pl.col("time") <= pl.col("stop")))
    .group_by(["id", "start", "stop", "name"])
    .agg(count=pl.count("name"))
    .pivot("name", index=["id", "start", "stop"], values="count")
    .fill_null(0)
)

# Re-attach windows that contain no events at all.
result_df = (
    actions_df.join(df, on=["id", "start", "stop"], how="left")
    .fill_null(0)
    .sort("start")
)
print(result_df)
"""
┌─────┬───────┬──────┬─────┬─────┬─────┬─────┐
│ id ┆ start ┆ stop ┆ w ┆ y ┆ x ┆ z │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════╪═══════╪══════╪═════╪═════╪═════╪═════╡
│ a ┆ 0.0 ┆ 1.0 ┆ 0 ┆ 0 ┆ 3 ┆ 0 │
│ a ┆ 2.0 ┆ 3.0 ┆ 0 ┆ 1 ┆ 0 ┆ 1 │
│ b ┆ 4.0 ┆ 5.0 ┆ 2 ┆ 0 ┆ 0 ┆ 0 │
│ a ┆ 6.0 ┆ 7.0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└─────┴───────┴──────┴─────┴─────┴─────┴─────┘
"""
My issue is that this approach "explodes" in RAM and my process gets killed. I suspect that the join(..., how="cross") materializes a huge intermediate dataframe, only for the filter to throw most of it away again.
Can I get some help/hints on a better way to solve this?
To give some orders of magnitude: my "actions" datasets contain on the order of 100-500 time windows (~1 MB), my "events" datasets contain on the order of 10 million events (~200 MB), and my process gets killed on a machine with 16 GB of RAM.
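A rough back-of-the-envelope (the per-row byte count is only a guess) shows why this cannot fit:

n_windows = 500                           # upper end of my window count
n_events = 10_000_000                     # ~10 million events
cross_rows = n_windows * n_events         # 5 billion intermediate rows before the filter
bytes_per_row = 40                        # rough guess: two small strings plus a few f64 columns
print(cross_rows * bytes_per_row / 1e9)   # ~200 GB, far more than my 16 GB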
EDIT: In the real data, my intervals can overlap. Thanks to @RomanPekar for bringing this up.
actions_df non-overlapping? I see it's the case in the test example, but is it also true in real production data? – Leadwort

actions_df.lazy().join(pl.LazyFrame(events), how="cross").filter(...).collect(streaming=True) - but it's still not an "optimal" approach. – Grenadines
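For reference, the lazy/streaming variant Grenadines suggests would look roughly like this (an untested sketch that reuses actions_df and events from the example above and keeps the same filter and aggregation; whether the streaming engine actually avoids materializing the full cross product depends on the Polars version):

counts = (
    actions_df.lazy()
    .join(pl.LazyFrame(events), how="cross")
    .filter((pl.col("time") >= pl.col("start")) & (pl.col("time") <= pl.col("stop")))
    .group_by(["id", "start", "stop", "name"])
    .agg(count=pl.col("name").count())
    .collect(streaming=True)  # streaming flag as in the comment; newer Polars prefers engine="streaming"
)
# The pivot, fill_null and final left join from above can then run eagerly on
# this much smaller counts frame.

As Grenadines notes, this only hands the same cross join to the streaming engine; it is still not an optimal approach.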