See the performance comparison at the bottom.
I like the elegance and intuitiveness of the `repeat_by` approach, but I'm a glutton for punishment, so here's an approach that splits the data up by the condition and then puts it back together. It's worse than the simple approach but might be helpful for another operation/use case.
```python
pl.concat(
    [
        part.lazy().select("i", "x", pl.lit(None, pl.Boolean).alias("y")).explode("x")
        if isnull[0]
        else part.lazy().explode("x", "y")
        for isnull, part in df.with_row_index("i").group_by(
            pl.col("y").is_null(), maintain_order=True
        )
    ]
).sort("i").drop("i").collect()
```
This one adds a `with_row_index` so you can restore the original order; if order isn't important, you can remove it along with the subsequent `sort`/`drop`. It also turns the `part`s lazy and collects at the end, because when you concat multiple LazyFrames, Polars runs each of their plans in parallel. Again, if that isn't important, you can remove the two `.lazy()` calls and the `.collect()`.
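For instance, here's the stripped-down eager version that paragraph describes, run on a small toy frame (the data is hypothetical, just assuming the question's shape: list columns `x` and `y`, with `y` null on some rows):

```python
import polars as pl

toy = pl.DataFrame({
    "x": [[1, 2], [3], [4, 5, 6]],
    "y": [[True, False], None, [False, True, True]],
})

# No row index and no lazy/collect: each half keeps its internal order,
# but the null-y rows all end up together rather than in their original spots.
out = pl.concat(
    [
        part.select("x", pl.lit(None, pl.Boolean).alias("y")).explode("x")
        if isnull[0]
        else part.explode("x", "y")
        for isnull, part in toy.group_by(pl.col("y").is_null(), maintain_order=True)
    ]
)
# out.rows() == [(1, True), (2, False), (4, False), (5, True), (6, True), (3, None)]
```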
If you're starting from a LazyFrame, you can't use `group_by` as an iterator directly, but you can use `map_groups` to get the same effect. You have to make a function such as:
```python
def part_explode(part: pl.DataFrame) -> pl.DataFrame:
    # When y is null, list.len() (and hence the comparison) is null, which is falsy
    if part.select(
        pl.col("x").first().list.len() == pl.col("y").first().list.len()
    ).item():
        return part.explode("x", "y")
    else:
        # y is null: broadcast a null Boolean literal and explode x alone
        return part.with_columns(pl.lit(None, pl.Boolean).alias("y")).explode("x")
```
and then you do:

```python
(
    df.with_row_index("i")  # row index so the final sort can restore the original order
    .group_by(pl.col("y").is_null(), maintain_order=True)
    .map_groups(part_explode, schema={"i": pl.UInt32, "x": pl.Int64, "y": pl.Boolean})
    .sort("i")
    .drop("i")
    .collect()
)
```
I don't think `map_groups` will parallelize the parts, since it relies on executing a Python function, so don't use this approach unless you're starting from lazy and don't have the memory to materialize first.
Performance
Setting up with

```python
import polars as pl
import numpy as np

n = 1_000_000
df = pl.DataFrame({
    'x': np.random.randint(0, 10, n),
    'y': np.random.randint(0, 2, n),
    'group': np.random.randint(0, 100_000, n),
}).with_columns(
    pl.col('y').cast(pl.Boolean)
).group_by('group').agg('x', 'y').with_columns(
    y=pl.when(pl.col('group').mod(20) == 0).then(pl.lit(None)).otherwise('y')
).drop('group')
```
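This leaves `df` with one row per group value that was drawn (roughly 100,000 of them), with `y` nulled out for every 20th group. A quick sanity check of the resulting shape:

```python
print(df.schema)  # x: List(Int64), y: List(Boolean) (exact repr varies by Polars version)
print(df.height)  # ~100_000, one row per distinct group value
```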
and then the tests:

```python
%%timeit
(
    df
    .with_columns(
        pl.col("y").fill_null(
            pl.lit(None, dtype=pl.Boolean).repeat_by(pl.col("x").list.len())
        )
    )
    .explode('x', 'y')
)
```

31.7 ms ± 5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs the concat approach from above:

84.1 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
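As a sanity check that the two approaches agree, here's a sketch of a comparison using `polars.testing.assert_frame_equal` (the names `fast` and `slow` are mine, not from the answer):

```python
from polars.testing import assert_frame_equal

# repeat_by approach
fast = df.with_columns(
    pl.col("y").fill_null(
        pl.lit(None, dtype=pl.Boolean).repeat_by(pl.col("x").list.len())
    )
).explode("x", "y")

# concat-of-parts approach, same expression as at the top of the answer
slow = pl.concat(
    [
        part.lazy().select("i", "x", pl.lit(None, pl.Boolean).alias("y")).explode("x")
        if isnull[0]
        else part.lazy().explode("x", "y")
        for isnull, part in df.with_row_index("i").group_by(
            pl.col("y").is_null(), maintain_order=True
        )
    ]
).sort("i").drop("i").collect()

# sort("i") isn't guaranteed stable within a row's exploded elements,
# so compare without relying on exact row order
assert_frame_equal(fast, slow, check_row_order=False)
```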