See the performance comparison at the bottom.
I like the elegance and intuitiveness of the `repeat_by` approach, but I'm a glutton for punishment, so here's an approach that splits the data up by the condition and then puts it back together. It's worse than the simple approach but might be helpful for another operation/use case.
```python
pl.concat(
    [
        part.lazy().select("i", "x", pl.lit(None, pl.Boolean).alias("y")).explode("x")
        if isnull[0]
        else part.lazy().explode("x", "y")
        for isnull, part in df.with_row_index("i").group_by(
            pl.col("y").is_null(), maintain_order=True
        )
    ]
).sort("i").drop("i").collect()
```
This one adds a `with_row_index` so you can restore the original order; if order isn't important, you can remove it along with the subsequent `sort`/`drop`. It also turns the `part`s lazy and collects at the end, because when you concat multiple LazyFrames, Polars runs each of their plans in parallel. Again, if that isn't important, you can remove the two `.lazy()` calls and the `.collect()`.
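For instance, here's the stripped-down eager version that paragraph describes, run on a small toy frame (the data is hypothetical, just assuming the question's shape: list columns `x` and `y`, with `y` null on some rows):

```python
import polars as pl

toy = pl.DataFrame({
    "x": [[1, 2], [3], [4, 5, 6]],
    "y": [[True, False], None, [False, True, True]],
})

# No row index and no lazy/collect: each half keeps its internal order,
# but the null-y rows all end up together rather than in their original spots.
out = pl.concat(
    [
        part.select("x", pl.lit(None, pl.Boolean).alias("y")).explode("x")
        if isnull[0]
        else part.explode("x", "y")
        for isnull, part in toy.group_by(pl.col("y").is_null(), maintain_order=True)
    ]
)
# out.rows() == [(1, True), (2, False), (4, False), (5, True), (6, True), (3, None)]
```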
If you're starting from a LazyFrame, you can't use `group_by` as an iterator directly, but you can use `map_groups` to get the same effect. You have to make a function such as:
```python
def part_explode(part: pl.DataFrame) -> pl.DataFrame:
    # When y is null, list.len() (and hence the comparison) is null, which is falsy
    if part.select(
        pl.col("x").first().list.len() == pl.col("y").first().list.len()
    ).item():
        return part.explode("x", "y")
    else:
        # y is null: broadcast a null Boolean literal and explode x alone
        return part.with_columns(pl.lit(None, pl.Boolean).alias("y")).explode("x")
```
and then you do:

```python
(
    df.with_row_index("i")  # row index so the final sort can restore the original order
    .group_by(pl.col("y").is_null(), maintain_order=True)
    .map_groups(part_explode, schema={"i": pl.UInt32, "x": pl.Int64, "y": pl.Boolean})
    .sort("i")
    .drop("i")
    .collect()
)
```
I don't think `map_groups` will parallelize the parts, since it relies on executing a Python function, so don't use this approach unless you're starting from lazy and don't have the memory to materialize first.
Performance
Setting up with

```python
import polars as pl
import numpy as np

n = 1_000_000
df = pl.DataFrame({
    'x': np.random.randint(0, 10, n),
    'y': np.random.randint(0, 2, n),
    'group': np.random.randint(0, 100_000, n),
}).with_columns(
    pl.col('y').cast(pl.Boolean)
).group_by('group').agg('x', 'y').with_columns(
    y=pl.when(pl.col('group').mod(20) == 0).then(pl.lit(None)).otherwise('y')
).drop('group')
```
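This leaves `df` with one row per group value that was drawn (roughly 100,000 of them), with `y` nulled out for every 20th group. A quick sanity check of the resulting shape:

```python
print(df.schema)  # x: List(Int64), y: List(Boolean) (exact repr varies by Polars version)
print(df.height)  # ~100_000, one row per distinct group value
```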
and then the tests:

```python
%%timeit
(
    df
    .with_columns(
        pl.col("y").fill_null(
            pl.lit(None, dtype=pl.Boolean).repeat_by(pl.col("x").list.len())
        )
    )
    .explode('x', 'y')
)
```

31.7 ms ± 5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs the concat approach from above:

84.1 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
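As a sanity check that the two approaches agree, here's a sketch of a comparison using `polars.testing.assert_frame_equal` (the names `fast` and `slow` are mine, not from the answer):

```python
from polars.testing import assert_frame_equal

# repeat_by approach
fast = df.with_columns(
    pl.col("y").fill_null(
        pl.lit(None, dtype=pl.Boolean).repeat_by(pl.col("x").list.len())
    )
).explode("x", "y")

# concat-of-parts approach, same expression as at the top of the answer
slow = pl.concat(
    [
        part.lazy().select("i", "x", pl.lit(None, pl.Boolean).alias("y")).explode("x")
        if isnull[0]
        else part.lazy().explode("x", "y")
        for isnull, part in df.with_row_index("i").group_by(
            pl.col("y").is_null(), maintain_order=True
        )
    ]
).sort("i").drop("i").collect()

# sort("i") isn't guaranteed stable within a row's exploded elements,
# so compare without relying on exact row order
assert_frame_equal(fast, slow, check_row_order=False)
```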