Explode multiple columns with different lengths
Asked Answered
R

2

7

I have a dataframe like:

data = {
    "a": [[1], [2], [3, 4], [5, 6, 7]],
    "b": [[], [8], [9, 10], [11, 12]],
}
df = pl.DataFrame(data)
"""
┌───────────┬───────────┐
│ a         ┆ b         │
│ ---       ┆ ---       │
│ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╡
│ [1]       ┆ []        │
│ [2]       ┆ [8]       │
│ [3, 4]    ┆ [9, 10]   │
│ [5, 6, 7] ┆ [11, 12]  │
└───────────┴───────────┘
"""

Each pair of lists may not have the same length, and I want to "truncate" the explode to the shortest of both lists:

"""
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
"""

I was thinking that maybe I'd have to fill the shortest of both lists with None to match both lengths, and then drop_nulls. But I was wondering if there was a more direct approach to this?

Ronn answered 15/9 at 17:45 Comment(1)
If there was a null in a list, what would happen? e.g. if [3, 4] was changed to [None, 4] in the example? Are there other non-list columns in the frame?Burton
A
6

Here's one approach:

min_length = pl.min_horizontal(pl.col('a', 'b').list.len())

out = (df.filter(min_length != 0)
       .with_columns(
           pl.col('a', 'b').list.head(min_length)
           )
       .explode('a', 'b')
       )

Output:

shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘

Explanation

Adown answered 15/9 at 18:11 Comment(3)
Why is the filter nexessary?Pneumonoultramicroscopicsilicovolcanoconiosis
@Hericks: the first row has an empty list ([]), so without the filter that would lead to a row with [ null, null ], which the OP doesn't want.Adown
Purely a matter of preference but you could trade the filter at the beginning with drop_nulls at the end (assumes there aren't other nulls you want to keep). Also, the .explode could go at the end of the expr instead of the df that way you don't have to type the 'a','b' again.Director
H
2

I can't squeeze it in the comments, so I'll just put another answer here as it's a bit more generic for given example:

(
    df.with_columns(
        (x := pl.all().list).head(
            pl.min_horizontal(x.len())
        )
    ).explode("*")
    .drop_nulls()
)
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
Hagiolatry answered 16/9 at 6:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.