How to get the value of a specified index number from the sorting of a column and fill it with null if missing?

Asked 14/8, 2024 at 3:1 Answered 14/8, 2024 at 7:13

import polars as pl

df = pl.DataFrame(
    {"name": list("abcdef"), "age": [21, 31, 32, 53, 45, 26], "country": list("AABBBC")}
)

df.group_by("country").agg(
    pl.col("name").sort_by("age").first().alias("age_sort_1"),
    pl.col("name").sort_by("age").get(2).alias("age_sort_2"),  # OutOfBoundsError: index out of bounds
    # pl.col("name").sort_by("age").arr.get(2, null_on_oob=True).alias("age_2"),
    # SchemaError: invalid series dtype: expected `FixedSizeList`, got `str`
    pl.col("name").sort_by("age").last().alias("age_sort_-1")
)

As shown in the code above, I want to get the name in each country whose age is in a specific order.

However, Expr.get does not provide the null_on_oob parameter. How to automatically fill in null when an out-of-bounds situation occurs?

In addition, the .arr.get method provides the null_on_oob parameter, but reports an error SchemaError: invalid series dtype: expected "FixedSizeList", got "str". I don’t know what this error refers to and how to solve it.

ps: The above code uses the repeated code pl.col("name").sort_by("age") many times. Is there a more concise method?

Fiann answered 14/8, 2024 at 3:1 Comment(0)

There's an open issue related to your question now - polars.Expr.take returns null if ComputeError: index out of bounds.

For now you can either use shift() (maintain_order = True just to make it more readable):

exp = pl.col.name.sort_by("age")

(
    df
    .group_by("country", maintain_order = True).agg(
        exp.first().alias("age_sort_1"),
        exp.shift(-2).first().alias("age_sort_2"),
        exp.last().alias("age_sort_-1"),
    )
)

┌─────────┬────────────┬────────────┬─────────────┐
│ country ┆ age_sort_1 ┆ age_sort_2 ┆ age_sort_-1 │
│ ---     ┆ ---        ┆ ---        ┆ ---         │
│ str     ┆ str        ┆ str        ┆ str         │
╞═════════╪════════════╪════════════╪═════════════╡
│ A       ┆ a          ┆ null       ┆ b           │
│ B       ┆ c          ┆ d          ┆ d           │
│ C       ┆ f          ┆ null       ┆ f           │
└─────────┴────────────┴────────────┴─────────────┘

Or, just aggregate your data into list, and then use .list.get() which allows you to use null_on_oob parameter:

(
    df
    .group_by("country").agg(
        pl.col.name.sort_by("age")
    )
    .with_columns(
        pl.col.name
        .list.get(i, null_on_oob = True).alias(f"age_sort_{c}")
        for i, c in [(0, 1), (2, 2), (-1, -1)]
    ).drop("name")
)

Alternatively you can use list.gather() to get the list of 3 elements you need, convert it to struct via list.to_struct() method and then unnest() it to columns:

(
    df
    .group_by("country").agg(
        pl.col.name.sort_by("age")
    )
    .with_columns(
        pl.col.name
        .list.gather([0,2,-1], null_on_oob = True)
        .list.to_struct(fields=["age_sort_1","age_sort_2","age_sort_-1"])
    ).unnest("name")
)

Danforth answered 14/8, 2024 at 7:13 Comment(0)

import polars as pl

df = pl.DataFrame(
    {"name": list("abcdef"), "age": [21, 31, 32, 53, 45, 26], "country": list("AABBBC")}
)

df.group_by("country").agg(
    pl.col("name").sort_by("age").first().alias("age_sort_1"),
    pl.col("name").sort_by("age").implode().list.get(2, null_on_oob=True).get(0).alias("age_sort_2"),  
    pl.col("name").sort_by("age").last().alias("age_sort_-1")
)

The above code can solve the problem.

But I don’t know why the return type of pl.col("name").sort_by("age").implode().list.get(2, null_on_oob=True) is list[str] instead of str and must perform .get(0) again to obtain the correct result.

Fiann answered 14/8, 2024 at 5:28 Comment(1)

github.com/pola-rs/polars/pull/6487. It is mentioned that: in .group_by it will be a list by default, so an additional .get(0) is needed. – Fiann 14/8, 2024 at 15:0

Recommended topics

Hot tags