How can I consolidate all rows with the same ID in Polars?
Asked Answered
L

3

5

I have a Polars dataframe with a lot of duplicate data I would like to consolidate.

Input:

shape: (3, 2)
┌─────┬──────┐
│ id  ┆ data │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ a    │
│ 1   ┆ b    │
│ 1   ┆ c    │
└─────┴──────┘

My current (non-working) solution:

df = pl.DataFrame({'id': [1, 1, 1], 'data': ['a', 'b', 'c']})
df = df.join(df.select('id', 'data'), on='id')

Output:

shape: (9, 3)
┌─────┬──────┬────────────┐
│ id  ┆ data ┆ data_right │
│ --- ┆ ---  ┆ ---        │
│ i64 ┆ str  ┆ str        │
╞═════╪══════╪════════════╡
│ 1   ┆ a    ┆ a          │
│ 1   ┆ b    ┆ a          │
│ 1   ┆ c    ┆ a          │
│ 1   ┆ a    ┆ b          │
│ 1   ┆ b    ┆ b          │
│ 1   ┆ c    ┆ b          │
│ 1   ┆ a    ┆ c          │
│ 1   ┆ b    ┆ c          │
│ 1   ┆ c    ┆ c          │
└─────┴──────┴────────────┘

Desired output:

shape: (1, 4)
┌─────┬────────┬────────┬────────┐
│ id  ┆ data_1 ┆ data_2 ┆ data_3 │
│ --- ┆ ---    ┆ ---    ┆ ---    │
│ str ┆ str    ┆ str    ┆ str    │
╞═════╪════════╪════════╪════════╡
│ 1   ┆ a      ┆ b      ┆ c      │
└─────┴────────┴────────┴────────┘

It seems like a self join would be the way to get a table with all the columns I want, but I'm unsure how to write a self join that would include multiple columns instead of just a bunch of rows with only two, and looping through self joins does not seem like the correct thing to do as it quickly balloons in size. This isn't specifically a Polars problem, but I am working in Python-Polars

Leesen answered 28/8 at 18:17 Comment(0)
K
5

The following gives you a df in the format you want:

import polars as pl

df = (
    pl.DataFrame({"id": [1, 1, 1], "data": ["a", "b", "c"]})
    .group_by("id")
    .agg(pl.col("data"))
    .with_columns(structure=pl.col("data").list.to_struct())
    .unnest("structure")
    .drop("data")
)

print(df)
"""
┌─────┬─────────┬─────────┬─────────┐
│ id  ┆ field_0 ┆ field_1 ┆ field_2 │
│ --- ┆ ---     ┆ ---     ┆ ---     │
│ i64 ┆ str     ┆ str     ┆ str     │
╞═════╪═════════╪═════════╪═════════╡
│ 1   ┆ a       ┆ b       ┆ c       │
└─────┴─────────┴─────────┴─────────┘
"""
Kingsley answered 28/8 at 19:9 Comment(2)
This is basically as good as it gets. The only addition is in to_struct set (n_field_strategy=“max_width”, fields=lambda x: f"data_{x}")Geralyngeraniaceous
also do maintain_order=True in the group_byGeralyngeraniaceous
R
2

You can use with_columns and group_by on the 'id' column and aggregate the 'data' column into lists, you can now create new columns by selecting elements from these lists using list.get().

Finally, you can use n_unique to count unique values.

# df 
df = pl.DataFrame({"id": [1, 1, 1], "data": ["a", "b", "c"]})

# group by 'ids' and aggregate 'data' into lists
grouped = df.group_by("id").agg(pl.col("data").alias("data_list"))

# create columns by selecting elements from the list
result = grouped.with_columns(
    [
        pl.col("data_list").list.get(i).alias(f"data_{i+1}")
        for i in range(df["data"].n_unique())
    ]
).drop("data_list")

print(result)
┌─────┬────────┬────────┬────────┐
│ id  ┆ data_1 ┆ data_2 ┆ data_3 │
│ --- ┆ ---    ┆ ---    ┆ ---    │
│ i64 ┆ str    ┆ str    ┆ str    │
╞═════╪════════╪════════╪════════╡
│ 1   ┆ a      ┆ b      ┆ c      │
└─────┴────────┴────────┴────────┘
Restrictive answered 28/8 at 19:38 Comment(0)
S
1

The "long to wide" transformation does have a name: .pivot()

You can generate the new column names using a "row number" per group.

df.with_columns(
   row_number = pl.int_range(pl.len()).over("id"),
   name = pl.format("data_{}", pl.int_range(pl.len()).over("id") + 1)
)
shape: (5, 4)
┌─────┬──────┬────────────┬────────┐
│ id  ┆ data ┆ row_number ┆ name   │
│ --- ┆ ---  ┆ ---        ┆ ---    │
│ i64 ┆ str  ┆ i64        ┆ str    │
╞═════╪══════╪════════════╪════════╡
│ 1   ┆ a    ┆ 0          ┆ data_1 │
│ 1   ┆ b    ┆ 1          ┆ data_2 │
│ 1   ┆ c    ┆ 2          ┆ data_3 │
│ 2   ┆ d    ┆ 0          ┆ data_1 │
│ 2   ┆ e    ┆ 1          ┆ data_2 │
└─────┴──────┴────────────┴────────┘

We use pl.format() to prefix with data_ and add 1.

(df.with_columns(name = pl.format("data_{}", pl.int_range(pl.len()).over("id") + 1))
   .pivot("name", index="id")
)
shape: (2, 4)
┌─────┬────────┬────────┬────────┐
│ id  ┆ data_1 ┆ data_2 ┆ data_3 │
│ --- ┆ ---    ┆ ---    ┆ ---    │
│ i64 ┆ str    ┆ str    ┆ str    │
╞═════╪════════╪════════╪════════╡
│ 1   ┆ a      ┆ b      ┆ c      │
│ 2   ┆ d      ┆ e      ┆ null   │
└─────┴────────┴────────┴────────┘
Seagraves answered 29/8 at 7:54 Comment(1)
if you change your group_row to pl.lit("data_")+pl.int_range(1,pl.len()+1).over('id').cast(pl.String) then the output is the flavor they're looking for.Geralyngeraniaceous

© 2022 - 2024 — McMap. All rights reserved.