I have a Polars dataframe with a lot of duplicate data I would like to consolidate.
Input:
shape: (3, 2)
┌─────┬──────┐
│ id ┆ data │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ a │
│ 1 ┆ b │
│ 1 ┆ c │
└─────┴──────┘
My current (non-working) solution:
df = pl.DataFrame({'id': [1, 1, 1], 'data': ['a', 'b', 'c']})
df = df.join(df.select('id', 'data'), on='id')
Output:
shape: (9, 3)
┌─────┬──────┬────────────┐
│ id ┆ data ┆ data_right │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪══════╪════════════╡
│ 1 ┆ a ┆ a │
│ 1 ┆ b ┆ a │
│ 1 ┆ c ┆ a │
│ 1 ┆ a ┆ b │
│ 1 ┆ b ┆ b │
│ 1 ┆ c ┆ b │
│ 1 ┆ a ┆ c │
│ 1 ┆ b ┆ c │
│ 1 ┆ c ┆ c │
└─────┴──────┴────────────┘
Desired output:
shape: (1, 4)
┌─────┬────────┬────────┬────────┐
│ id ┆ data_1 ┆ data_2 ┆ data_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═════╪════════╪════════╪════════╡
│ 1 ┆ a ┆ b ┆ c │
└─────┴────────┴────────┴────────┘
It seems like a self join would be the way to get a table with all the columns I want, but I'm unsure how to write a self join that would include multiple columns instead of just a bunch of rows with only two, and looping through self joins does not seem like the correct thing to do as it quickly balloons in size. This isn't specifically a Polars problem, but I am working in Python-Polars
to_struct
set(n_field_strategy=“max_width”, fields=lambda x: f"data_{x}")
– Geralyngeraniaceous