Python-polars: how to multiply each element in a list with a value in a different column?
Asked Answered
G

3

6

I have a dataframe with a certain number of groups, containing a weight column and a list of values, which can be of arbitrary length, so for example:

df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8],[0.5, 0.5, 0.8], [0.7, 0.9]]
    }
)
┌────────┬────────┬─────────────────┐
│ Group  ┆ Weight ┆ Vals            │
│ ---    ┆ ---    ┆ ---             │
│ str    ┆ f64    ┆ list[f64]       │
╞════════╪════════╪═════════════════╡
│ Group1 ┆ 100.0  ┆ [0.5, 0.5, 0.8] │
│ Group2 ┆ 200.0  ┆ [0.5, 0.5, 0.8] │
│ Group3 ┆ 300.0  ┆ [0.7, 0.9]      │
└────────┴────────┴─────────────────┘

My goal is to calculate a 'weighted' column, which would be the multiple of each item in the values list with the value in the weight column:

┌────────┬────────┬─────────────────┬─────────────────┐
│ Group  ┆ Weight ┆ Vals            ┆ Weighted        │
│ ---    ┆ ---    ┆ ---             ┆ ---             │
│ str    ┆ f64    ┆ list[f64]       ┆ list[i64]       │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Group1 ┆ 100.0  ┆ [0.5, 0.5, 0.8] ┆ [50, 50, 80]    │
│ Group2 ┆ 200.0  ┆ [0.5, 0.5, 0.8] ┆ [100, 100, 160] │
│ Group3 ┆ 300.0  ┆ [0.7, 0.9]      ┆ [210, 270]      │
└────────┴────────┴─────────────────┴─────────────────┘

I've tried a few different things:

df.with_columns(
    pl.col("Vals").list.eval(pl.element() * 3).alias("Weight1"), #Multiplying with literal works
    pl.col("Vals").list.eval(pl.element() * pl.col("Weight")).alias("Weight2"), #Does not work
    pl.col("Vals").list.eval(pl.element() * pl.col("Unknown")).alias("Weight3"), #Unknown columns give same value
    pl.col("Vals").list.eval(pl.col("Vals") * pl.col("Weight")).alias("Weight4"), #Same effect
    # pl.col('Vals') * 3 -> gives an error
)
┌────────┬────────┬────────────┬────────────┬──────────────┬──────────────┬────────────────────┐
│ Group  ┆ Weight ┆ Vals       ┆ Weight1    ┆ Weight2      ┆ Weight3      ┆ Weight4            │
│ ---    ┆ ---    ┆ ---        ┆ ---        ┆ ---          ┆ ---          ┆ ---                │
│ str    ┆ f64    ┆ list[f64]  ┆ list[f64]  ┆ list[f64]    ┆ list[f64]    ┆ list[f64]          │
╞════════╪════════╪════════════╪════════════╪══════════════╪══════════════╪════════════════════╡
│ Group1 ┆ 100.0  ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│        ┆        ┆ 0.8]       ┆ 2.4]       ┆ 0.64]        ┆ 0.64]        ┆                    │
│ Group2 ┆ 200.0  ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│        ┆        ┆ 0.8]       ┆ 2.4]       ┆ 0.64]        ┆ 0.64]        ┆                    │
│ Group3 ┆ 300.0  ┆ [0.7, 0.9] ┆ [2.1, 2.7] ┆ [0.49, 0.81] ┆ [0.49, 0.81] ┆ [0.49, 0.81]       │
└────────┴────────┴────────────┴────────────┴──────────────┴──────────────┴────────────────────┘

Unless I'm not understanding it correctly, it seems like you're unable to access columns outside of the list from within the eval function. Perhaps there might be a way to use list comprehension within the statement, but that doesn't really seem like a neat solution.

What would be the recommended approach here? Any help would be appreciated!

Girand answered 9/11, 2022 at 8:55 Comment(0)
A
3

EDIT - Polars update:

As of the latest version of Polars, this is now a the correct syntax:

df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8],[0.5, 0.5, 0.8], [0.7, 0.9]]
    }
)

(df
    .explode('Vals')
    .with_columns(Weighted = pl.col('Weight')*pl.col('Vals'))
    .group_by('Group')
    .agg(
        pl.col('Weight').first(),                                                                                                             
        pl.col('Vals'),
        pl.col('Weighted')
        )                                                                                                 
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group  ┆ Weight ┆ Vals            ┆ Weighted              │
│ ---    ┆ ---    ┆ ---             ┆ ---                   │
│ str    ┆ f64    ┆ list[f64]       ┆ list[f64]             │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group3 ┆ 300.0  ┆ [0.7, 0.9]      ┆ [210.0, 270.0]        │
│ Group1 ┆ 100.0  ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0]    │
│ Group2 ┆ 200.0  ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
└────────┴────────┴─────────────────┴───────────────────────┘
Allege answered 9/11, 2022 at 15:35 Comment(3)
It does, thank you! Fyi, between this one and @alexp 's, this solution seems to be around 3x faster on my real data.Girand
The behavior of agg has changed. You should no longer call .list() on the expressions within the agg call. @Allege could you update your answer?Grantham
If the group column wasn't present here, would I need to introduce a separate dummy column to use for the groupby operation afterward? No smart way to circumvent this?Marva
J
2

So you can solve this in two ways. To be honest I don't know which one is better, I haven't tested for performance or RAM usage.

Algorithm 1

We can put both columns in a struct and then apply a custom function on them. (Which is explained in the Documentation here)

import polars as pl
import numpy as np

def weighted_list(ls, weight):
    return np.array(ls) * weight

df.with_columns(
    pl.struct("Weight", "Vals").map_elements(
        lambda x: weighted_list(x["Vals"], x["Weight"]), 
        return_dtype=pl.List(pl.Float64)
    )
)

Algorithm 2

So in this case you have to be careful that your columns "Group", "Weight" are unique. So if you have two entries for example with Group3 and Weight 300.

(df.explode("Vals")
   .with_columns((pl.col("Vals") * pl.col("Weight")).alias("Weighted"))
   .group_by("Group", "Weight")
   .agg("Vals", "Weighted")
)
Jillene answered 9/11, 2022 at 13:2 Comment(0)
R
0

updated. Since version 1.10.0 arithmetic operations between lists and scalars are supported:

df.with_columns(
    Weighted = pl.col.Vals * pl.col.Weight
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group  ┆ Weight ┆ Vals            ┆ Weighted              │
│ ---    ┆ ---    ┆ ---             ┆ ---                   │
│ str    ┆ f64    ┆ list[f64]       ┆ list[f64]             │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group1 ┆ 100.0  ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0]    │
│ Group2 ┆ 200.0  ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
│ Group3 ┆ 300.0  ┆ [0.7, 0.9]      ┆ [210.0, 270.0]        │
└────────┴────────┴─────────────────┴───────────────────────┘

outdated. Since version 1.8.0 arithmetic operations on lists are supported. You still need to convert scalar column into list column of appropriate length with pl.Expr.repeat_by() and pl.Expr.list.len().

df.with_columns(
    Weighted = pl.col.Vals * pl.col.Weight.repeat_by(pl.col.Vals.list.len())
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group  ┆ Weight ┆ Vals            ┆ Weighted              │
│ ---    ┆ ---    ┆ ---             ┆ ---                   │
│ str    ┆ f64    ┆ list[f64]       ┆ list[f64]             │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group1 ┆ 100.0  ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0]    │
│ Group2 ┆ 200.0  ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
│ Group3 ┆ 300.0  ┆ [0.7, 0.9]      ┆ [210.0, 270.0]        │
└────────┴────────┴─────────────────┴───────────────────────┘

outdated. Unfortunately, operations on lists are not yet supported, but you can work around it using Struct and avoid exploding and imploding the data. You can create a list out of column a and then get the result by multiplying Structs and converting it back to List.

(
    df
    .with_columns(Weighted = pl.col.Weight.repeat_by(pl.col.Vals.list.len()))
    .with_columns(pl.col.Weighted.list.to_struct() * pl.col.Vals.list.to_struct())
    .with_columns(Weighted = pl.concat_list(pl.col.Weighted.struct.field("*")))
    .with_columns(pl.col.Weighted.list.drop_nulls().cast(pl.List(pl.Int64)))
)
┌────────┬────────┬─────────────────┬─────────────────┐
│ Group  ┆ Weight ┆ Vals            ┆ Weighted        │
│ ---    ┆ ---    ┆ ---             ┆ ---             │
│ str    ┆ f64    ┆ list[f64]       ┆ list[i64]       │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Group1 ┆ 100.0  ┆ [0.5, 0.5, 0.8] ┆ [50, 50, 80]    │
│ Group2 ┆ 200.0  ┆ [0.5, 0.5, 0.8] ┆ [100, 100, 160] │
│ Group3 ┆ 300.0  ┆ [0.7, 0.9]      ┆ [210, 270]      │
└────────┴────────┴─────────────────┴─────────────────┘
Rod answered 21/9 at 19:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.