Fast apply of a function to Polars Dataframe
What are the fastest ways to apply functions to Polars DataFrames and LazyFrames (pl.DataFrame and pl.LazyFrame)? This question is piggy-backing off Apply Function to all columns of a Polars-DataFrame

I am trying to concat all columns and hash the value using hashlib from the Python standard library. The function I am using is below:

import hashlib

def hash_row(row):
    # Note: PYTHONHASHSEED only affects Python's built-in hash(), and only
    # when set before the interpreter starts. hashlib.sha256 is
    # deterministic regardless, so no seed is needed here.
    row = str(row).encode('utf-8')
    return hashlib.sha256(row).hexdigest()

However, this function requires a string as input, which means it must be applied to every cell within a pl.Series. With a small amount of data this should be okay, but when we have closer to 100m rows it becomes very problematic. The question for this thread: how can we apply such a function in the most performant way across an entire Polars Series?

Pandas

Pandas offers a few options for creating new columns, and some are more performant than others.

df['new_col'] = df['some_col'] * 100 # vectorized calls

Another option is to create custom functions for row-wise operations.

def apply_func(row):
   return row['some_col'] + row['another_col']

df['new_col'] = df.apply(apply_func, axis=1) # row-wise apply (one Python call per row)

From my experience, the fastest way is to create numpy vectorized solutions.

import numpy as np

def np_func(some_col, another_col):
   return some_col + another_col

vec_func = np.vectorize(np_func)

df['new_col'] = vec_func(df['some_col'].values, df['another_col'].values)
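It is worth noting that np.vectorize is essentially a convenience wrapper around a Python loop. For arithmetic like the above, operating directly on the underlying arrays is faster still; a minimal sketch (with made-up column values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"some_col": [1, 2, 3], "another_col": [10, 20, 30]})

# Operating on the underlying arrays runs in compiled code,
# with no per-row Python function calls:
df["new_col"] = df["some_col"].to_numpy() + df["another_col"].to_numpy()
```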

Polars

What is the best solution for Polars?

Ascension answered 9/6, 2022 at 16:17 Comment(0)

Let's start with this data of various types:

import polars as pl

df = pl.DataFrame(
    {
        "col_int": [1, 2, 3, 4],
        "col_float": [10.0, 20, 30, 40],
        "col_bool": [True, False, True, False],
        "col_str": ["2020-01-01"] * 4
    }
).with_columns(pl.col("col_str").str.to_date().alias("col_date"))
df
shape: (4, 5)
┌─────────┬───────────┬──────────┬────────────┬────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str    ┆ col_date   │
│ ---     ┆ ---       ┆ ---      ┆ ---        ┆ ---        │
│ i64     ┆ f64       ┆ bool     ┆ str        ┆ date       │
╞═════════╪═══════════╪══════════╪════════════╪════════════╡
│ 1       ┆ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 │
│ 2       ┆ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
│ 3       ┆ 30.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 │
│ 4       ┆ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┘

Polars: DataFrame.hash_rows

I should first point out that Polars itself has a hash_rows function that will hash the rows of a DataFrame, without first needing to cast each column to a string.

df.hash_rows()
shape: (4,)
Series: '' [u64]
[
        16206777682454905786
        7386261536140378310
        3777361287274669406
        675264002871508281
]

If you find this acceptable, then this would be the most performant solution. You can cast the resulting unsigned int to a string if you need to. Note: hash_rows is available only on a DataFrame, not a LazyFrame.

Using polars.concat_str and map_elements

If you need to use your own hashing solution, then I recommend using the polars.concat_str function to concatenate the values in each row to a string. From the documentation:

polars.concat_str(exprs: Union[Sequence[Union[polars.internals.expr.Expr, str]], polars.internals.expr.Expr], sep: str = '') → polars.internals.expr.Expr

Horizontally Concat Utf8 Series in linear time. Non utf8 columns are cast to utf8.

So, for example, here is the resulting concatenation on our dataset.

df.with_columns(
    pl.concat_str(pl.all()).alias('concatenated_cols')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str    ┆ col_date   ┆ concatenated_cols              │
│ ---     ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---                            │
│ i64     ┆ f64       ┆ bool     ┆ str        ┆ date       ┆ str                            │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪════════════════════════════════╡
│ 1       ┆ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ 110.0true2020-01-012020-01-01  │
│ 2       ┆ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ 220.0false2020-01-012020-01-01 │
│ 3       ┆ 30.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ 330.0true2020-01-012020-01-01  │
│ 4       ┆ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ 440.0false2020-01-012020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┴────────────────────────────────┘

Taking the next step and using the map_elements method and your function would yield:

df.with_columns(
    pl.concat_str(pl.all()).map_elements(hash_row).alias('hash')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬─────────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str    ┆ col_date   ┆ hash                                │
│ ---     ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---                                 │
│ i64     ┆ f64       ┆ bool     ┆ str        ┆ date       ┆ str                                 │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪═════════════════════════════════════╡
│ 1       ┆ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ 1826eb9c6aeb0abcdd2999a76eee576e... │
│ 2       ┆ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ ea50f5b11957bfc92b5ab7545b3ac12c... │
│ 3       ┆ 30.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ eef039d8dedadcc282d6fa9473e071e8... │
│ 4       ┆ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ dcc5c57e0b5fdf15320a84c6839b0e3d... │
└─────────┴───────────┴──────────┴────────────┴────────────┴─────────────────────────────────────┘

Please remember that any time Polars calls external libraries or runs Python bytecode, you are subject to the Python GIL, which means single-threaded performance - no matter how you code it. From the Polars User Guide section Do Not Kill The Parallelization!:

We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.

This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keep in mind that the bytecode and GIL costs have to be paid.

To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.

Tabitha answered 9/6, 2022 at 17:39 Comment(1)
apply does not work for me; I get a deprecation message saying it was replaced by map_elements. – Pyrotechnics

Thanks cbilot - I was unaware of hash_rows. Your solution is nearly identical to what I wrote. One thing I have to mention:

concat_str did not work for me when there are nulls in the series. I had to cast to String and fill_null first; then I am able to concat_str and apply hash_row on the result.

def set_datatypes_and_replace_nulls(df, idcol="factset_person_id"):
    return (
        df
        # cast everything to String first so fill_null("") is valid for every column
        .with_columns(pl.all().cast(pl.String, strict=False))
        .with_columns(pl.all().fill_null(""))
        .with_columns(
            pl.concat_str(pl.exclude(idcol)).alias("concatstr")
        )
    )

def hash_concat(df):
    return (
        df
        .with_columns(
            pl.col("concatstr").map_elements(hash_row).alias('sha256hash')
        )
    )

After this we need to aggregate the hashes by ID.


df = (
    df
    .pipe(set_datatypes_and_replace_nulls)
    .pipe(hash_concat)
)

# something like the below...
part1 = (
    df.lazy()
    .group_by("id")
    .agg(
        pl.col("concatstr").unique()  # agg of unique() already yields a list per group
    )
    .collect()
)

Thanks for improving with pl.hash_rows.

Ascension answered 9/6, 2022 at 19:30 Comment(2)
Wow. Didn’t know about the issue with nulls when using concat_str. Thanks for sharing that. – Tabitha
Np. We should also pass PYTHONHASHSEED as a seed parameter to pl.hash_rows(SEED1, ..., SEED4) or pl.Expr.hash(SEED_PARAMETER); this was mentioned by Ritchie in a GitHub issue. Hashing in Polars uses ahash under the hood. – Ascension

© 2022 - 2025 — McMap. All rights reserved.