Let's start with this data of various types:
import polars as pl
df = pl.DataFrame(
{
"col_int": [1, 2, 3, 4],
"col_float": [10.0, 20, 30, 40],
"col_bool": [True, False, True, False],
"col_str": ["2020-01-01"] * 4
}
).with_columns(pl.col("col_str").str.to_date().alias("col_date"))
df
shape: (4, 5)
┌─────────┬───────────┬──────────┬────────────┬────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date │
╞═════════╪═══════════╪══════════╪════════════╪════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┘
Polars: DataFrame.hash_rows
I should first point out that Polars itself has a hash_rows
function that will hash the rows of a DataFrame, without first needing to cast each column to a string.
df.hash_rows()
shape: (4,)
Series: '' [u64]
[
16206777682454905786
7386261536140378310
3777361287274669406
675264002871508281
]
If you find this acceptable, then this would be the most performant solution. You can cast the resulting unsigned int to a string if you need to. Note: hash_rows
is available only on a DataFrame, not a LazyFrame.
Using polars.concat_str
and map_elements
If you need to use your own hashing solution, then I recommend using the polars.concat_str
function to concatenate the values in each row to a string. From the documentation:
polars.concat_str(exprs: Union[Sequence[Union[polars.internals.expr.Expr, str]], polars.internals.expr.Expr], sep: str = '') → polars.internals.expr.Expr
Horizontally Concat Utf8 Series in linear time. Non utf8 columns are cast to utf8.
So, for example, here is the resulting concatenation on our dataset.
df.with_columns(
pl.concat_str(pl.all()).alias('concatenated_cols')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ concatenated_cols │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 110.0true2020-01-012020-01-01 │
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 220.0false2020-01-012020-01-01 │
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 330.0true2020-01-012020-01-01 │
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 440.0false2020-01-012020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┴────────────────────────────────┘
Taking the next step and using the map_elements
method and your function would yield:
df.with_columns(
pl.concat_str(pl.all()).map_elements(hash_row).alias('hash')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬─────────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ hash │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪═════════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 1826eb9c6aeb0abcdd2999a76eee576e... │
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ ea50f5b11957bfc92b5ab7545b3ac12c... │
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ eef039d8dedadcc282d6fa9473e071e8... │
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ dcc5c57e0b5fdf15320a84c6839b0e3d... │
└─────────┴───────────┴──────────┴────────────┴────────────┴─────────────────────────────────────┘
Please remember that any time Polars calls external libraries or runs Python bytecode, you are subject to the Python GIL, which means single-threaded performance - no matter how you code it. From the Polars User Guide section Do Not Kill The Parallelization!:
We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.
This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keeping in mind bytecode and the GIL costs have to be paid.
To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.