Writing expression in polars-lazy in rust

I need to write my own expression in polars_lazy. From reading the source code, my understanding is that I need to write a function that returns Expr::Function. The problem is that constructing that variant requires a value of type FunctionOptions. The struct itself is public, but its fields are pub(crate), so code outside the crate cannot construct one. Is there a way around this?

Parthena asked 16/12, 2021 at 23:35 Comment(0)

Personally I think the Rust API for polars is not well documented enough to really use yet. Although the other answer and comments mention apply and map, they don't mention how or the trade-offs. I hope this answer prompts others to correct me with the "right" way to do things.

So first, here's how to use apply on a lazy dataframe. Lazy dataframes don't take apply directly as a method the way eager ones do, so you go through with_column and an expression. This version overwrites the column in place:

// not sure how you'd find this type easily from apply documentation
let o = GetOutput::from_type(DataType::UInt32);
// this mutates two in place
let lf = lf.with_column(col("two").apply(str_to_len, o));

And here's how to use it while not mutating the source column and adding a new output column instead:

let o = GetOutput::from_type(DataType::UInt32);
// this adds new column len, two is unchanged
let lf = lf.with_column(col("two").alias("len").apply(str_to_len, o));

With the str_to_len looking like:

fn str_to_len(str_val: Series) -> Result<Series> {
    let x = str_val
        .utf8()
        .unwrap()
        .into_iter()
        // your actual custom function would be in this map
        .map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
        .collect::<UInt32Chunked>();
    Ok(x.into_series())
}

Note that it takes Series by value rather than &Series, and wraps the return value in Result.

With a regular (non-lazy) dataframe, apply still mutates but doesn't require with_column:

df.apply("two", str_to_len).expect("applied");

Whereas eager/non-lazy's with_column doesn't require apply:

// str_to_len builds the new column and names it too
df.with_column(str_to_len(df.column("two").expect("has two"))).expect("with_column");

And here str_to_len has a slightly different signature:

fn str_to_len(str_val: &Series) -> Series {
    let mut x = str_val
        .utf8()
        .unwrap()
        .into_iter()
        .map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
        .collect::<UInt32Chunked>();
    // NB. this is naming the chunked array, before we even get to a series
    x.rename("len");
    x.into_series()
}

I know there are reasons to have lazy and eager operate differently, but I wish the Rust documentation made this easier to figure out.

Interne answered 27/3, 2022 at 16:19 Comment(4)
Great examples, thanks! Is there a need, or any functionality, to parallelize apply/map? Donative
i.e. you could add use polars::export::rayon::iter::ParallelIterator; and then replace .into_iter() with .par_iter() ... do you think such parallelization would be beneficial for performance? Donative
Could you explain why you use map twice in map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))? Why not map(|opt_name: Option<&str>| opt_name.unwrap().len())? Dumbhead
Re @Dumbhead - I don't want to panic by calling unwrap on a None, so the inner map protects me from that. Could use filter_map instead too. @Anatoly Bugakov - whether parallelization helps depends on what work your function is doing. Interne
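To make the map-inside-map point from these comments concrete, here is a small sketch using only the standard library. It does not touch polars at all: a slice of Option<&str> stands in for a nullable string column, and the function names are made up for illustration.

```rust
/// Null-preserving: one output slot per input slot, `None` stays `None`.
/// Note that `str::len` counts bytes, not characters.
fn lengths_keep_nulls(values: &[Option<&str>]) -> Vec<Option<u32>> {
    values
        .iter()
        .copied()
        // outer map: walk the column; inner map: only touch non-null values
        .map(|opt| opt.map(|s| s.len() as u32))
        .collect()
}

/// Null-dropping: `filter_map` skips every `None`, so the result can be
/// shorter than the input.
fn lengths_drop_nulls(values: &[Option<&str>]) -> Vec<u32> {
    values
        .iter()
        .copied()
        .filter_map(|opt| opt.map(|s| s.len() as u32))
        .collect()
}

fn main() {
    let column = [Some("a"), None, Some("ccc")];
    println!("{:?}", lengths_keep_nulls(&column)); // [Some(1), None, Some(3)]
    println!("{:?}", lengths_drop_nulls(&column)); // [1, 3]
    // `opt.unwrap().len()` would panic on the `None` in the middle instead.
}
```

The first form keeps the output the same length as the input, which is usually what you want when the result has to line up with the other columns of a dataframe.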

I don't think you're meant to directly construct Exprs. Instead, you can use functions like polars_lazy::dsl::col() and polars_lazy::dsl::lit() to create expressions, then use methods on Expr to build up the expression. Several of those methods, such as map() and apply(), will give you an Expr::Function.
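As a rough illustration of why that works, here is a toy model in plain Rust. It is not the polars source: the module, the closure signature and the elementwise field are all invented for the sketch, and only the names Expr, FunctionOptions, col and map are borrowed from the question to keep it readable. The point is just that a public type with non-public fields can still be produced for you by a public method defined next to it.

```rust
// Toy model only: these types imitate the shape of an expression DSL and are
// NOT the real polars definitions.
mod toy_dsl {
    use std::sync::Arc;

    /// Type-erased user function stored inside the expression tree.
    pub type UserFn = Arc<dyn Fn(&str) -> String + Send + Sync>;

    /// Public type with a private field: code outside this module cannot
    /// write `FunctionOptions { .. }` itself.
    pub struct FunctionOptions {
        elementwise: bool,
    }

    impl FunctionOptions {
        pub fn is_elementwise(&self) -> bool {
            self.elementwise
        }
    }

    pub enum Expr {
        Column(String),
        Function {
            input: Box<Expr>,
            function: UserFn,
            options: FunctionOptions,
        },
    }

    /// Entry point, playing the role of `col()` in the answer.
    pub fn col(name: &str) -> Expr {
        Expr::Column(name.to_string())
    }

    impl Expr {
        /// Wraps `self` in a `Function` node; the options are filled in here,
        /// inside the module that is allowed to construct them.
        pub fn map<F>(self, f: F) -> Expr
        where
            F: Fn(&str) -> String + Send + Sync + 'static,
        {
            Expr::Function {
                input: Box::new(self),
                function: Arc::new(f),
                options: FunctionOptions { elementwise: true },
            }
        }

        /// Evaluates the tree against a single string value.
        pub fn eval(&self, value: &str) -> String {
            match self {
                Expr::Column(_) => value.to_string(),
                Expr::Function { input, function, .. } => {
                    function(input.eval(value).as_str())
                }
            }
        }
    }
}

fn main() {
    // The caller never names FunctionOptions; `map` builds the node.
    let expr = toy_dsl::col("two").map(|s| s.to_uppercase());
    if let toy_dsl::Expr::Function { options, .. } = &expr {
        println!("elementwise: {}", options.is_elementwise());
    }
    println!("{}", expr.eval("abc")); // ABC
}
```

In this sketch the caller only supplies a closure, and the library side decides how the options are set, which is the same division of labour the answer describes for map() and apply().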

Pedicab answered 17/12, 2021 at 8:57 Comment(3)
But I need an expression that is not available in polars_lazy. I have a utf8 column and need to do some custom massaging that cannot be obtained by combining existing expressions. Parthena
I have 'partially' solved this by using eager, but it would be better to do it in lazy. Parthena
As the answer correctly states, you can use the apply expression to apply a custom closure over your string data. Carleton
