I need to write my own expression in polars_lazy. From reading the source code, my understanding is that I need to write a function that returns Expr::Function. The problem is that constructing a value of this type requires a FunctionOptions value. The caveat is that this struct is public but its members are pub(crate), so it cannot be constructed outside of the crate. Are there ways around this?
Personally I think the Rust API for polars is not documented well enough to really use yet. Although the other answer and comments mention apply and map, they don't mention how to use them or what the trade-offs are. I hope this answer prompts others to correct me with the "right" way to do things.
So first, here's how to use apply on a lazy dataframe, mutating a column in place, even though lazy dataframes don't take apply directly as a method the way eager ones do:
// not sure how you'd find this type easily from apply documentation
let o = GetOutput::from_type(DataType::UInt32);
// this mutates two in place
let lf = lf.with_column(col("two").apply(str_to_len, o));
And here's how to use it while not mutating the source column and adding a new output column instead:
let o = GetOutput::from_type(DataType::UInt32);
// this adds new column len, two is unchanged
let lf = lf.with_column(col("two").alias("len").apply(str_to_len, o));
With str_to_len looking like:
fn str_to_len(str_val: Series) -> Result<Series> {
    let x = str_val
        .utf8()
        .unwrap()
        .into_iter()
        // your actual custom function would be in this map
        .map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
        .collect::<UInt32Chunked>();
    Ok(x.into_series())
}
Note that it takes Series rather than &Series and wraps the return value in Result.
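To show what the nested map inside str_to_len buys you without pulling in polars, here is a plain-std sketch. The slice of Option<&str> only stands in for a string column that contains nulls, and the helper names lengths_keep_nulls and lengths_drop_nulls are made up for this example.

```rust
/// Null-preserving version: mirrors the `opt.map(...)` used in str_to_len.
/// A missing input stays a missing output, so the row count is unchanged.
fn lengths_keep_nulls(values: &[Option<&str>]) -> Vec<Option<u32>> {
    values
        .iter()
        .copied()
        .map(|opt: Option<&str>| opt.map(|s: &str| s.len() as u32))
        .collect()
}

/// filter_map version: missing inputs are dropped, so the output can be
/// shorter than the input.
fn lengths_drop_nulls(values: &[Option<&str>]) -> Vec<u32> {
    values
        .iter()
        .copied()
        .filter_map(|opt: Option<&str>| opt.map(|s: &str| s.len() as u32))
        .collect()
}

fn main() {
    // Stand-in for a column with one null in the middle.
    let column = [Some("a"), None, Some("ccc")];

    println!("{:?}", lengths_keep_nulls(&column));
    println!("{:?}", lengths_drop_nulls(&column));

    // Calling `opt.unwrap().len()` instead would panic on the `None` entry.
}
```

The null-preserving form is the one that lines up row-for-row with the source column; the filter_map form changes the length, which matters if the result is meant to sit next to the original column.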
With a regular (non-lazy) dataframe, apply still mutates but doesn't require with_column:
df.apply("two", str_to_len).expect("applied");
Whereas eager/non-lazy's with_column doesn't require apply:
// the fn that builds the column also names it
df.with_column(str_to_len(df.column("two").expect("has two"))).expect("with_column");
And str_to_len has a slightly different signature:
fn str_to_len(str_val: &Series) -> Series {
    let mut x = str_val
        .utf8()
        .unwrap()
        .into_iter()
        .map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
        .collect::<UInt32Chunked>();
    // NB. this is naming the chunked array, before we even get to a series
    x.rename("len");
    x.into_series()
}
I know there are reasons to have lazy and eager operate differently, but I wish the Rust documentation made this easier to figure out.
use polars::export::rayon::iter::ParallelIterator; and then replace .into_iter() with .par_iter() ... do you think such parallelisation would be beneficial for performance? – Donative
Why map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))? Why not map(|opt_name: Option<&str>| opt_name.unwrap().len())? – Dumbhead
unwrap panics when unwrapping a None, so the inner map protects me from that. Could instead use filter_map too. @Anatoly Bugakov - what parallelization helps depends on what work your function is doing. – Interne
I don't think you're meant to directly construct Exprs. Instead, you can use functions like polars_lazy::dsl::col() and polars_lazy::dsl::lit() to create expressions, then use methods on Expr to build up the expression. Several of those methods, such as map() and apply(), will give you an Expr::Function.
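The general pattern can be sketched with a toy model. To be clear, none of the definitions below are polars' real ones: the dsl module, the collect_groups field, and the shape of Expr are all hypothetical, and a private field inside a module is used to imitate what pub(crate) does across a crate boundary. The point is only that the options struct can't be built from outside, yet the public builder functions still hand you a function expression.

```rust
// Toy model only: these are NOT polars' actual definitions.
mod dsl {
    pub struct FunctionOptions {
        // Private field: code outside `dsl` cannot write a struct literal,
        // much like pub(crate) fields seen from another crate.
        collect_groups: bool,
    }

    impl FunctionOptions {
        pub fn collect_groups(&self) -> bool {
            self.collect_groups
        }
    }

    pub enum Expr {
        Column(String),
        Function {
            input: Box<Expr>,
            function: fn(&str) -> u32,
            options: FunctionOptions,
        },
    }

    /// Public entry point, playing the role of col().
    pub fn col(name: &str) -> Expr {
        Expr::Column(name.to_string())
    }

    impl Expr {
        /// Plays the role of apply(): the library fills in the options.
        pub fn apply(self, function: fn(&str) -> u32) -> Expr {
            Expr::Function {
                input: Box::new(self),
                function,
                options: FunctionOptions { collect_groups: true },
            }
        }
    }
}

/// Hypothetical user function to wrap in an expression.
fn str_len(s: &str) -> u32 {
    s.len() as u32
}

/// Renders an expression tree as text so the result can be inspected.
fn describe(expr: &dsl::Expr) -> String {
    match expr {
        dsl::Expr::Column(name) => format!("column {}", name),
        dsl::Expr::Function { input, function, options } => format!(
            "function over ({}), f(\"abc\") = {}, collect_groups = {}",
            describe(input),
            (*function)("abc"),
            options.collect_groups()
        ),
    }
}

fn main() {
    // Built entirely through the public API; no FunctionOptions literal needed.
    let expr = dsl::col("two").apply(str_len);
    println!("{}", describe(&expr));
}
```

Writing a FunctionOptions struct literal in main would be rejected by the compiler because the field is private, which is the situation the question describes; going through col() and apply() sidesteps it.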