Iterate over rows polars rust
P

2

7

I am trying to iterate over each row of a Polars DataFrame in Rust.

I found df.get, but the documentation says it is slow. I then tried df.column("col").get, but this seems to have the same problem.

What is the correct way to process each row of the dataframe? I need to upload it to a database and turn it into structs.

Protector answered 30/5, 2022 at 21:59 Comment(1)
I'm not sure there is a fast way to iterate rows without first transposing the dataframe; dataframes are columnar structures, so pulling all of the data together for a row means a lookup per column. – Norma
F
11

If you activate the rows feature in polars, you can try:

DataFrame::get_row and DataFrame::get_row_amortized.

The latter is preferred, as that reduces heap allocations by reusing the row buffer.

Anti-pattern

This will be slow. Asking for rows from columnar data storage incurs many cache misses and goes through several layers of indirection.
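To make the cost concrete, here is a minimal sketch of a columnar layout using plain Vecs. The Columns type and its methods are hypothetical stand-ins for illustration, not Polars types: reading one row has to touch every column's separate buffer, while a column-wise operation is a single linear pass.

```rust
/// A stand-in for a columnar table: one contiguous Vec per column.
/// (Hypothetical type for illustration; this is not a Polars type.)
pub struct Columns {
    pub foo: Vec<f64>,
    pub bar: Vec<f64>,
    pub ham: Vec<f64>,
}

impl Columns {
    /// Row access: one bounds-checked lookup into each column's own buffer.
    /// Returns None if `idx` is out of range for any column.
    pub fn row(&self, idx: usize) -> Option<[f64; 3]> {
        Some([*self.foo.get(idx)?, *self.bar.get(idx)?, *self.ham.get(idx)?])
    }

    /// Column access: a single linear pass over one contiguous buffer.
    pub fn sum_foo(&self) -> f64 {
        self.foo.iter().sum()
    }
}
```

Calling row(i) for every i jumps between three allocations per row, which is the access pattern the get_row methods are stuck with.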

Slightly better

What would be slightly better is using Rust iterators. This has less indirection than the get_row methods.

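// Rechunk so every column is backed by a single contiguous chunk
// (this requires `df` to be declared `mut`).
df.as_single_chunk_par();
// One iterator per column, each yielding AnyValue items.
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| s.iter()).collect::<Vec<_>>();

for _row in 0..df.height() {
    for iter in &mut iters {
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}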

If every column in your DataFrame has the same data type, you should downcast each Series to a ChunkedArray; this speeds up iteration.

In the snippet below, we'll assume the data type is Float64.

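// Downcast each column to a Float64 ChunkedArray; each iterator yields Option<f64>.
// Recent Polars versions name the result alias PolarsResult (older releases exported it as Result).
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| Ok(s.f64()?.into_iter())).collect::<PolarsResult<Vec<_>>>()?;

for _row in 0..df.height() {
    for iter in &mut iters {
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}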
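
The loops above hand you values without saying which column they came from. One way to keep that information is to carry each column's name alongside its iterator. The following is only a sketch of that idea with plain f64 slices standing in for Polars columns; rows_with_names is a hypothetical helper, not part of the Polars API.

```rust
/// Walks several equally long columns in lockstep and returns each row as
/// (column name, value) pairs. Plain slices stand in for Polars columns here.
pub fn rows_with_names<'a>(columns: &[(&'a str, &[f64])]) -> Vec<Vec<(&'a str, f64)>> {
    // The number of rows is taken from the first column (0 if there are none).
    let height = columns.first().map_or(0, |(_, values)| values.len());

    // One iterator per column, paired with that column's name.
    let mut iters: Vec<(&'a str, std::slice::Iter<'_, f64>)> = columns
        .iter()
        .map(|(name, values)| (*name, values.iter()))
        .collect();

    let mut rows = Vec::with_capacity(height);
    for _ in 0..height {
        // Advance every column iterator by one step to assemble a single row.
        let row = iters
            .iter_mut()
            .map(|(name, iter)| {
                let value = iter.next().expect("columns should have equal length");
                (*name, *value)
            })
            .collect();
        rows.push(row);
    }
    rows
}
```

With real Polars columns the same pairing would presumably be done by zipping the column names with the iterators collected in the snippets above.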
Fielding answered 31/5, 2022 at 6:50 Comment(2)
This is good to know, but how does one access items in the row by key? This just iterates through them all, and you have no way to know which column you're in without keeping track of it yourself by column index. – Mere
One approach to this is to convert the dataframe into a Vec of structs and then iterate over that. – Mere
E
0

Add itertools to your Cargo.toml:

[dependencies]
itertools = "0.13.0"

ritchie46's method is still column iteration, not row iteration. Instead, the Series of the DataFrame can be taken out and iterated together with itertools::multizip, which yields one tuple per row. This is much more efficient than the built-in df.get_row function of polars.

    use itertools::multizip;
    use polars::prelude::*;

    #[derive(Debug)]
    pub struct Person {
        id: u32,
        name: String,
        age: u32,
    }

    fn main() -> PolarsResult<()> {
        let df = df!(
            "id" => &[1u32, 2, 3],
            "name" => &["John", "Jane", "Bobby"],
            "age" => &[32u32, 28, 45]
        )?;

        // Take ownership of the columns and downcast each one to its concrete type.
        let objects = df.take_columns();
        let id_ = objects[0].u32()?.iter();
        let name_ = objects[1].str()?.iter();
        let age_ = objects[2].u32()?.iter();

        // multizip advances all three iterators in lockstep, one tuple per row.
        let combined = multizip((id_, name_, age_));
        let res: Vec<_> = combined
            .map(|(a, b, c)| Person {
                // unwrap() panics on a null value; the sample data contains none.
                id: a.unwrap(),
                name: b.unwrap().to_owned(),
                age: c.unwrap(),
            })
            .collect();
        println!("{:?}", res);
        Ok(())
    }
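
If you would rather not add a dependency, the same lockstep iteration can be written with the standard library's zip. The sketch below uses slices of Option values as stand-ins for the typed Polars columns, and to_people is a hypothetical helper. Unlike the unwrap() calls above, it skips rows that contain a null instead of panicking, which is a design choice you may not want.

```rust
#[derive(Debug, PartialEq)]
pub struct Person {
    pub id: u32,
    pub name: String,
    pub age: u32,
}

/// Builds one Person per row from three columns, skipping any row that
/// contains a null (None). zip stops at the end of the shortest column.
pub fn to_people(
    ids: &[Option<u32>],
    names: &[Option<&str>],
    ages: &[Option<u32>],
) -> Vec<Person> {
    ids.iter()
        .zip(names)
        .zip(ages)
        // Nested zip yields ((id, name), age); `?` drops rows with a None.
        .filter_map(|((id, name), age)| {
            Some(Person {
                id: (*id)?,
                name: (*name)?.to_owned(),
                age: (*age)?,
            })
        })
        .collect()
}
```

The nested tuple pattern gets unwieldy beyond three or four columns, which is where multizip is more convenient.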
Excitant answered 1/10 at 10:51 Comment(0)
