Efficiently build a Polars DataFrame row by row in Rust
Q
I would like to create a large Polars DataFrame using Rust, building it up row by row using data scraped from web pages. What is an efficient way to do this?

It looks like the DataFrame should be created from a Vec of Series rather than adding rows to an empty DataFrame. However, how should a Series be built up efficiently? I could create a Vec and then create a Series from the Vec, but that sounds like it will end up copying all elements. Is there a way to build up a Series element-by-element, and then build a DataFrame from those?

I will actually be building up several DataFrames in parallel using Rayon, then combining them, but it looks like vstack does what I want there. It's the creation of the individual DataFrames that I can't find out how to do efficiently.

I did look at the source of the CSV parser, but it is complicated and highly optimised. Is there a simpler approach that is still reasonably efficient?

Quartziferous answered 23/3, 2022 at 13:57
pub fn from_vec(
    name: &str,
    v: Vec<<T as PolarsNumericType>::Native>
) -> ChunkedArray<T>

Create a new ChunkedArray by taking ownership of the Vec. This operation is zero copy.

See ChunkedArray::from_vec in the Polars API docs. You can then call into_series on the result.

Benjaminbenji answered 23/3, 2022 at 16:12
Thanks. It looks like this will work for numeric types, but not strings. However, your answer pointed me in the direction of Utf8ChunkedBuilder, which looks like it can. – Quartziferous

The simplest, if perhaps not the most performant, answer is to maintain a map of vectors and turn them all at once into the Series that are fed to a DataFrame.

let mut columns = BTreeMap::new();
for datum in get_data_from_web() {
    // For simplicity suppose datum is itself a BTreeMap
    // (more likely it's a serde_json::Value).
    // It's assumed that every datum has the same keys; if not, the
    // Vecs won't have the same length.
    // It's also assumed that the values of datum are all of the same known type.

    for (k, v) in datum {
        columns.entry(k).or_insert_with(Vec::new).push(v);
    }
}

let df = DataFrame::new(
    columns.into_iter()
        .map(|(name, values)| Series::new(&name, values))
        .collect::<Vec<_>>()
    ).unwrap();
Kara answered 24/3, 2022 at 3:54
Thanks. That is easy to understand, but I'd like to minimise copying. The ChunkedArray method proposed by Jakub looks good in that respect for numeric data, and Utf8ChunkedBuilder looks good for strings. – Quartziferous
