How to define types of columns while loading dataframe in polars?

I'm using polars and I would like to define the type of the columns while loading a dataframe. In pandas, I can use dtype:

df=pd.read_csv("iris.csv", dtype={'petal_length':str})

I'm trying to do the same thing in polars, so far without success. Here is what I have tried:

use polars::prelude::*;
use std::fs::File;
use std::collections::HashMap;


fn main() {
    let df = example();
    println!("{:?}", df.expect("Cannot find dataframe").head(Some(10)))
}

fn example() -> Result<DataFrame> {
    let file = File::open("iris.csv")
                    .expect("could not read file");
    let mut myschema = HashMap::new();
    myschema.insert("sepal_length", f64);
    myschema.insert("sepal_width", f64); 
    myschema.insert("petal_length",String); 
    myschema.insert("petal_width", f64); 
    myschema.insert("species", String); 

    CsvReader::new(file)
            .with_schema(myschema)
            .has_header(true)
            .finish()
}

My question is: what type does with_schema expect? I printed the schema of a DataFrame loaded with infer_schema(None). It prints an object that looks like a dictionary:

Schema { fields: [Field { name: "sepal_length", data_type: Float64 }, Field { name: "sepal_width", data_type: Float64 }, Field { name: "petal_length", data_type: Float64 }, Field { name: "petal_width", data_type: Float64 }, Field { name: "species", data_type: Utf8 }] }

But I cannot figure out what object I should use to build my own schema.

Also, is there a way to specify the type of a single column instead of all of them?

Lockhart answered 16/4, 2021 at 17:14 Comment(0)

The with_schema method expects an Arc<Schema>, not a HashMap.

The following code works:

use polars::prelude::*;
use std::sync::Arc;

fn example() -> Result<DataFrame> {
    let file = "iris.csv";

    let myschema = Schema::new(
        vec![
            Field::new("sepal_length", DataType::Float64),
            Field::new("sepal_width", DataType::Float64),
            Field::new("petal_length", DataType::Utf8),
            Field::new("petal_width", DataType::Float64),
            Field::new("species", DataType::Utf8),
        ]
    );

    CsvReader::from_path(file)?
        .with_schema(Arc::new(myschema))
        .has_header(true)
        .finish()
}
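
As a side note on why the HashMap attempt in the question does not compile: f64 and String are type names, not values, so they cannot be passed to insert. Column types have to be ordinary values, which is the role DataType::Float64 and DataType::Utf8 play above. The sketch below uses only the standard library; ColumnType and iris_types are hypothetical names invented for illustration and are not part of polars.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for polars' DataType enum; only the two
// variants that appear in this thread are modelled.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Float64,
    Utf8,
}

// Build a column-name -> column-type map for the iris columns.
// The map values are enum *values*. Writing `f64` or `String` in their
// place would be rejected by the compiler, because a type name is not
// an expression.
fn iris_types() -> HashMap<&'static str, ColumnType> {
    let mut types = HashMap::new();
    types.insert("sepal_length", ColumnType::Float64);
    types.insert("sepal_width", ColumnType::Float64);
    types.insert("petal_length", ColumnType::Utf8);
    types.insert("petal_width", ColumnType::Float64);
    types.insert("species", ColumnType::Utf8);
    types
}
```

Even with valid values, a plain HashMap has no column order, which may be one reason a schema is built from an ordered list of fields instead.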

"Also, is there a way to specify the type of a single column instead of all of them?"

Yes, you can use with_dtype_overwrite, which expects a partial schema, that is, one listing only the columns you want to override.
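
To make the idea of a partial schema concrete, here is a minimal standard-library sketch of the merge it implies: columns named in the override keep the type you give them, and every other column keeps its inferred type. This is not how polars implements it; ColumnType and apply_overrides are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for polars' DataType enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Float64,
    Utf8,
}

// Apply a partial set of overrides on top of an inferred schema.
// Column order follows `inferred`; columns that are not named in
// `overrides` keep their inferred type.
fn apply_overrides<'a>(
    inferred: &[(&'a str, ColumnType)],
    overrides: &HashMap<&str, ColumnType>,
) -> Vec<(&'a str, ColumnType)> {
    inferred
        .iter()
        .map(|&(name, inferred_type)| {
            let final_type = overrides.get(name).copied().unwrap_or(inferred_type);
            (name, final_type)
        })
        .collect()
}
```

For the iris file, passing an override map that contains only petal_length mapped to ColumnType::Utf8 would leave the other four columns at whatever type was inferred for them.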

Chaisson answered 17/4, 2021 at 7:19 Comment(2)
This code no longer compiles today. When I try it, I get the message: argument of type Vec<polars::prelude::Field> unexpected. Any idea? - Waterspout
You might have to turn the vector into an iterator using .into_iter(). - Vitrain

A slight update to ritchie46's answer. As Robert stated, the vector needs to be turned into an iterator, and it looks like we should now use Schema::from instead of Schema::new. I have not run the code below, but it compiles.

...
        let myschema = Schema::from(
            vec![
                Field::new("sepal_length", DataType::Float64),
                Field::new("sepal_width", DataType::Float64),
                Field::new("petal_length", DataType::Utf8),
                Field::new("petal_width", DataType::Float64),
                Field::new("species", DataType::Utf8),
            ]
            .into_iter(),
        );
...
Fawkes answered 4/3, 2023 at 18:29 Comment(0)

As of today, the code above that uses Schema::new no longer compiles. The solution is to use Schema::from_iter:

    let myschema = Schema::from_iter(
        vec![
            Field::new("sepal_length", DataType::Float64),
            Field::new("sepal_width", DataType::Float64),
            Field::new("petal_length", DataType::String),
            Field::new("petal_width", DataType::Float64),
            Field::new("species", DataType::String),
        ]
    );
Mutual answered 17/8 at 14:23 Comment(0)
