rust, rust-polars

How to define column types while loading a DataFrame in polars?


I'm using polars and I would like to define the type of the columns while loading a dataframe. In pandas, I can use dtype:

df=pd.read_csv("iris.csv", dtype={'petal_length':str})

I'm trying to do the same thing in polars, but without success so far. Here is what I have tried:

use polars::prelude::*;
use std::fs::File;
use std::collections::HashMap;


fn main() {
    let df = example();
    println!("{:?}", df.expect("Cannot find dataframe").head(Some(10)))
}

fn example() -> Result<DataFrame> {
    let file = File::open("iris.csv")
                    .expect("could not read file");
    let mut myschema = HashMap::new();
    myschema.insert("sepal_length", f64);
    myschema.insert("sepal_width", f64); 
    myschema.insert("petal_length",String); 
    myschema.insert("petal_width", f64); 
    myschema.insert("species", String); 

    CsvReader::new(file)
            .with_schema(myschema)
            .has_header(true)
            .finish()
}

My question is: what type does with_schema expect? I printed the schema of the DataFrame loaded using infer_schema(None). This prints an object that looks like a dictionary:

Schema { fields: [Field { name: "sepal_length", data_type: Float64 }, Field { name: "sepal_width", data_type: Float64 }, Field { name: "petal_length", data_type: Float64 }, Field { name: "petal_width", data_type: Float64 }, Field { name: "species", data_type: Utf8 }] }

But I cannot figure out what object I should use to build my schema.

Also, is there a way to specify the type of one variable instead of all of them?


Solution

  • The with_schema method expects an Arc<Schema>, not a HashMap.

    The following code works:

    use polars::prelude::*;
    use std::sync::Arc;
    
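    fn example() -> Result<DataFrame> {
        let file = "iris.csv";

        // A Schema is built from a list of Fields, each pairing a column name with a DataType.
        let myschema = Schema::new(
            vec![
                Field::new("sepal_length", DataType::Float64),
                Field::new("sepal_width", DataType::Float64),
                // Read petal_length as a string instead of the inferred Float64.
                Field::new("petal_length", DataType::Utf8),
                Field::new("petal_width", DataType::Float64),
                Field::new("species", DataType::Utf8),
            ]
        );

        // with_schema takes the schema wrapped in an Arc.
        CsvReader::from_path(file)?
            .with_schema(Arc::new(myschema))
            .has_header(true)
            .finish()
    }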
    

    Also, is there a way to specify the type of one variable instead of all of them?

    Yes, you can use with_dtype_overwrite, which expects a partial schema.
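
    To make "partial schema" concrete, here is a minimal sketch in plain Rust (no polars dependency) of the idea: columns named in the partial schema take the given type, and every other column keeps its inferred type. DType, apply_overrides and iris_example are hypothetical names made up for this illustration; they are not part of the polars API, and the sketch only models the behaviour rather than showing how polars implements it.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a column type; this is NOT polars' `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DType {
    Float64,
    Utf8,
}

/// Apply a partial schema on top of an inferred one, preserving column order.
/// Columns named in `overrides` take the override type; the rest are unchanged.
/// Assumption of this sketch: override names that match no column are ignored.
fn apply_overrides(
    inferred: &[(&str, DType)],
    overrides: &HashMap<&str, DType>,
) -> Vec<(String, DType)> {
    inferred
        .iter()
        .map(|(name, dtype)| {
            let resolved = overrides.get(name).copied().unwrap_or(*dtype);
            (name.to_string(), resolved)
        })
        .collect()
}

/// The iris columns from the question, with only `petal_length` overridden.
fn iris_example() -> Vec<(String, DType)> {
    // What inference reported in the question: four floats and one string.
    let inferred = [
        ("sepal_length", DType::Float64),
        ("sepal_width", DType::Float64),
        ("petal_length", DType::Float64),
        ("petal_width", DType::Float64),
        ("species", DType::Utf8),
    ];

    // The partial schema mentions a single column.
    let mut overrides = HashMap::new();
    overrides.insert("petal_length", DType::Utf8);

    apply_overrides(&inferred, &overrides)
}
```

    Calling iris_example() yields the five columns in their original order, with petal_length resolved to Utf8 and the others left as inferred. With the real reader you would pass a Schema containing only the petal_length field to with_dtype_overwrite; check the documentation of your polars version for the exact argument type, since the CSV reader API has changed between releases.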