I'm using polars and I would like to define the type of the columns while loading a dataframe. In pandas, I can use dtype:
df=pd.read_csv("iris.csv", dtype={'petal_length':str})
I'm trying to do the same thing in polars, but so far without success. Here is what I have tried:
use polars::prelude::*;
use std::fs::File;
use std::collections::HashMap;
fn main() {
    let df = example();
    println!("{:?}", df.expect("Cannot find dataframe").head(Some(10)))
}
fn example() -> Result<DataFrame> {
    let file = File::open("iris.csv")
        .expect("could not read file");
    let mut myschema = HashMap::new();
    myschema.insert("sepal_length", f64);
    myschema.insert("sepal_width", f64);
    myschema.insert("petal_length", String);
    myschema.insert("petal_width", f64);
    myschema.insert("species", String);
    CsvReader::new(file)
        .with_schema(myschema)
        .has_header(true)
        .finish()
}
My question is: what type of data does with_schema expect? I printed the schema of the DataFrame loaded using infer_schema(None). This prints an object that looks like a dictionary:
Schema { fields: [Field { name: "sepal_length", data_type: Float64 }, Field { name: "sepal_width", data_type: Float64 }, Field { name: "petal_length", data_type: Float64 }, Field { name: "petal_width", data_type: Float64 }, Field { name: "species", data_type: Utf8 }] }
But I cannot figure out what object I should use to define my schema.
Also, is there a way to specify the type of one variable, instead of all of them?
The with_schema method expects an Arc<Schema>, not a HashMap.
The following code works:
use polars::prelude::*;
use std::sync::Arc;
fn example() -> Result<DataFrame> {
    let file = "iris.csv";
    // Build the schema from a list of fields, one per column, in file order.
    let myschema = Schema::new(
        vec![
            Field::new("sepal_length", DataType::Float64),
            Field::new("sepal_width", DataType::Float64),
            // Read petal_length as a string instead of the inferred Float64.
            Field::new("petal_length", DataType::Utf8),
            Field::new("petal_width", DataType::Float64),
            Field::new("species", DataType::Utf8),
        ]
    );
    CsvReader::from_path(file)?
        // with_schema takes the schema wrapped in an Arc.
        .with_schema(Arc::new(myschema))
        .has_header(true)
        .finish()
}
Also, is there a way to specify the type of one variable, instead of all of them?
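Yes, you can use with_dtype_overwrite, which expects a partial schema: a schema that lists only the columns whose types you want to set, while the remaining columns keep their inferred types.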
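To make the idea of a partial schema concrete, here is a minimal std-only sketch of the merge it implies: columns named in the override take the given type, and every other column keeps its inferred type. This is a model of the behaviour, not Polars's implementation; ColType and apply_overrides are hypothetical names made up for the illustration.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a column data type (hypothetical, not Polars's DataType).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Float64,
    Utf8,
}

/// Apply a partial override to an inferred schema.
/// Columns named in `overrides` take the override type; all other columns
/// keep their inferred type. Column order is preserved.
fn apply_overrides(
    inferred: &[(&str, ColType)],
    overrides: &HashMap<&str, ColType>,
) -> Vec<(String, ColType)> {
    inferred
        .iter()
        .map(|(name, inferred_type)| {
            // Fall back to the inferred type when the column is not overridden.
            let ty = overrides.get(*name).copied().unwrap_or(*inferred_type);
            (name.to_string(), ty)
        })
        .collect()
}
```

For the iris file, an override map holding only petal_length as Utf8 would leave the four other columns as inferred. With Polars itself, the partial schema would be a Schema holding only the petal_length field, passed to with_dtype_overwrite on the reader; the exact argument type has changed between releases, so check the documentation for your version.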