I am currently comparing different DataFrame-based libraries in Python and Rust. Of course I also checked pola.rs, as that library can be used from both programming languages.
I tried to write the same code in both languages, as far as I understood and was able to. At first, with the initial test data (CSV files with 2k and 1k lines), the performance of both is roughly the same.
Then I increased the data by copying the same content into the same files another 9 times, producing 20k and 10k lines. The Python version is only slightly slower with the bigger files; the Rust version takes 8 times as long.
I don't understand why. Hopefully someone can give me a pointer on where I am using the Rust library incorrectly and thus tanking the performance.
First, for comparison, the Python code:
import polars as pl

LOCATION = "London Westminster"


def main():
    no2_quality = pl.read_csv("data/air_quality_no2_long.csv", try_parse_dates=True)
    no2_quality = no2_quality.filter(pl.col("location") == LOCATION)
    no2_grouped = no2_quality.group_by(
        (pl.col("date.utc").dt.date()).alias("date"),
        (pl.col("date.utc").dt.hour()).alias("hour"),
        maintain_order=True,
    ).sum()

    pm25_quality = pl.read_csv("data/air_quality_pm25_long.csv", try_parse_dates=True)
    pm25_quality = pm25_quality.filter(pl.col("location") == LOCATION)
    pm25_grouped = pm25_quality.group_by(
        (pl.col("date.utc").dt.date()).alias("date"),
        (pl.col("date.utc").dt.hour()).alias("hour"),
        maintain_order=True,
    ).sum().sort(["date", "hour"])

    by_day = no2_grouped.group_by(
        pl.col("date"),
    ).sum()
    by_day = by_day.sort("value", descending=True)

    top = by_day.head(3).select(
        pl.col("date"),
        pl.col("value"),
    )
    print(top)

    top_hourly = (top.join(no2_grouped, on="date", how="left", suffix="_no2")
                  .join(pm25_grouped, on=["date", "hour"], how="left", suffix="_pm25"))
    top_hourly = top_hourly.group_by(
        pl.col("hour"),
    ).mean().sort("hour")
    print(top_hourly.select(
        pl.col("hour"),
        pl.col("value_no2"),
        pl.col("value_pm25"),
    ))

    bottom = by_day.tail(3).select(
        pl.col("date"),
        pl.col("value"),
    )
    print(bottom)

    bottom_hourly = (bottom.join(no2_grouped, on="date", how="left", suffix="_no2")
                     .join(pm25_grouped, on=["date", "hour"], how="left", suffix="_pm25")).sort(["date", "hour"])
    bottom_hourly = bottom_hourly.group_by(
        pl.col("hour"),
    ).mean().sort("hour")
    print(bottom_hourly.select(
        pl.col("hour"),
        pl.col("value_no2"),
        pl.col("value_pm25"),
    ))


if __name__ == "__main__":
    main()
Now the Rust code:
use polars::prelude::*;

const LOCATION: &str = "London Westminster";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let no2_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;
    let no2_quality = no2_quality.lazy().filter(col("location").eq(lit(LOCATION)));
    let no2_grouped = no2_quality
        .clone()
        .group_by_stable([
            col("date.utc").dt().date().alias("date"),
            col("date.utc").dt().hour().alias("hour"),
        ])
        .agg([sum("value")]);

    let pm25_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_pm25_long.csv".into()))?
        .finish()?;
    let pm25_quality = pm25_quality
        .lazy()
        .filter(col("location").eq(lit(LOCATION)));
    let pm25_grouped = pm25_quality
        .group_by_stable([
            col("date.utc").dt().date().alias("date"),
            col("date.utc").dt().hour().alias("hour"),
        ])
        .agg([sum("value")]);

    let by_day = no2_quality
        .group_by([col("date.utc").dt().date().alias("date")])
        .agg([sum("value")]);
    let by_day = by_day.sort(
        ["value"],
        SortMultipleOptions::default().with_order_descending(true),
    );

    let top = by_day.clone().limit(3).collect()?;
    println!("{}", top);

    let top_hourly = top.lazy().join(
        no2_grouped.clone(),
        [col("date")],
        [col("date")],
        JoinArgs::default().with_suffix(Some("_no2".into())),
    );
    let top_hourly = top_hourly.join(
        pm25_grouped.clone(),
        [col("date"), col("hour")],
        [col("date"), col("hour")],
        JoinArgs::default().with_suffix(Some("_pm25".into())),
    );
    let top_hourly = top_hourly
        .group_by([col("hour")])
        .agg([mean("value_no2"), mean("value_pm25")])
        .sort(["hour"], SortMultipleOptions::default())
        .collect()?;
    println!("{}", top_hourly);

    let bottom = by_day.tail(3).collect()?;
    println!("{}", bottom);

    let bottom_hourly = bottom.lazy().join(
        no2_grouped.clone(),
        [col("date")],
        [col("date")],
        JoinArgs::default().with_suffix(Some("_no2".into())),
    );
    let bottom_hourly = bottom_hourly.join(
        pm25_grouped.clone(),
        [col("date"), col("hour")],
        [col("date"), col("hour")],
        JoinArgs::default().with_suffix(Some("_pm25".into())),
    );
    let bottom_hourly = bottom_hourly
        .group_by([col("hour")])
        .agg([mean("value_no2"), mean("value_pm25")])
        .sort(["hour"], SortMultipleOptions::default())
        .collect()?;
    println!("{}", bottom_hourly);

    Ok(())
}
The input files are from the pandas examples.
It's not just a missing compiler optimization, as I run a release build (cargo build --release) for both file sizes.
To compare the Python version to the Rust version, and the small files to the bigger ones, I use the shell-integrated tool time. Everything runs on the same hardware (MacBook M3 Pro) on the same OS.
2k / 1k:
================
CPU 131%
user 0.116
system 0.032
total 0.113
20k / 10k:
================
CPU 115%
user 0.934
system 0.038
total 0.842
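As an aside: to measure only a specific section from inside the program instead of the whole process, the standard library's std::time::Instant can be used as well. A minimal sketch, reusing the same reader options as in the code above:

use polars::prelude::*;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let start = Instant::now();
    // Measure only the CSV read, with the same options as in the full program.
    let df = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;
    println!("read {} rows in {:?}", df.height(), start.elapsed());
    Ok(())
}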
A different DataFrame library in Rust does not suffer such a performance penalty with bigger files, so I don't think it is a language issue per se.
First I reduced the number of collect calls to only 4, right before the individual println! calls (the code above has been updated to reflect that change). Reducing the intermediate variables does not have any additional effect. A minimal sketch of the pattern follows after the timing.
================
CPU 116%
user 0.967
system 0.046
total 0.870
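For illustration, this is the pattern in isolation: keep the whole query lazy and call collect() exactly once per result that gets printed. A minimal, self-contained sketch with made-up data (not the air-quality files):

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Tiny hypothetical frame, only to demonstrate the lazy/collect pattern.
    let df = df![
        "location" => ["London Westminster", "Paris", "London Westminster"],
        "value" => [10i64, 20, 30],
    ]?;

    // All transformations stay lazy ...
    let out = df
        .lazy()
        .filter(col("location").eq(lit("London Westminster")))
        .group_by([col("location")])
        .agg([col("value").sum()])
        .collect()?; // ... and are materialized only once, right before printing.
    println!("{}", out);
    Ok(())
}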
I added lto = "fat" in the Cargo.toml:
[profile.release]
lto = "fat"
Effect:
================
CPU 115%
user 1.031
system 0.043
total 0.925
Consistently slightly higher than without. Do I need to combine that with some other optimization?
[profile.release]
codegen-units = 1
debug = false
lto = "fat"
Effect:
================
CPU 116%
user 0.931
system 0.047
total 0.842
without lto=fat:
================
CPU 116%
user 0.983
system 0.049
total 0.889
with lto=fat:
================
CPU 116%
user 1.033
system 0.055
total 0.935
Using a nightly build installed via rustup and
[dependencies]
polars = { version = "0.46.0", features = ["lazy", "nightly"] }
Still had no major impact:
================
CPU 116%
user 0.925
system 0.046
total 0.836
As none of the hints from the comments brought any measurable improvement, I am beginning to suspect that the problem is not in processing the data, but in reading it.
So I made the following test version:
use jemallocator::Jemalloc;
use polars::prelude::*;

// Use jemalloc as the global allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

const LOCATION: &str = "London Westminster";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Only read the two CSV files with the same options as before and discard
    // the results, to isolate the cost of the CSV reading itself.
    let _no2_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;
    let _pm25_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_pm25_long.csv".into()))?
        .finish()?;
    Ok(())
}
Release-build it without any features and let it run:
================
CPU 111%
user 0.925
system 0.016
total 0.842
So I currently tend to see this as a bug in the CSV reading.
Thanks for all your responses, and sorry for not having seen this earlier.
Remove .with_infer_schema_length(None)
I opened a ticket (https://github.com/pola-rs/polars/issues/21298) on the pola.rs GitHub repository and received the answer there, which I want to share here.
The most relevant comment: https://github.com/pola-rs/polars/issues/21298#issuecomment-2717919596
My interpretation of how the date/time inference parsing works:
- "infer schema" tries up to 188 string patterns, 1x or 2x
- it does so on every field in scope
- it does so for every row in scope, up to infer_schema_length
- if infer_schema_length is not set, it defaults to 100 rows
- if set to None, it processes (every field in) every row
Note that the inference is not cheap and can significantly impact the performance.
After I removed the .with_infer_schema_length(None) lines from my Rust code, the performance increased significantly, also for the smaller files.
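For illustration, the adjusted read then looks like this (a sketch of one of the two reads): with the call removed, schema inference falls back to the default of the first 100 rows; it could also be bounded explicitly via .with_infer_schema_length(Some(100)) if the default is not enough.

    // Same read as before, only without with_infer_schema_length(None),
    // so schema inference only looks at the default first 100 rows.
    let no2_quality = CsvReadOptions::default()
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;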
For reference, small file, unchanged, release-compiled with rustc 1.85.0 (4d91de4e4 2025-02-17):
================
CPU 132%
user 0.112
system 0.030
total 0.108
Changed, small file:
================
CPU 265%
user 0.036
system 0.038
total 0.028
Changed, big file:
================
CPU 497%
user 0.145
system 0.047
total 0.039