Tags: python, pandas, dataframe, python-polars

Why is Polars datetime subsetting slow?


I'm looking for a faster way to subset a Polars DataFrame by datetime. I tried two different approaches, and I suspect there is a faster one, because the same subsetting was faster when I tried it in Pandas:

start_date = datetime.datetime(2009, 8, 3, 0, 0)
end_date = datetime.datetime(2009, 11, 3, 0, 0)


# - 1st: filter with a boolean mask built from Series comparisons
%%timeit
df_window = df_stock.filter(
    (df_stock["date_time"] >= start_date)
    & (df_stock["date_time"] <= end_date)
)
# 2.61 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# - 2nd: the same filter, written as a polars expression
%%timeit
df_window = df_stock.filter(
    (pl.col("date_time") >= start_date) & (pl.col("date_time") <= end_date)
)
# 12.5 ms ± 801 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Subsetting by datetime is faster in Pandas:

df_test = df_stock.to_pandas()
df_test.set_index('date_time', inplace=True)

%%timeit
df_test['2009-8-3':'2009-11-3']
# 1.02 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So why is Pandas faster than Polars in this case? Is there a faster way to subset by datetime in Polars?

Environment:


Solution

  • Could this just be an issue of benchmarking a relatively small query? My benchmarking shows the exact opposite.

    Data

    Let's start with a good amount of data, so that the subsetting itself represents the largest share of the work. I'll also shuffle the data, so that "sortedness" is not a factor (there's a short sketch after the data preview showing what sortedness can buy you).

    import time
    import polars as pl
    from datetime import datetime
    
    df_stock = pl.DataFrame(
        {
            "date_time": pl.datetime_range(
                start=datetime(2000, 1, 1, 0, 0),
                end=datetime(2023, 1, 1, 0, 0),
                interval="1s",
                eager=True,
            ).shuffle(seed=0)
        }
    )
    df_stock
    
    shape: (725_846_401, 1)
    ┌─────────────────────┐
    │ date_time           │
    │ ---                 │
    │ datetime[μs]        │
    ╞═════════════════════╡
    │ 2022-01-02 21:46:35 │
    │ 2003-12-16 18:57:19 │
    │ 2018-06-20 14:31:08 │
    │ 2020-04-09 07:18:38 │
    │ …                   │
    │ 2013-07-12 00:39:09 │
    │ 2021-09-08 17:18:59 │
    │ 2010-08-18 16:00:18 │
    │ 2010-02-06 05:13:05 │
    └─────────────────────┘
    

    725 million records should be a good start.
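
    A quick aside on the "sortedness" point above: if your real data is sorted by date_time (as time series often are), you don't need to scan every row at all. You can locate the window bounds with a binary search and take a contiguous slice, which is essentially what Pandas does with a monotonic DatetimeIndex. Here is a minimal sketch of that idea; df_sorted and the bound variables are illustrative names, and search_sorted with a side argument needs a reasonably recent Polars version.

    # For illustration, sort once up front; with data that is already
    # sorted by date_time, this step is unnecessary.
    df_sorted = df_stock.sort("date_time")
    s = df_sorted["date_time"]

    # Binary-search the window bounds on the sorted column, then take a
    # contiguous slice instead of filtering every row.
    lo = s.search_sorted(start_date, side="left")   # first row >= start_date
    hi = s.search_sorted(end_date, side="right")    # first row > end_date
    df_window = df_sorted.slice(lo, hi - lo)
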

    Test Cases

    I've put together the three cases from your code into separate tests.

    Test #1

    start_date = datetime(2009, 8, 3, 0, 0)
    end_date = datetime(2009, 11, 3, 0, 0)
    
    start = time.perf_counter()
    _ = df_stock.filter(
        (df_stock["date_time"] >= start_date) & (df_stock["date_time"] <= end_date)
    )
    print(time.perf_counter() - start)
    

    Test #2

    start = time.perf_counter()
    _ = df_stock.filter(
        (pl.col("date_time") >= start_date) & (pl.col("date_time") <= end_date)
    )
    print(time.perf_counter() - start)
    

    Test #3

    df_test = df_stock.to_pandas()
    df_test.set_index("date_time", inplace=True)
    start = time.perf_counter()
    _ = df_test["2009-8-3":"2009-11-3"]
    print(time.perf_counter() - start)
    

    Here are the timings (in seconds) for three runs of each test on my 32-core system.

    Test #1: 1.1206, 1.1199, 1.1202

    Test #2: 0.8342, 0.8215, 0.8227

    Test #3: 4.6082, 4.5932, 4.6185

    On my system, the Pandas indexing is by far the slowest. I could run this more than three times, but I think the pattern is quite clear.

    Can you run a larger test on your system? Perhaps there is something else at play here...
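
    If you want mean-and-spread numbers comparable to %%timeit output outside of IPython, a small harness along these lines works; this is just a sketch that re-uses the df_stock, start_date and end_date defined above.

    import timeit

    # Run the expression-based filter 5 times per repeat, 3 repeats,
    # and print the mean seconds per call for each repeat.
    times = timeit.repeat(
        lambda: df_stock.filter(
            (pl.col("date_time") >= start_date) & (pl.col("date_time") <= end_date)
        ),
        repeat=3,
        number=5,
    )
    print([round(t / 5, 4) for t in times])
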

    For reference, here is the version information for my test environment:

    ---Version info---
    Polars: 0.15.7
    Index type: UInt32
    Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
    Python: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
    ---Optional dependencies---
    pyarrow: 10.0.1
    pandas: 1.5.2
    numpy: 1.24.0
    fsspec: 2022.11.0
    connectorx: 0.3.1
    xlsx2csv: 0.8
    matplotlib: 3.6.2