I'm looking for a faster way to subset a polars DataFrame by datetime. I tried two different approaches, and I believe a faster way must exist, because the same subsetting in pandas was faster:
import datetime
import polars as pl
start_date = datetime.datetime(2009, 8, 3, 0, 0)
end_date = datetime.datetime(2009, 11, 3, 0, 0)
# - 1st
%%timeit
df_window = df_stock.filter(
    (df_stock["date_time"] >= start_date)
    & (df_stock["date_time"] <= end_date)
)
# 2.61 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# - 2nd
%%timeit
df_window = df_stock.filter(
    (pl.col("date_time") >= start_date) & (pl.col("date_time") <= end_date)
)
# 12.5 ms ± 801 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Subsetting by datetime is faster in pandas:
df_test = df_stock.to_pandas()
df_test.set_index('date_time', inplace=True)
%%timeit
df_test['2009-8-3':'2009-8-3']
# 1.02 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So why is pandas faster than polars in this case? Is there a faster way to subset by datetime in polars?
Environment:
Could this just be an issue of benchmarking a relatively small query? My benchmarking shows the exact opposite.
Let's start with a good amount of data, so that the subsetting process represents the largest share of the work. I'll also shuffle
the data, so that issues of "sortedness" are not involved.
import time
import polars as pl
from datetime import datetime

df_stock = pl.DataFrame(
    {
        "date_time": pl.datetime_range(
            start=datetime(2000, 1, 1, 0, 0),
            end=datetime(2023, 1, 1, 0, 0),
            interval="1s",
            eager=True,
        ).shuffle(seed=0)
    }
)
df_stock
shape: (725_846_401, 1)
┌─────────────────────┐
│ date_time │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2022-01-02 21:46:35 │
│ 2003-12-16 18:57:19 │
│ 2018-06-20 14:31:08 │
│ 2020-04-09 07:18:38 │
│ … │
│ 2013-07-12 00:39:09 │
│ 2021-09-08 17:18:59 │
│ 2010-08-18 16:00:18 │
│ 2010-02-06 05:13:05 │
└─────────────────────┘
725 million records should be a good start.
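As a quick sanity check on that row count: 2000-01-01 through 2023-01-01 spans 8,401 days (six of those years contain a leap day), so at one row per second that is 8,401 × 86,400 = 725,846,400 rows, plus one for the inclusive endpoint.
from datetime import datetime

# One row per second from 2000-01-01 to 2023-01-01, both endpoints included.
n_rows = int((datetime(2023, 1, 1) - datetime(2000, 1, 1)).total_seconds()) + 1
print(n_rows)  # 725846401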
I've put together the three cases from your code into separate tests.
Test #1
start_date = datetime(2009, 8, 3, 0, 0)
end_date = datetime(2009, 11, 3, 0, 0)
start = time.perf_counter()
_ = df_stock.filter(
    (df_stock["date_time"] >= start_date) & (df_stock["date_time"] <= end_date)
)
print(time.perf_counter() - start)
Test #2
start = time.perf_counter()
_ = df_stock.filter(
    (pl.col("date_time") >= start_date) & (pl.col("date_time") <= end_date)
)
print(time.perf_counter() - start)
Test #3
df_test = df_stock.to_pandas()
df_test.set_index("date_time", inplace=True)
start = time.perf_counter()
_ = df_test["2009-8-3":"2009-11-3"]
print(time.perf_counter() - start)
Here are the timings, in seconds, for three runs of each test on my system (a 32-core machine).
Test #1: 1.1206, 1.1199, 1.1202
Test #2: 0.8342, 0.8215, 0.8227
Test #3: 4.6082, 4.5932, 4.6185
On my system, the Pandas indexing runs the slowest - by far. I could run this more than three times, but I think the pattern is quite clear.
Can you run a larger test on your system? Perhaps there is something else at play here...
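As for a faster form of subsetting by datetime: the range comparison can also be written as a single is_between expression, and the lazy API is worth benchmarking too, since it lets the query optimizer see the whole filter before executing it. Treat the snippet below as a sketch to time on your own data rather than a guaranteed speedup; note that the argument controlling boundary inclusion in is_between has changed across Polars versions (include_bounds in older releases, closed in newer ones).
start_date = datetime(2009, 8, 3, 0, 0)
end_date = datetime(2009, 11, 3, 0, 0)

# Same window filter expressed as one is_between expression; check how your
# Polars version handles boundary inclusion before relying on the endpoints.
df_window = df_stock.filter(
    pl.col("date_time").is_between(start_date, end_date)
)

# Lazy variant of the same filter; worth timing alongside the eager version.
df_window_lazy = (
    df_stock.lazy()
    .filter(pl.col("date_time").is_between(start_date, end_date))
    .collect()
)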
For reference, version information on my test environment ...
---Version info---
Polars: 0.15.7
Index type: UInt32
Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Python: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.24.0
fsspec: 2022.11.0
connectorx: 0.3.1
xlsx2csv: 0.8
matplotlib: 3.6.2