I have a huge DataFrame and want to split it into smaller files for better performance, because with a single file I can't read just a small piece without loading the whole DataFrame into memory. Here is the example code that writes the partitioned output:
import pandas as pd
import os
# Create a sample DataFrame with daily frequency
data = {
    "timestamp": pd.date_range(start="2023-01-01", periods=1000, freq="D"),
    "value": range(1000),  # must match the number of timestamps
}
df = pd.DataFrame(data)
# Add a column for year (to use as a partition key)
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
# Use the join method to expand the DataFrame (Cartesian product with a multiplier)
multiplier = pd.DataFrame({"replica": range(100)}) # Create a multiplier DataFrame
expanded_df = df.join(multiplier, how="cross") # Cartesian product using cross join
# Define the output directory
output_dir = "output_parquet"
# Save the expanded DataFrame to Parquet with year-based partitioning
expanded_df.to_parquet(
    output_dir,
    partition_cols=["year", "month"],  # Specify the partition columns
)
What is the best way to read from this dataset if I only need the data from 2023-12-01 to 2024-01-31?
If your Parquet output is properly partitioned, you can load the dataset selectively. Libraries such as PyArrow let you filter the data at the file or partition level, so that only the relevant data is loaded into memory.

Here's how you can do it with pyarrow.dataset:
import pyarrow.dataset as ds

dataset = ds.dataset("output_parquet", format="parquet", partitioning="hive")
# Only the year/month partitions covering December 2023 and January 2024 are scanned
date_filter = ((ds.field("year") == 2023) & (ds.field("month") == 12)) | ((ds.field("year") == 2024) & (ds.field("month") == 1))
filtered_table = dataset.to_table(filter=date_filter)
filtered_df = filtered_table.to_pandas()
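If you'd rather express the range directly on the timestamp column instead of translating it into partition keys, the same API accepts that too. Partition directories can't be pruned by a non-partition column, but the Parquet reader can still skip row groups whose min/max statistics fall outside the range. A minimal sketch, assuming a reasonably recent pyarrow and the same output_parquet layout as above:

import pandas as pd
import pyarrow.dataset as ds

dataset = ds.dataset("output_parquet", format="parquet", partitioning="hive")
# Predicate pushdown on the timestamp column: row groups whose statistics
# miss the requested range are skipped instead of being read and discarded
expr = (ds.field("timestamp") >= pd.Timestamp("2023-12-01")) & (ds.field("timestamp") <= pd.Timestamp("2024-01-31"))
filtered_df = dataset.to_table(filter=expr).to_pandas()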
You may also use pyarrow.parquet.ParquetDataset to achieve the same goal, though it is a somewhat older interface:

import pyarrow.parquet as pq

# Filters are in disjunctive normal form: the outer list is OR-ed, each inner list AND-ed
filters = [[("year", "=", 2023), ("month", "=", 12)], [("year", "=", 2024), ("month", "=", 1)]]
dataset = pq.ParquetDataset("output_parquet", filters=filters)
df = dataset.read().to_pandas()
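If you prefer to stay entirely in pandas, read_parquet forwards a filters argument to the pyarrow engine, so the same partition pruning applies. A short sketch under the same assumptions:

import pandas as pd

# Only the matching year/month partitions are read from disk
df = pd.read_parquet(
    "output_parquet",
    engine="pyarrow",
    filters=[[("year", "=", 2023), ("month", "=", 12)], [("year", "=", 2024), ("month", "=", 1)]],
)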