I am writing out a compressed Parquet file from a DataFrame as follows:
result_df.to_parquet("my-data.parquet", compression="zstd")
How can I tell pandas which zstd compression level to use?
With the pyarrow engine you can pass compression_level as a keyword argument to to_parquet:
result_df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=1)
Test:
import pandas as pd
import pyarrow.parquet as pq
path = 'my-data.parquet'
result_df = pd.DataFrame({'a': range(100000)})
for i in range(10):
    # create the file at this compression level
    result_df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=i)
    # get the compressed size of the first column chunk
    metadata = pq.ParquetFile(path).metadata.row_group(0).column(0)
    print(f'compression level {i}: {metadata.total_compressed_size}')
Output:
compression level 0: 346166
compression level 1: 309501
compression level 2: 309500
compression level 3: 346166
compression level 4: 355549
compression level 5: 381823
compression level 6: 310104
compression level 7: 310088
compression level 8: 308866
compression level 9: 308866