python, pandas, parquet, zstd

Pandas DataFrame.to_parquet() and setting the Zstd compression level


I am writing out a compressed Parquet file from a DataFrame as follows:

result_df.to_parquet("my-data.parquet", compression="zstd")

How can I tell pandas which Zstd compression level to use?


Solution

  • With the pyarrow engine, you can pass compression_level as a keyword argument to to_parquet:

    result_df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=1)
    

    Test:

    import pandas as pd
    import pyarrow.parquet as pq
    
    path = 'my-data.parquet'
    result_df = pd.DataFrame({'a': range(100000)})
    
    for i in range(10):
        # create the file
        result_df.to_parquet(path, engine='pyarrow', compression='zstd', compression_level=i)
    
        # get compressed file size
        metadata = pq.ParquetFile(path).metadata.row_group(0).column(0)
        print(f'compression level {i}: {metadata.total_compressed_size}')
    

    Output:

    compression level 0: 346166
    compression level 1: 309501
    compression level 2: 309500
    compression level 3: 346166
    compression level 4: 355549
    compression level 5: 381823
    compression level 6: 310104
    compression level 7: 310088
    compression level 8: 308866
    compression level 9: 308866
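
    If you prefer not to rely on pandas forwarding keyword arguments, you can write the file with pyarrow directly, where compression_level is an explicit parameter of pyarrow.parquet.write_table. A minimal sketch (the file name and level 10 are arbitrary choices, not from the question):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'a': range(1000)})

    # Convert to an Arrow table and write with an explicit Zstd level
    table = pa.Table.from_pandas(df)
    pq.write_table(table, 'my-data.parquet',
                   compression='zstd', compression_level=10)

    # The codec actually used is recorded in the column chunk metadata
    codec = pq.ParquetFile('my-data.parquet').metadata.row_group(0).column(0).compression
    print(codec)

    This is equivalent to what the pyarrow engine does under the hood, and can be handy when you also want per-column codecs (compression accepts a dict mapping column names to codecs).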