python machine-learning dataset parquet fastparquet

Can a parquet file exceed 2.1GB?


I'm having an issue storing a large dataset (around 40GB) in a single parquet file.

I'm using the fastparquet library to append pandas DataFrames to this parquet dataset file. The following minimal example program appends chunks to a parquet file until it crashes, which happens once the file size in bytes exceeds the int32 maximum of 2147483647 (about 2.1GB):

Link to minimum reproducible example code

Everything goes fine until the dataset hits 2.1GB, at which point I get the following errors:

OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'

Because the exception is ignored internally, no usable stack trace is produced, so it's very hard to figure out which specific Thrift structure is being written when the error occurs. However, it's clearly linked to the file size exceeding the int32 range.
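To make the threshold concrete, here is a small standard-library sketch of what "exceeding the int32 range" means. It does not reproduce fastparquet's internals; the helper name `fits_in_int32` is an assumption for illustration only.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the threshold quoted above


def fits_in_int32(value: int) -> bool:
    """Return True if value can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)
    except struct.error:
        return False
    return True


print(fits_in_int32(INT32_MAX))      # True
print(fits_in_int32(INT32_MAX + 1))  # False
```

Any byte offset or size stored in a 32-bit field fails in this way as soon as the file grows one byte past 2147483647.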

Also, these Thrift definitions come from the parquet format repo itself, so I wonder whether this is a limitation built into the design of the parquet format.


Solution

  • In the end, I figured out that I was running into a genuine bug in the Python library fastparquet, not a limit of the parquet format. Reporting it resulted in a fix in the main library.

    This is a link to the relevant issue on GitHub.

    The commit in which the issue is fixed is 89d16a2.