I'm having an issue storing a large dataset (around 40GB) in a single parquet file.
I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a parquet file until it crashes once the file size in bytes exceeds the int32 maximum of 2147483647 (~2.1 GB):
Link to minimum reproducible example code
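For reference, the loop looks roughly like this (a minimal sketch rather than the exact linked code; the chunk size and column layout are placeholders):

```python
import os
import numpy as np
import pandas as pd
import fastparquet

PATH = "big_dataset.parquet"
N_ROWS = 1_000_000  # placeholder chunk size, roughly 80 MB of float64 data per chunk

chunk = 0
while True:
    # Generate an arbitrary chunk of data; the real column contents don't matter here.
    df = pd.DataFrame(np.random.rand(N_ROWS, 10),
                      columns=[f"col{i}" for i in range(10)])
    # Append to the single parquet file (create it on the first pass).
    fastparquet.write(PATH, df, append=os.path.exists(PATH))
    chunk += 1
    print(f"chunk {chunk}: file size = {os.path.getsize(PATH):,} bytes")
```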
Everything goes fine until the dataset hits 2.1GB, at which point I get the following errors:
OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'
Because the exception is ignored internally, it's very hard to figure out which specific thrift message it's upset about, or to get a stack trace. However, it is clearly linked to the file size exceeding the int32 range.
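As an aside, these "Exception ignored in: ..." messages are routed through Python's unraisable-exception machinery, so installing a custom sys.unraisablehook (Python 3.8+) can at least print whatever traceback is attached to the swallowed exception. A minimal sketch, not specific to fastparquet, and it may only show the Python-level frames for an error raised inside a Cython extension:

```python
import sys
import traceback

def report_unraisable(unraisable):
    # Called for exceptions that Python cannot propagate and would otherwise
    # only report as "Exception ignored in: ...".
    print(f"Unraisable exception in {unraisable.object!r}: {unraisable.exc_value!r}",
          file=sys.stderr)
    if unraisable.exc_traceback is not None:
        traceback.print_tb(unraisable.exc_traceback, file=sys.stderr)

sys.unraisablehook = report_unraisable
```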
Also, these thrift definitions come from the parquet format repo itself, so I wonder whether this is a limitation built into the design of the parquet format.
Finally, I figured out that I was running into a genuine bug in the Python library fastparquet, which resulted in a fix in the main library.
This is a link to the salient issue on GitHub.
The commit in which the issue is fixed is 89d16a2.