I have a quite hefty parquet file where I need to change values for one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering if there is a less expensive and overall easier solution to this.
Let's start with basics:
Parquet is a file format that needs to be saved in a file system.
Key questions:

1. Does the parquet format support append operations?
2. Does HDFS allow append on files?
3. Does the Spark framework support append operations?

Answers:
1. parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE; there is no append mode. (Not sure, but this could potentially change in other implementations -- the parquet design does support append.)
2. HDFS allows append on files using the dfs.support.append property.
3. The Spark framework does not support appending to existing parquet files, and there are no plans to; see this JIRA. (For what Spark's "append" save mode actually does instead, see the sketch after this list.)
It is not a good idea to append to an existing file in a distributed system, especially given that there might be two writers at the same time.
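Coming back to the original question: since the parquet files themselves cannot be edited in place, the usual approach is a read-modify-write job. A minimal PySpark sketch, assuming hypothetical paths (/data/events, /data/events_fixed) and a hypothetical status column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rewrite-column").getOrCreate()

# Read the existing parquet dataset.
df = spark.read.parquet("/data/events")

# Example transformation: replace a sentinel value in the "status" column.
updated = df.withColumn(
    "status",
    F.when(F.col("status") == "UNKNOWN", "PENDING").otherwise(F.col("status")),
)

# Write the result to a new location rather than appending or editing in place.
updated.write.mode("overwrite").parquet("/data/events_fixed")
```

Writing to a separate path (and swapping directories afterwards if needed) avoids overwriting the input while it is still being read, and keeps the old data around until the rewrite has been verified.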