apache-spark, parquet

Updating values in an Apache Parquet file


I have a quite hefty parquet file in which I need to change the values of one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering whether there is a less expensive and overall easier solution.


Solution

  • Let's start with the basics:

    Parquet is a file format that needs to be saved in a file system.

    Key questions:

    1. Does parquet support append operations?
    2. Does the file system (namely, HDFS) allow append on files?
    3. Can the job framework (Spark) implement append operations?

    Answers:

    1. parquet.hadoop.ParquetFileWriter only supports CREATE and OVERWRITE modes; there is no append mode. (This could potentially change in other implementations; the Parquet format design itself does support append.)

    2. HDFS allows appending to files via the dfs.support.append property.
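For reference, that property is set in the cluster's hdfs-site.xml; a minimal fragment would look like this (property name taken from the answer above, placement assumed):

```xml
<!-- hdfs-site.xml: enable append support on HDFS files -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```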

    3. The Spark framework does not support appending to existing parquet files, and there are no plans to add it; see this JIRA.

    It is not a good idea to append to an existing file in a distributed system, especially since we might have two writers targeting the same file at the same time.