Tags: parquet, apache-iceberg

Renamed column is returning null from existing data


I'm experimenting with Apache Iceberg and trying to understand how column renaming works. In my scenario, I'm working with an existing data lake of Parquet files stored in AWS S3. My goal is to create Iceberg tables from the existing files, without having to move or rewrite any data.

All the information I can find regarding column renaming seems to suggest that it should just work. Using the Iceberg Java SDK:

icebergTable.updateSchema()
    .renameColumn("old_name", "new_name")
    .commit();

When I do this though, and query existing data (where the column is stored as 'old_name' in the Parquet files), I get all null values returned for column 'new_name'. I was expecting the original 'old_name' values to be mapped into the 'new_name' column.

Is this expectation valid? Am I missing something about how Iceberg column renaming works?


Edit: additional detail (for posterity - I'm not sure anyone will have an answer for this)

I've narrowed down the issue a bit further, and it appears to apply only to data in Parquet files that were originally created by something other than Iceberg (e.g. Parquet files written using the standard Apache Parquet libs). These files can be added to an Iceberg table using the appendFile() function (https://iceberg.apache.org/javadoc/1.6.1/org/apache/iceberg/AppendFiles.html#appendFile(org.apache.iceberg.DataFile) ). Data created this way, then appended to an Iceberg table, does not appear to properly track column renames.

Interestingly, a Parquet file that was originally created by Iceberg can also be appended to another Iceberg table in the same way, and that data does properly track column renames, even if the file was copied and/or moved from its original Iceberg location. So it seems there's something unique about the Parquet files created by Iceberg that allows them to track column renames.
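
To make the difference between the two kinds of files concrete, here is a toy model in plain Java. It is not Iceberg code: the class FieldIdResolutionSketch, the FileColumn type, and the readById helper are all hypothetical names made up for illustration. It assumes that a table column is looked up in a file purely by a numeric field ID, and that a file column without an ID simply cannot be matched. What a real reader does with ID-less files may differ; the model returns null for them only because that matches what I observed.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: NOT Iceberg code, just an illustration of ID-based column lookup.
final class FieldIdResolutionSketch {

    // A column as stored in a data file: its name at write time,
    // an optional field ID (null when the writer did not record one), and a value.
    static final class FileColumn {
        final String name;
        final Integer fieldId;
        final String value;

        FileColumn(String name, Integer fieldId, String value) {
            this.name = name;
            this.fieldId = fieldId;
            this.value = value;
        }
    }

    // Look a table column up by field ID only.
    // Returns null when no file column carries the requested ID.
    static String readById(int fieldId, FileColumn... fileColumns) {
        for (FileColumn c : fileColumns) {
            if (c.fieldId != null && c.fieldId == fieldId) {
                return c.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Table schema: field ID -> current column name.
        Map<Integer, String> schema = new HashMap<>();
        schema.put(1, "old_name");
        schema.put(1, "new_name"); // the rename changes the name, never the ID

        FileColumn writtenWithId = new FileColumn("old_name", 1, "a");
        FileColumn writtenWithoutId = new FileColumn("old_name", null, "b");

        System.out.println(schema.get(1) + " = " + readById(1, writtenWithId));
        System.out.println(schema.get(1) + " = " + readById(1, writtenWithoutId));
    }
}
```

In this model the file that carries an ID keeps resolving after the rename, because the name stored in the file is never consulted, while the file without an ID has nothing to match on.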


Solution

  • It turns out I was missing an important piece during Iceberg table creation. Since the Parquet files that I'm adding were created outside of the Iceberg ecosystem, they lack some key metadata that Iceberg adds when it writes Parquet: namely the field-id.

    To support adding externally created Parquet files, Iceberg provides the ability to define this metadata on the table itself. This is done with the schema.name-mapping.default property (described here: https://iceberg.apache.org/spec/#name-mapping-serialization). It sounds like this property should be used any time non-Iceberg Parquet files are included in an Iceberg table.

    Specifically, in this case, the table is created using the following code:

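    import org.apache.iceberg.TableProperties

    // Each "field-id" must match the ID of the corresponding column in the table schema;
    // "names" lists the column names as they appear in the externally written files.
    val nameMapping = """[
        {"field-id": 1, "names": ["old_name"]}
    ]"""
    catalog.buildTable(tableId, schema)
        .withPartitionSpec(partitionSpec)
        .withProperties(mapOf(TableProperties.DEFAULT_NAME_MAPPING to nameMapping))
        .create()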
    

    With this additional information defined on the table, the column rename as done in the original question now works.
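
As a rough illustration of why the mapping makes the rename work, here is a toy model in plain Java. It is not Iceberg's implementation: NameMappingSketch, idForName, and currentName are hypothetical names, and the lookup is a simplification of what the spec describes. The assumption is that a file column without a field ID is first given an ID by finding its name in the mapping, and that ID is then resolved against the table's current schema. Returning null for an unmapped name is a modeling choice here, not a statement about any particular reader.

```java
import java.util.List;
import java.util.Map;

// Toy model only: NOT Iceberg's implementation, just the lookup idea behind a name mapping.
final class NameMappingSketch {

    // Give an ID-less file column a field ID by searching the mapping for its name.
    // Returns null if no mapping entry lists the name.
    static Integer idForName(Map<Integer, List<String>> nameMapping, String fileColumnName) {
        for (Map.Entry<Integer, List<String>> entry : nameMapping.entrySet()) {
            if (entry.getValue().contains(fileColumnName)) {
                return entry.getKey();
            }
        }
        return null;
    }

    // Translate a file column name into the table's current column name,
    // or null if the name is not covered by the mapping.
    static String currentName(Map<Integer, String> schema,
                              Map<Integer, List<String>> nameMapping,
                              String fileColumnName) {
        Integer id = idForName(nameMapping, fileColumnName);
        return id == null ? null : schema.get(id);
    }

    public static void main(String[] args) {
        // Same content as the JSON name mapping: field ID 1 is known as "old_name" in the files.
        Map<Integer, List<String>> nameMapping = Map.of(1, List.of("old_name"));
        Map<Integer, List<String>> noMapping = Map.of();

        // Table schema after the rename: field ID 1 is now called "new_name".
        Map<Integer, String> schemaAfterRename = Map.of(1, "new_name");

        System.out.println(currentName(schemaAfterRename, nameMapping, "old_name"));
        System.out.println(currentName(schemaAfterRename, noMapping, "old_name"));
    }
}
```

The point of the sketch is that the mapping keeps listing "old_name" after the rename, so the file column still leads to field ID 1, and field ID 1 is what the renamed table column refers to.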