apache-spark apache-iceberg

Why doesn't Iceberg rewriteDataFiles rewrite the files into one file?


I have an Iceberg table in S3 with 2 Parquet files that store 4 rows in total. I tried the following command:

val tables = new HadoopTables(conf);
val table = tables.load("s3://iceberg-tests-storage/data/db/test5");    
SparkActions.get(spark).rewriteDataFiles(table).option("target-file-size-bytes", "52428800").execute();

but nothing changed. What am I doing wrong?


Solution

  • A few notes:

    1. By default, Iceberg won't compact files unless a file group (within a single partition) contains at least a minimum number of small files. That minimum is the min-input-files option, which defaults to 5.
    2. Iceberg won't compact files across partitions, because each data file must belong to exactly one tuple of partition values.
      • As an example: for a table partitioned by col1 and col2, files with col1=A and col2=1 cannot be compacted with files with col1=A and col2=4.

    In your case, if you set min-input-files to 2, the two files should be compacted together, provided they belong to the same partition or the table is not partitioned.
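
As a minimal sketch of that change, the call from the question can be extended with the min-input-files option. It assumes the same `conf` and `spark` values as in the question, and that both files fall into the same partition (or that the table is unpartitioned). The `result` name and the final `println` are only illustrative.

```scala
import org.apache.iceberg.hadoop.HadoopTables
import org.apache.iceberg.spark.actions.SparkActions

// `conf` (Hadoop Configuration) and `spark` (SparkSession) are assumed
// to be set up as in the question.
val tables = new HadoopTables(conf)
val table = tables.load("s3://iceberg-tests-storage/data/db/test5")

val result = SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("target-file-size-bytes", "52428800")
  // Lower the threshold from the default of 5 so that a group of
  // only 2 small files qualifies for compaction.
  .option("min-input-files", "2")
  .execute()

// Report how many files were rewritten and how many were produced.
println(s"rewritten: ${result.rewrittenDataFilesCount()}, added: ${result.addedDataFilesCount()}")
```

If the rewritten count is still 0, the two files most likely belong to different partitions.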