apache-spark, pyspark, hdfs, streaming, apache-iceberg

Why is metadata consuming a large amount of storage, and how can I optimize it?


I'm using PySpark with Apache Iceberg on an HDFS-based data lake, and I'm encountering significant storage issues. My application ingests real-time data every second. After approximately 2 hours, I get an error indicating that storage is full. Upon investigating the HDFS folder (which stores both data and metadata), I noticed that Iceberg's metadata consumes a surprisingly large amount of storage compared to the actual data.


Here’s my Spark configuration:

import pyspark

CONF = (
    pyspark.SparkConf()
    .set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .set("spark.sql.catalog.spark_catalog.type", "hive")
    .set("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog.local.type", "hadoop")
    .set("spark.sql.catalog.local.warehouse", 'warehouse')
    .set("spark.sql.defaultCatalog", "local")
    .set("spark.driver.port", "7070")
    .set("spark.ui.port", "4040")
)

My table creation code:

def get_create_channel_data_table_query() -> str:
    return f"""
    CREATE TABLE IF NOT EXISTS local.db.{CHANNEL_DATA_TABLE} (
        uuid STRING,
        channel_set_channel_uuid STRING,
        data FLOAT,
        file_uri STRING,
        index_depth FLOAT,
        index_time BIGINT,
        creation TIMESTAMP,
        last_update TIMESTAMP
    )
    """

Inserting data into the table:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

def insert_row_into_table(spark: SparkSession, schema: StructType, table_name: str, row: tuple):
    df = spark.createDataFrame([row], schema)
    df.writeTo(f"local.db.{table_name}").append()

Problem:

Iceberg's metadata grows rapidly and takes up a huge amount of space relative to the data. I’m unsure why metadata consumption is so high.

Question:

What causes the metadata to consume so much space? Are there best practices or configurations I could apply to reduce metadata storage?


Solution

  • I got the answer to my question from here on Reddit:

    Inserting one row at a time creates a single-row snapshot/transaction for every insert. So if you are inserting thousands or millions of rows, expect each one to create a snapshot plus a Parquet file of a few KB containing just one record. The INSERT INTO table VALUES syntax will cause all kinds of issues, including the one you reported.

    I will suggest the following:

    Try creating a staging table/file/in-memory location where you can store the intermediate rows before doing a bulk upload into the Iceberg table. That way there is only one snapshot/transaction, and the underlying Parquet files will also end up close to an optimal size (e.g. the 256 MB standard). For example, write a few thousand records to a CSV file in the staging area and then load that CSV into Iceberg as a single bulk upload; since it is treated as one snapshot/transaction, this keeps your metadata folder small (see the buffering sketch after this answer).

    If you want to continue with your existing approach, Iceberg provides options to expire and delete snapshots. You might want to run the expire command after every few thousand inserted records; that way the size of your metadata folder stays in check (see the expire-snapshots sketch after this answer).

    My recommendation would be option #1.
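For reference, here is a minimal sketch of option 1, assuming rows keep arriving one at a time and a batch of a few thousand rows is acceptable before each write. The names buffer_row, flush_buffer, and BATCH_SIZE are hypothetical and not part of the original code; the batch size is an illustrative value.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

BATCH_SIZE = 5000  # assumed batch size; tune to your ingest rate and latency needs

_buffer: list = []  # rows staged in memory between flushes

def buffer_row(spark: SparkSession, schema: StructType, table_name: str, row: tuple):
    """Stage a row and flush once the buffer is large enough."""
    _buffer.append(row)
    if len(_buffer) >= BATCH_SIZE:
        flush_buffer(spark, schema, table_name)

def flush_buffer(spark: SparkSession, schema: StructType, table_name: str):
    """Write all buffered rows as one append: a single snapshot and one larger Parquet file."""
    global _buffer
    if not _buffer:
        return
    df = spark.createDataFrame(_buffer, schema)
    df.writeTo(f"local.db.{table_name}").append()
    _buffer = []

And a sketch of option 2, periodically expiring old snapshots with Iceberg's expire_snapshots Spark procedure (usable here because the Iceberg SQL extensions are enabled in the Spark config above). The retention window and retain_last value are illustrative assumptions, not recommendations from the original answer.

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

def expire_old_snapshots(spark: SparkSession, table_name: str, keep_hours: int = 1):
    """Expire snapshots older than keep_hours, keeping at least the 5 most recent."""
    cutoff = (datetime.now() - timedelta(hours=keep_hours)).strftime("%Y-%m-%d %H:%M:%S")
    spark.sql(f"""
        CALL local.system.expire_snapshots(
            table => 'db.{table_name}',
            older_than => TIMESTAMP '{cutoff}',
            retain_last => 5
        )
    """)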