apache-spark, pyspark, apache-iceberg

create iceberg partitioned table using pyspark


I want to create a partitioned Iceberg table from a PySpark DataFrame. I can see how to do this with Spark SQL DDL, but not with the PySpark DataFrame API:
https://iceberg.apache.org/docs/1.4.3/spark-ddl/#partitioned-by
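
For reference, the PARTITIONED BY section of the linked page shows the Spark SQL form (prod.db.sample is the example table name used in the docs):

    CREATE TABLE prod.db.sample (
        id bigint,
        data string,
        category string)
    USING iceberg
    PARTITIONED BY (category)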

Can someone help with this?


Solution

  • Just use the partitionBy method of the DataFrame writer.

    df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["key", "value"])

    # Write as a partitioned Iceberg table. With mode("append"), saveAsTable
    # creates the table (including the partition spec) if it does not exist yet.
    df.write \
        .format("iceberg") \
        .partitionBy("key") \
        .mode("append") \
        .saveAsTable("catalog_name.namespace.table_name")

    # Check the generated DDL:
    spark.sql("SHOW CREATE TABLE catalog_name.namespace.table_name").show(1, 1000)
    

    The output will be something like the following:

    CREATE TABLE catalog_name.namespace.table_name (\n  key BIGINT,\n  value STRING)\nUSING iceberg\nPARTITIONED BY (key)...
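
    Note that partitionBy only gives you identity partitions. If you want Iceberg's hidden-partitioning transforms (such as bucket or days), use the DataFrameWriterV2 API (df.writeTo, available since Spark 3.1), which accepts partition transform expressions. A minimal sketch, assuming an Iceberg catalog is already configured; table_name_bucketed is just an illustrative name:

    from pyspark.sql.functions import bucket, col

    # Create (or replace) an Iceberg table hash-partitioned into 16 buckets of "key".
    df.writeTo("catalog_name.namespace.table_name_bucketed") \
        .using("iceberg") \
        .partitionedBy(bucket(16, col("key"))) \
        .createOrReplace()

    # Inspect the result via Iceberg's partitions metadata table:
    spark.sql("SELECT * FROM catalog_name.namespace.table_name_bucketed.partitions").show()

    createOrReplace() issues a REPLACE TABLE if the table already exists; use create() instead if you want a plain CREATE TABLE that fails when the table is already there.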