databricks, delta-lake

Databricks PySpark: writing in Delta format with mode "overwrite" is not working properly


I have a Delta table with 180 columns at my_path. I select one column and try to overwrite the table with the following code:

    columns_to_select = ["one_column"]
    df_one_column = df.select(*columns_to_select)
    df_one_column.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(my_path)
    
    new_schema = spark.read.format("delta").load(my_path).schema
    target_column = [field.name for field in new_schema.fields]
    print(len(target_column))  # prints 180, not 1

I expected this to return 1, since I selected just one column from my DataFrame, but it returns all 180 columns.


Solution

  • You need to add option("overwriteSchema", "true") to the write. mode("overwrite") replaces the table's data but keeps its existing schema, and mergeSchema only merges new columns into that schema; it never drops columns. overwriteSchema is what replaces the schema entirely.

    Here is a full example. First, re-create the table by writing the complete DataFrame, then read it back to confirm the schema:

    df.write.format("delta").mode("overwrite").save(my_path)
    df_first = spark.read.format("delta").load(my_path)
    print(df_first.columns, len(df_first.columns))
    

    Reading the table back at this point shows the full multi-column schema. Now select a single column and overwrite with overwriteSchema enabled:

    columns_to_select = ["firstname"]
    df_one_column = df.select(*columns_to_select)
    df_one_column.write.format("delta").mode("overwrite").option("overwriteSchema", "true").option("mergeSchema", "true").save(my_path)
    df_second = spark.read.format("delta").load(my_path)
    print(df_second.columns, len(df_second.columns))
    

    This time the read returns only one column (firstname), because overwriteSchema replaced the table's schema with the schema of df_one_column.

    Please refer to the following links for more info:

    mergeSchema: https://delta.io/blog/2023-02-08-delta-lake-schema-evolution/

    overwriteSchema: https://docs.databricks.com/en/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name
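
    To see the two options side by side, here is a minimal, self-contained sketch; the /tmp/delta_demo path and the toy DataFrames are assumptions made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    demo_path = "/tmp/delta_demo"  # hypothetical path, for illustration only

    # Start with a two-column table.
    spark.createDataFrame([(1, "a")], ["id", "letter"]) \
        .write.format("delta").mode("overwrite").save(demo_path)

    # mergeSchema: the rows are replaced, but the table schema becomes the
    # union of the old and new columns, so "letter" survives (as null).
    spark.createDataFrame([(2,)], ["id"]) \
        .write.format("delta").mode("overwrite") \
        .option("mergeSchema", "true").save(demo_path)
    print(spark.read.format("delta").load(demo_path).columns)  # ['id', 'letter']

    # overwriteSchema: the table schema is replaced outright by the new
    # DataFrame's schema, so only "id" remains.
    spark.createDataFrame([(3,)], ["id"]) \
        .write.format("delta").mode("overwrite") \
        .option("overwriteSchema", "true").save(demo_path)
    print(spark.read.format("delta").load(demo_path).columns)  # ['id']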