apache-spark, databricks, delta-lake

How to update a row in a Delta table and make sure that the old version is removed?


I have a Delta table in S3 that I write to via Databricks. The demo code I have is as follows:

mount_name = '/mnt/dummy'
col_names = ["id", "name"]
rows = [
    ("id1", "name1")
]
df_output = spark.createDataFrame(rows, col_names)

# mergeSchema lets appends add new columns to the table schema;
# overwriteSchema only applies in overwrite mode and header is a CSV
# option, so neither has any effect here.
(df_output.write
    .mode("append")
    .format("delta")
    .option("overwriteSchema", "true")
    .option("mergeSchema", "true")
    .option("header", "true")
    .save(mount_name + "/dummyKey/"))

It works, and when I write a new entry, the Delta table returns all the data when I read it. However, I noticed that if I write a dataframe containing a row that already exists in my Delta table, that row is saved twice. So when reading the table, I see two rows with the same id and name values (id1 and name1).
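For illustration, reading the table back (a sketch, assuming the same path as above) shows the duplicate:

df_read = spark.read.format("delta").load(mount_name + "/dummyKey/")
df_read.show()
# +---+-----+
# | id| name|
# +---+-----+
# |id1|name1|
# |id1|name1|
# +---+-----+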

I don't want that. I want the Delta table to not store duplicates when the data I write already exists, i.e. I want to perform an update, not an insert. But how can I do this? Is there an .option() that achieves that? Or do I need to use SQL commands instead of Spark's .write()?

The reason I like using write() as illustrated in my demo code is that if an additional column shows up, it is automatically added to my Delta table, with old entries getting null values in the new column. Is there a SQL command that can achieve both, i.e. skip data that already exists and handle new columns the way write() does?


Solution

  • It seems that the above can be achieved with the MERGE INTO SQL command, as outlined in the documentation (a sketch follows below):
    https://docs.databricks.com/aws/en/sql/language-manual/delta-merge-into
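For reference, here is a minimal sketch of the same merge done through the Delta Lake Python API (the Python equivalent of MERGE INTO), reusing mount_name and df_output from the demo code above and assuming id is the unique key; the autoMerge setting is what lets new columns flow into the table the way mergeSchema does for write():

from delta.tables import DeltaTable

# Allow MERGE to add columns present in the source but missing from the
# target table, mirroring mergeSchema on write().
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forPath(spark, mount_name + "/dummyKey/")

(target.alias("t")
    .merge(df_output.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()       # a row with this id exists: update it
    .whenNotMatchedInsertAll()    # no row with this id yet: insert it
    .execute())

As for the "old version is removed" part of the title: an update does not physically delete the previous row version. Delta keeps it for time travel until the underlying data files are cleaned up with VACUUM after the retention period.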