I have a Delta table in S3 that I write to via Databricks. The demo code I have is as follows:
mount_name = "/mnt/dummy"
col_names = ["id", "name"]
rows = [("id1", "name1")]
df_output = spark.createDataFrame(rows, col_names)

# mergeSchema lets appends add new columns to the table schema.
# (overwriteSchema only applies to overwrite mode, and header is a
# CSV option, so neither has any effect on a Delta append.)
(df_output.write
    .mode("append")
    .format("delta")
    .option("mergeSchema", "true")
    .save(mount_name + "/dummyKey/"))
It works, and when I write a new entry, the Delta table returns all data when I read it back. However, I noticed that if I write a dataframe containing a row that already exists in my Delta table, that row gets saved twice: reading the table then returns two rows with the same id and name values (id1 and name1).
I don't want that. I want the Delta table to not store duplicates when data I write already exists, i.e. I want to perform an update rather than an insert. But how can I do this? Is there an .option() that achieves that, or do I need to use SQL commands instead of Spark's .write()?
The reason I like using write() as illustrated in my demo code is that if a new column appears, it is automatically added to my Delta table, with old entries getting null values in the new column. Is there a SQL command that can achieve both, i.e. not duplicate existing data and handle new schema the way write() does?
It seems that the above can be achieved using the MERGE INTO SQL command, as outlined in the documentation:
https://docs.databricks.com/aws/en/sql/language-manual/delta-merge-into
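A rough sketch of what that looks like for my table, again assuming id is the match key ("source_updates" is just a temp view name made up for this sketch). The autoMerge setting should make MERGE evolve the schema the way write() with mergeSchema does, filling new columns with nulls for old rows:

# Register the incoming dataframe so SQL can reference it.
df_output.createOrReplaceTempView("source_updates")

# Let MERGE add new source columns to the target table,
# mirroring the mergeSchema behaviour of write().
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

spark.sql("""
    MERGE INTO delta.`/mnt/dummy/dummyKey/` AS target
    USING source_updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")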