architecture bigdata databricks delta-lake data-lakehouse

Do upserts on Delta simply duplicate data?


I'm fairly new to Delta and the lakehouse on Databricks. I have some questions, based on the following actions:

Does this mean Delta simply duplicates data for every new version?

How is this scalable? Or am I missing something?


Solution

  • Yes, that's how Delta Lake works: when you modify data, it doesn't write only the changed rows (the delta). Instead, it takes each original file affected by the change, applies the changes, and writes the result out as a new file. But note that not all data is duplicated, only the data in the files that contain affected rows. For example, say you have 3 data files and you change some rows that are in the 2nd file. In this case, Delta will create a new file, number 4, that contains the changed rows plus the rest of the data from file 2, so you will have the following versions:
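
To make the 3-file example concrete, here is a minimal sketch in plain Python. It is only a toy model of the rewrite behavior described above, not the Delta Lake API: the function `apply_update`, the `files` and `versions` dictionaries, and the row layout are all hypothetical names made up for illustration.

```python
# Toy model of copy-on-write file rewriting. This is NOT the Delta Lake API;
# all names below are hypothetical and exist only for illustration.

def apply_update(files, snapshot, matches, change):
    """Rewrite every file in `snapshot` that holds a matching row.

    files:    dict of file number -> list of rows (dicts); new files are added to it
    snapshot: list of file numbers that make up the current table version
    Returns the list of file numbers that make up the next version.
    """
    next_file = max(files) + 1
    new_snapshot = []
    for file_no in snapshot:
        rows = files[file_no]
        if any(matches(r) for r in rows):
            # Affected file: copy all of its rows into a new file,
            # changing only the rows that match.
            files[next_file] = [change(r) if matches(r) else r for r in rows]
            new_snapshot.append(next_file)
            next_file += 1
        else:
            # Untouched file: the new version simply references it again.
            new_snapshot.append(file_no)
    return sorted(new_snapshot)


# Three data files with two rows each (assumed sample data).
files = {
    1: [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}],
    2: [{"id": 3, "value": "c"}, {"id": 4, "value": "d"}],
    3: [{"id": 5, "value": "e"}, {"id": 6, "value": "f"}],
}
versions = {0: [1, 2, 3]}

# Change one row that lives in file 2.
versions[1] = apply_update(
    files,
    versions[0],
    matches=lambda r: r["id"] == 4,
    change=lambda r: {**r, "value": "changed"},
)

print(versions)   # {0: [1, 2, 3], 1: [1, 3, 4]}
print(files[4])   # [{'id': 3, 'value': 'c'}, {'id': 4, 'value': 'changed'}]
```

In this sketch only file 2 is rewritten; files 1 and 3 are shared by both versions, so the amount of copied data depends on how many files hold affected rows, not on the size of the whole table. File 2 is still present in `files`, which is what lets version 0 remain readable.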