When setting up a file-based sync in Data Connection, I see there are a few different options for 'Transaction Type'. What's the difference between them? When might I use them?
From the Foundry docs:
The way dataset files are modified in a transaction depends on the transaction type. There are four possible transaction types: SNAPSHOT, APPEND, UPDATE, and DELETE.
SNAPSHOT
A SNAPSHOT transaction replaces the current view of the dataset with a completely new set of files.
SNAPSHOT transactions are the simplest transaction type, and are the basis of batch pipelines.
APPEND
An APPEND transaction adds new files to the current dataset view.
An APPEND transaction cannot modify existing files in the current dataset view. If an APPEND transaction is opened and existing files are overwritten, then attempting to commit the transaction will fail.
APPEND transactions are the basis of incremental pipelines. By only syncing new data into Foundry and only processing this new data throughout the pipeline, changes to large datasets can be processed end-to-end in a performant way. However, building and maintaining incremental pipelines comes with additional complexity. See the Foundry documentation on incremental pipelines to learn more.
UPDATE
An UPDATE transaction, like an APPEND, adds new files to a dataset view, but may also overwrite the contents of existing files.
DELETE
A DELETE transaction removes files that are in the current dataset view.
Note that committing a DELETE transaction does not delete the underlying file from the backing file system; it simply removes the file reference from the dataset view.
In practice, DELETE transactions are mostly used to enable data retention workflows. By deleting files on a dataset based on a retention policy (typically based on the age of the file), data can be removed from Foundry, both to minimize storage costs and to comply with data governance requirements.
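To make the four definitions above concrete, here is a minimal sketch that treats a dataset view as a plain dictionary of file paths. This is a toy model for illustration only, not Foundry's API: `View` and `apply_transaction` are hypothetical names, and the real system tracks file references rather than in-memory contents.

```python
from typing import Dict

View = Dict[str, str]  # hypothetical model: file path -> file contents


def apply_transaction(view: View, tx_type: str, files: View) -> View:
    """Return the new dataset view after committing one transaction (toy model)."""
    if tx_type == "SNAPSHOT":
        # Replace the whole view with the new set of files.
        return dict(files)
    if tx_type == "APPEND":
        # Only new paths are allowed; touching an existing path fails the commit.
        overwritten = sorted(set(view) & set(files))
        if overwritten:
            raise ValueError(f"APPEND cannot modify existing files: {overwritten}")
        return {**view, **files}
    if tx_type == "UPDATE":
        # New paths are added and existing paths may be overwritten.
        return {**view, **files}
    if tx_type == "DELETE":
        # Only the references are dropped from the view.
        return {path: data for path, data in view.items() if path not in files}
    raise ValueError(f"unknown transaction type: {tx_type}")


view = apply_transaction({}, "SNAPSHOT", {"a.csv": "v1", "b.csv": "v1"})
view = apply_transaction(view, "APPEND", {"c.csv": "v1"})
view = apply_transaction(view, "UPDATE", {"a.csv": "v2", "d.csv": "v1"})
view = apply_transaction(view, "DELETE", {"b.csv": ""})
print(sorted(view.items()))
```

In this model, trying to APPEND a file named `a.csv` at the end would raise an error, while the same write as an UPDATE would succeed.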
Data Connection doesn't let you create a sync with a DELETE transaction type, because a sync that purely deletes data doesn't really make sense! If you'd like to remove data from your synced dataset, you can use a SNAPSHOT transaction that leaves those files out, but note that previous versions of the dataset will still include them.
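The point about previous versions can be sketched with a toy transaction history. Everything here is hypothetical (`Transaction`, `view_at`, the file names) and only models the idea that each view is rebuilt from the transactions up to that point; it is not how you would query Foundry.

```python
from typing import Dict, List, Tuple

Transaction = Tuple[str, Dict[str, str]]  # (transaction type, files written)


def view_at(history: List[Transaction], index: int) -> Dict[str, str]:
    """Rebuild the dataset view as of history[index] (toy model)."""
    view: Dict[str, str] = {}
    for tx_type, files in history[: index + 1]:
        if tx_type == "SNAPSHOT":
            view = dict(files)  # a SNAPSHOT starts a fresh view
        else:
            view.update(files)  # APPEND or UPDATE, simplified for this sketch
    return view


history = [
    ("SNAPSHOT", {"2023.csv": "old", "2024.csv": "new"}),
    ("SNAPSHOT", {"2024.csv": "new"}),  # 2023.csv is left out on purpose
]
print(sorted(view_at(history, 1)))  # current view
print(sorted(view_at(history, 0)))  # previous view
```

The current view no longer lists `2023.csv`, but the view as of the first transaction still does, which is why a SNAPSHOT alone is not a retention mechanism.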
You can combine an APPEND or UPDATE transaction type with file-based sync filters to only ingest the newly changed files on each run of your sync.
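As a rough sketch of the idea behind such a filter, assuming the sync can see each source file's last-modified time: the function below keeps only files that are new or have changed since the previous run. The names (`changed_files`, `synced`) and the timestamps are made up for illustration; the actual filters are configured in the Data Connection UI rather than written as code.

```python
from typing import Dict


def changed_files(listing: Dict[str, int], synced: Dict[str, int]) -> Dict[str, int]:
    """Keep files whose last-modified time differs from what was last synced."""
    return {path: mtime for path, mtime in listing.items() if synced.get(path) != mtime}


synced: Dict[str, int] = {}  # path -> last-modified time seen by earlier runs

# First run: nothing has been synced yet, so every file is picked up.
first_run = changed_files({"jan.csv": 100, "feb.csv": 200}, synced)
synced.update(first_run)

# Second run: feb.csv was modified at the source and mar.csv is new.
second_run = changed_files({"jan.csv": 100, "feb.csv": 250, "mar.csv": 300}, synced)
synced.update(second_run)

print(sorted(first_run))
print(sorted(second_run))
```

Note that the second run rewrites `feb.csv`, which already exists in the dataset, so that batch would need an UPDATE transaction; an APPEND only works if the source never modifies files it has already delivered.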