If I have a Magritte ingestion that is set to append, will it detect if rows are deleted in the source data? Will it also delete the rows in the ingested dataset?
For your first question on if deletions are detected, this will depend on the database implementation you are extracting from (I'll assume this is JDBC for this answer). If this shows up as a modification and therefore a new row, then yes your deletes will show up.
This would look something like the following at first:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | CREATE | 0 |
| key_2 | 2 | CREATE | 0 |
| key_3 | 3 | CREATE | 0 |
Followed by some updates (inside a subsequent run, incremental on update_ts
:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | UPDATE | 1 |
| key_2 | 2 | UPDATE | 1 |
Now your database would have to explicitly mark any DELETE
rows and increment the update_ts for this to be brought in:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | DELETE | 2 |
After this, you would then be able to detect the deleted records and adjust accordingly. Your full materialized table view will now look like the following:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | CREATE | 0 |
| key_2 | 2 | CREATE | 0 |
| key_3 | 3 | CREATE | 0 |
| key_1 | 1 | UPDATE | 1 |
| key_2 | 2 | UPDATE | 1 |
| key_1 | 1 | DELETE | 2 |
If you are running incrementally in your raw
ingestion, these rows will not be automatically deleted from your dataset; you'll have to explicitly write logic to detect these deleted records and remove them from your output clean
step. If these deletes are found, you'll have to SNAPSHOT
the output the remove them (unless you're doing lower-level file manipulations where you could remove the underlying file perhaps).
It's worth noting you'll want to materialize the DELETES
as late as possible (assuming your intermediate logic allows for it) since this will require a snapshot and will kill your overall pipeline performance.
If you aren't dealing with JDBC, then @Kellen's answer will apply.