Tags: palantir-foundry, foundry-data-connection

How does Foundry Magritte append ingestion handle deleted rows in the data source?


If I have a Magritte ingestion that is set to append, will it detect rows that are deleted in the source data? And will it also delete those rows in the ingested dataset?


Solution

  • For your first question, on whether deletions are detected: this will depend on the database implementation you are extracting from (I'll assume JDBC for this answer). If a delete shows up as a modification, and therefore as a new row, then yes, your deletes will show up.

    This would look something like the following at first:

    | primary_key | val | update_type | update_ts |
    |-------------|-----|-------------|-----------|
    | key_1       | 1   | CREATE      | 0         |
    | key_2       | 2   | CREATE      | 0         |
    | key_3       | 3   | CREATE      | 0         |
    

    Followed by some updates (in a subsequent run, incremental on update_ts):

    | primary_key | val | update_type | update_ts |
    |-------------|-----|-------------|-----------|
    | key_1       | 1   | UPDATE      | 1         |
    | key_2       | 2   | UPDATE      | 1         |
    
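    Conceptually, the incremental read only picks up rows whose update_ts is greater than the high-water mark from the previous run. Here's a minimal Python sketch of that filter, using the sample rows above (the variable names are just for illustration):

    ```python
    # High-water mark persisted from the previous incremental run.
    last_seen_ts = 0

    # Rows currently visible in the source table.
    source_rows = [
        {"primary_key": "key_1", "val": 1, "update_type": "UPDATE", "update_ts": 1},
        {"primary_key": "key_2", "val": 2, "update_type": "UPDATE", "update_ts": 1},
    ]

    # Only rows newer than the high-water mark get appended to the output dataset.
    new_rows = [r for r in source_rows if r["update_ts"] > last_seen_ts]
    ```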

    Now, for a delete to be brought in, your database would have to explicitly mark the DELETE rows and increment the update_ts:

    | primary_key | val | update_type | update_ts |
    |-------------|-----|-------------|-----------|
    | key_1       | 1   | DELETE      | 2         |
    

    After this, you would be able to detect the deleted records and adjust accordingly. Your fully materialized table view will now look like the following:

    | primary_key | val | update_type | update_ts |
    |-------------|-----|-------------|-----------|
    | key_1       | 1   | CREATE      | 0         |
    | key_2       | 2   | CREATE      | 0         |
    | key_3       | 3   | CREATE      | 0         |
    | key_1       | 1   | UPDATE      | 1         |
    | key_2       | 2   | UPDATE      | 1         |
    | key_1       | 1   | DELETE      | 2         |
    

    If you are running your raw ingestion incrementally, these rows will not be automatically deleted from your dataset; you'll have to explicitly write logic to detect these deleted records and remove them in your downstream clean step. If deletes are found, you'll have to SNAPSHOT the output to remove them (unless you're doing lower-level file manipulations, where you could perhaps remove the underlying files).
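    As a sketch of what that clean step could look like (assuming PySpark and the column names from the example above), you could keep the latest row per primary key and drop any key whose latest state is a DELETE:

    ```python
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # The appended raw dataset, as in the materialized view above.
    raw = spark.createDataFrame(
        [
            ("key_1", 1, "CREATE", 0),
            ("key_2", 2, "CREATE", 0),
            ("key_3", 3, "CREATE", 0),
            ("key_1", 1, "UPDATE", 1),
            ("key_2", 2, "UPDATE", 1),
            ("key_1", 1, "DELETE", 2),
        ],
        ["primary_key", "val", "update_type", "update_ts"],
    )

    # Rank rows within each primary key by recency.
    w = Window.partitionBy("primary_key").orderBy(F.col("update_ts").desc())

    current = (
        raw.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)                  # latest row per key
        .filter(F.col("update_type") != "DELETE")  # drop deleted keys
        .drop("rn")
    )
    # current now contains key_2 and key_3; key_1's latest record is a DELETE.
    ```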

    It's worth noting that you'll want to materialize the DELETEs as late as possible in your pipeline (assuming your intermediate logic allows for it), since doing so requires a snapshot, which will kill your overall pipeline performance.
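    To make that concrete, here's a hypothetical sketch using Foundry's transforms API (the dataset paths are made up): intermediate steps stay incremental and simply pass the DELETE markers through, so only the final materialization pays the snapshot cost when it applies the delete-resolution logic above.

    ```python
    from transforms.api import transform_df, Input, Output, incremental

    # Intermediate step: stays incremental by passing DELETE markers through
    # untouched rather than resolving them here. Paths are hypothetical.
    @incremental()
    @transform_df(
        Output("/project/intermediate/changelog_enriched"),
        source=Input("/project/raw/changelog"),
    )
    def enrich(source):
        # Any per-row enrichment goes here; DELETE rows flow through
        # like any other row, so the transform can run incrementally.
        return source
    ```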

    If you aren't dealing with JDBC, then @Kellen's answer will apply.