palantir-foundry, data-lineage

Automatically updating a dataset on Palantir Foundry when any of the immediate upstream resources updates


We have many different datasets produced by various transforms. Our goal is to ensure each dataset is updated whenever one of its immediate upstream datasets (the transform inputs) changes. Since we have such a high number of datasets, and the interconnections can also change, manually maintaining all of these pipeline schedules would be a high effort that we would like to avoid.

We are wondering if there is a way to specify, on a dataset or transform, that the input resources should be monitored and, if any of them has updated or changed, the transform should automatically run to update the output dataset. Of course this is possible by manually maintaining pipeline schedules on the data lineage, but we were looking for an approach with less manual effort. Maybe there exists an annotation for the transform (e.g. a hypothetical @transform_upstream) or a flag on the dataset in the lineage we can set? We are aware of the option of creating a separate build schedule and adding all the input datasets to an advanced build rule (however, this feels somewhat redundant).


To give a specific example, let's imagine we want to update "flight_alerts_clean" whenever one of its immediate inputs (priority_mapping_preprocessed, status_mapping_preprocessed, flight_alerts_preprocessed) updates, without having to hardcode any of these in a build schedule.
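The desired trigger semantics for this example can be sketched in plain Python (this is only an illustration of the behavior we are asking for, not a Foundry API; the timestamps are hypothetical):

```python
from datetime import datetime

def should_rebuild(output_updated_at: datetime,
                   input_updated_at: dict) -> bool:
    """Desired semantics: rebuild the output as soon as ANY immediate
    input has been updated more recently than the output itself."""
    return any(ts > output_updated_at for ts in input_updated_at.values())

# Hypothetical last-update times for the flight_alerts_clean inputs:
inputs = {
    "priority_mapping_preprocessed": datetime(2023, 5, 1, 12, 0),
    "status_mapping_preprocessed":   datetime(2023, 5, 1, 9, 0),
    "flight_alerts_preprocessed":    datetime(2023, 5, 1, 8, 0),
}

# flight_alerts_clean was last built at 10:00, so the 12:00 update
# to priority_mapping_preprocessed should trigger a rebuild.
print(should_rebuild(datetime(2023, 5, 1, 10, 0), inputs))  # True
```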

Thank you very much for your support


Solution

  • From the original question it is unclear which type of schedule the OP has adopted.

    There is no automatic input detection and update for schedules. Nonetheless, I would suggest adopting connecting builds.

    In a connecting build you mark resources with the following roles: inputs only provide data, while triggers decide the build's starting condition (you can configure Input + Trigger on any desired RID).

    All intermediate datasets will be automatically marked as "Will attempt to build", and are therefore triggered automatically as part of the flow.
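The "Will attempt to build" rule above can be sketched as a small graph computation: every dataset lying on a path from a trigger to a target gets built. This is only a sketch of the semantics (the lineage edges below are hypothetical, loosely based on the flight_alerts example), not Foundry's actual implementation:

```python
def will_attempt_to_build(edges, triggers, targets):
    """Return the intermediate datasets a connecting build would mark
    'Will attempt to build': nodes on some trigger -> target path."""
    def reachable(starts, graph):
        # Simple depth-first traversal from a set of start nodes.
        seen, stack = set(starts), list(starts)
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    downstream = reachable(triggers, edges)        # forward from triggers
    reversed_edges = {}
    for src, dsts in edges.items():
        for dst in dsts:
            reversed_edges.setdefault(dst, []).append(src)
    upstream = reachable(targets, reversed_edges)  # backward from targets
    # On a trigger->target path, excluding the triggers/targets themselves.
    return (downstream & upstream) - set(triggers) - set(targets)

# Hypothetical lineage: a raw dataset feeds the preprocessed stage,
# which (together with the mapping datasets) feeds flight_alerts_clean.
edges = {
    "flights_raw": ["flight_alerts_preprocessed"],
    "flight_alerts_preprocessed": ["flight_alerts_clean"],
    "priority_mapping_preprocessed": ["flight_alerts_clean"],
    "status_mapping_preprocessed": ["flight_alerts_clean"],
}
print(will_attempt_to_build(edges,
                            triggers={"flights_raw"},
                            targets={"flight_alerts_clean"}))
# -> {'flight_alerts_preprocessed'}
```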

    From there, your only remaining burden is updating the schedule to identify the triggering datasets whenever you modify your codebase.