palantir-foundry data-pipeline foundry-code-repositories

Resource Allocation for Incremental Pipelines


There are times when an incremental pipeline in Palantir Foundry has to be built as a snapshot. If the data size is large, we increase the resources for the build to reduce run time, and then remove that configuration after the first snapshot run. Is there a way to set the configuration conditionally? That is, if the pipeline is running in incremental mode, use the default resource allocation; otherwise, use the specified set of resources.

Example: If the pipeline runs as a snapshot transaction, the configuration below has to be applied:

@configure(profile=["NUM_EXECUTORS_8", "EXECUTOR_MEMORY_MEDIUM", "DRIVER_MEMORY_MEDIUM"]) 

If incremental, then the default one.


Solution

  • The @configure and @incremental decorators are evaluated during CI execution, while the actual code inside the function annotated by @transform_df or @transform runs at build time.

    This means that you can't programmatically switch between them after CI has passed. What you can do, however, is keep a constant or configuration value in your repo and change it at code level whenever you want to switch modes. Please make sure you understand how semantic versioning works before attempting this (as far as I know, changing semantic_version makes the next incremental build run as a snapshot). For example:

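    from transforms.api import Input, Output, configure, incremental, transform_df

    # Flip this constant and commit to switch modes; it is read when CI evaluates the module.
    IS_INCREMENTAL = True
    SEMANTIC_VERSION = 1


    def mytransform(input1, input2):
        # Shared logic, so both branches produce the same output.
        return input1.join(input2, "foo", "left")


    if IS_INCREMENTAL:
        # Incremental run: default resource allocation.
        @incremental(semantic_version=SEMANTIC_VERSION)
        @transform_df(
            Output("foo"),
            input1=Input("bar"),
            input2=Input("foobar"))
        def compute(input1, input2):
            return mytransform(input1, input2)
    else:
        # Snapshot run: request the larger Spark profiles.
        @configure(profile=["NUM_EXECUTORS_8", "EXECUTOR_MEMORY_MEDIUM", "DRIVER_MEMORY_MEDIUM"])
        @transform_df(
            Output("foo"),
            input1=Input("bar"),
            input2=Input("foobar"))
        def compute(input1, input2):
            return mytransform(input1, input2)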
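
    If duplicating the @transform_df stack in both branches bothers you, the same switch can probably be written with a small helper that applies a decorator only when a condition holds. The sketch below is plain Python and runs outside Foundry: apply_if and tag are hypothetical names, and tag is only a stand-in for @incremental / @configure so the effect is visible. I have not verified this pattern inside a Foundry repo, so treat it as an assumption that configure(...) and incremental(...) can be passed around like this.

```python
# Module-level switch, as in the answer above; it is read once, at import time.
IS_INCREMENTAL = False


def apply_if(condition, decorator):
    """Return `decorator` when `condition` is true, else a no-op decorator."""
    return decorator if condition else (lambda func: func)


def tag(label):
    """Stand-in for @incremental / @configure: records its label on the function."""
    def decorator(func):
        func.tags = getattr(func, "tags", []) + [label]
        return func
    return decorator


# Both lines are evaluated when the module is imported, not when compute() is called.
@apply_if(IS_INCREMENTAL, tag("incremental"))
@apply_if(not IS_INCREMENTAL, tag("configure"))
def compute(input1, input2):
    return input1 + input2


print(compute.tags)
```

    In a real repo you would replace tag("incremental") with incremental(semantic_version=SEMANTIC_VERSION), replace tag("configure") with configure(profile=[...]), and keep @transform_df directly above the function. The decision is still made when the module is evaluated, so switching modes still requires a commit.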