For my workflow I need to repeat a certain set of rules multiple times; the idea is that with each iteration the output gets better and better.
I had issues with adding new jobs to the DAG (iterating again if the output is not satisfactory), so I went for having all jobs scheduled up to the maximum number of iterations and trying to bypass the execution of jobs by creating their outputs.
The issue is that when I want to stop iterating (for example, when I know the result cannot get any better), I use 'touch' on all future outputs inside the shell block of a checkpoint, so that after the DAG re-evaluation Snakemake can skip the rules whose outputs now exist and jump right to the end.
I tried many failed ways to resolve this conditional iteration, which Snakemake is clearly not made for (adding 'ancient' to all inputs, using '--touch'). So far, what has worked best for me is the following example.
We use fake rules that just move files around. The idea is that rule 'test_4' produces the end file of a given iteration, which the next iteration then uses as its starting point, so that for any step 'n':
1(n-1) -> 2(n-1) -> 3a(n-1) -> 3b(n-1) -> 4(n-1) -> 1(n) -> 2(n) -> 3a(n) -> 3b(n) -> 4(n) -> ...
# Example of a loop workflow aiming to produce 3 iterations, stopping after the checkpoint.
# For testing, create the following directory structure where example.smk is:
## mkdir -p "testing/init" ; touch "testing/init/start.done"

rule all:
    input:
        "testing/step_2/end.done"

def get_input(wildcards):
    """
    For iteration 0, use the starting file. For subsequent iterations, use
    the end file from the previous iteration.
    """
    iteration_num = int(wildcards.iteration)
    if iteration_num == 0:
        return "testing/init/start.done"
    else:
        # iteration - 1 to use the end file from the previous iteration loop
        return f"testing/step_{iteration_num - 1}/end.done"

rule test_1:
    input:
        get_input
    output:
        "testing/step_{iteration}/1.done"
    shell:
        "cp {input} {output}"

# A checkpoint updates the DAG of jobs after completion.
checkpoint test_2:
    input:
        "testing/step_{iteration}/1.done"
    output:
        "testing/step_{iteration}/2.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        # Here we touch the outputs of rules 3A and 3B in order to bypass them
        touch "testing/step_{wildcards.iteration}/3.done"
        touch "testing/step_{wildcards.iteration}/3b.done"
        """

rule test_3_AAAAA:
    input:
        "testing/step_{iteration}/2.done"
    output:
        "testing/step_{iteration}/3.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        """

rule test_3_BBBBB:
    input:
        "testing/step_{iteration}/3.done"
    output:
        "testing/step_{iteration}/3b.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        """

rule test_4:
    input:
        "testing/step_{iteration}/3b.done"
    output:
        "testing/step_{iteration}/end.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        """
But upon execution, rule 3A is correctly skipped, while 3B is rerun with the reason "Input files updated by another job: testing/step_0/3.done".
See detailed log below:
# FYI: $ snakemake --version
# >>> 9.1.1
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
------------ -------
all 1
test_1 3
test_2 3
test_3_AAAAA 3
test_3_BBBBB 3
test_4 3
total 16
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:10 2025]
localrule test_1:
input: testing/init/start.done
output: testing/step_0/1.done
jobid: 15
reason: Missing output files: testing/step_0/1.done
wildcards: iteration=0
resources: tmpdir=/tmp
Waiting for more resources.
[Mon Aug 4 14:22:10 2025]
Finished jobid: 15 (Rule: test_1)
1 of 16 steps (6%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:10 2025]
localcheckpoint test_2:
input: testing/step_0/1.done
output: testing/step_0/2.done
jobid: 14
reason: Missing output files: testing/step_0/2.done; Input files updated by another job: testing/step_0/1.done
wildcards: iteration=0
resources: tmpdir=/tmp
DAG of jobs will be updated after completion.
Waiting for more resources.
[Mon Aug 4 14:22:11 2025]
Finished jobid: 14 (Rule: test_2)
2 of 16 steps (12%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:11 2025]
localrule test_3_BBBBB:
input: testing/step_0/3.done
output: testing/step_0/3b.done
jobid: 12
reason: Missing output files: testing/step_0/3b.done; Input files updated by another job: testing/step_0/3.done
wildcards: iteration=0
resources: tmpdir=/tmp
Warning: the following output files of rule test_3_BBBBB were not present when the DAG was created:
{'testing/step_0/3b.done'}
Waiting for more resources.
[Mon Aug 4 14:22:11 2025]
Finished jobid: 12 (Rule: test_3_BBBBB)
3 of 15 steps (20%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:11 2025]
localrule test_4:
input: testing/step_0/3b.done
output: testing/step_0/end.done
jobid: 11
reason: Missing output files: testing/step_0/end.done; Input files updated by another job: testing/step_0/3b.done
wildcards: iteration=0
resources: tmpdir=/tmp
Waiting for more resources.
[Mon Aug 4 14:22:11 2025]
Finished jobid: 11 (Rule: test_4)
4 of 15 steps (27%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:11 2025]
localrule test_1:
input: testing/step_0/end.done
output: testing/step_1/1.done
jobid: 10
reason: Missing output files: testing/step_1/1.done; Input files updated by another job: testing/step_0/end.done
wildcards: iteration=1
resources: tmpdir=/tmp
Waiting for more resources.
[Mon Aug 4 14:22:12 2025]
Finished jobid: 10 (Rule: test_1)
5 of 15 steps (33%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:12 2025]
localcheckpoint test_2:
input: testing/step_1/1.done
output: testing/step_1/2.done
jobid: 9
reason: Missing output files: testing/step_1/2.done; Input files updated by another job: testing/step_1/1.done
wildcards: iteration=1
resources: tmpdir=/tmp
DAG of jobs will be updated after completion.
Waiting for more resources.
[Mon Aug 4 14:22:12 2025]
Finished jobid: 9 (Rule: test_2)
6 of 15 steps (40%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:12 2025]
localrule test_3_BBBBB:
input: testing/step_1/3.done
output: testing/step_1/3b.done
jobid: 7
reason: Missing output files: testing/step_1/3b.done; Input files updated by another job: testing/step_1/3.done
wildcards: iteration=1
resources: tmpdir=/tmp
Warning: the following output files of rule test_3_BBBBB were not present when the DAG was created:
{'testing/step_1/3b.done'}
Waiting for more resources.
[Mon Aug 4 14:22:12 2025]
Finished jobid: 7 (Rule: test_3_BBBBB)
7 of 14 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:12 2025]
localrule test_4:
input: testing/step_1/3b.done
output: testing/step_1/end.done
jobid: 6
reason: Missing output files: testing/step_1/end.done; Input files updated by another job: testing/step_1/3b.done
wildcards: iteration=1
resources: tmpdir=/tmp
Waiting for more resources.
[Mon Aug 4 14:22:12 2025]
Finished jobid: 6 (Rule: test_4)
8 of 14 steps (57%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:12 2025]
localrule test_1:
input: testing/step_1/end.done
output: testing/step_2/1.done
jobid: 5
reason: Missing output files: testing/step_2/1.done; Input files updated by another job: testing/step_1/end.done
wildcards: iteration=2
resources: tmpdir=/tmp
Waiting for more resources.
[Mon Aug 4 14:22:13 2025]
Finished jobid: 5 (Rule: test_1)
9 of 14 steps (64%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:13 2025]
localcheckpoint test_2:
input: testing/step_2/1.done
output: testing/step_2/2.done
jobid: 4
reason: Missing output files: testing/step_2/2.done; Input files updated by another job: testing/step_2/1.done
wildcards: iteration=2
resources: tmpdir=/tmp
DAG of jobs will be updated after completion.
Waiting for more resources.
[Mon Aug 4 14:22:13 2025]
Finished jobid: 4 (Rule: test_2)
10 of 14 steps (71%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:13 2025]
localrule test_3_BBBBB:
input: testing/step_2/3.done
output: testing/step_2/3b.done
jobid: 2
reason: Missing output files: testing/step_2/3b.done; Input files updated by another job: testing/step_2/3.done
wildcards: iteration=2
resources: tmpdir=/tmp
Warning: the following output files of rule test_3_BBBBB were not present when the DAG was created:
{'testing/step_2/3b.done'}
Waiting for more resources.
[Mon Aug 4 14:22:13 2025]
Finished jobid: 2 (Rule: test_3_BBBBB)
11 of 13 steps (85%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:13 2025]
localrule test_4:
input: testing/step_2/3b.done
output: testing/step_2/end.done
jobid: 1
reason: Missing output files: testing/step_2/end.done; Input files updated by another job: testing/step_2/3b.done
wildcards: iteration=2
resources: tmpdir=/tmp
Waiting for more resources.
[Mon Aug 4 14:22:13 2025]
Finished jobid: 1 (Rule: test_4)
12 of 13 steps (92%) done
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 4 14:22:13 2025]
localrule all:
input: testing/step_2/end.done
jobid: 0
reason: Input files updated by another job: testing/step_2/end.done
resources: tmpdir=/tmp
Waiting for more resources.
[Mon Aug 4 14:22:13 2025]
Finished jobid: 0 (Rule: all)
13 of 13 steps (100%) done
I also get the warning "Warning: the following output files of rule test_3_BBBBB were not present when the DAG was created: {'testing/step_0/3b.done'}", which I don't quite get, since none of the other outputs were there either, yet Snakemake only seems to have an issue with that one and not with 'end.done', for example.
So, does anyone have an idea how to do this kind of iteration? Or, in this case, how to avoid any run of a rule unless its output is missing, and for that reason only?
I'm considering having a script run each iteration so I can stop it when I want, but that bypasses the whole idea of letting Snakemake manage resources and jobs. That would be a last-resort option.
Thanks to Bli's answer I was able to rework my code, and it now looks something along the lines of:
# Example of a loop workflow aiming to produce 3 iterations, stopping after the checkpoint.
# For testing, create the following directory structure where example.smk is:
## mkdir -p "testing/init" ; touch "testing/init/start.done"

rule all:
    input:
        "testing/step_2/end.done"

def get_input(wildcards):
    """
    For iteration 0, use the starting file. For subsequent iterations, use
    the end file from the previous iteration.
    """
    iteration_num = int(wildcards.iteration)
    if iteration_num == 0:
        return "testing/init/start.done"
    else:
        # iteration - 1 to use the end file from the previous iteration loop
        return f"testing/step_{iteration_num - 1}/end.done"

rule test_1:
    input:
        get_input
    output:
        "testing/step_{iteration}/1.done"
    shell:
        "cp {input} {output}"

# A checkpoint updates the DAG of jobs after completion.
checkpoint test_2:
    input:
        "testing/step_{iteration}/1.done"
    output:
        "testing/step_{iteration}/2.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        """

rule test_3_AAAAA:
    input:
        "testing/step_{iteration}/2.done"
    output:
        "testing/step_{iteration}/3.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        """

rule test_3_BBBBB:
    input:
        "testing/step_{iteration}/3.done"
    output:
        "testing/step_{iteration}/3b.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        """

def get_checkpoint_result(wildcards):
    checkpoint_out = checkpoints.test_2.get(iteration=wildcards.iteration).output[0]
    if int(wildcards.iteration) == 0:
        return f"testing/step_{wildcards.iteration}/3b.done"
    else:
        bypass_iteration = True  # would be a function evaluating whether the rules need to be computed
        if bypass_iteration:
            return checkpoint_out
        else:
            return f"testing/step_{wildcards.iteration}/3b.done"

rule test_4:
    input:
        get_checkpoint_result
    output:
        "testing/step_{iteration}/end.done"
    threads: 1
    shell:
        """
        cp {input} {output}
        """
When the DAG is computed, the only things taken into account are the "official" inputs and outputs of the rules.
When you do the following in checkpoint "test_2":
    # Here we touch the output of rule 3A and 3B in order to bypass them
    touch "testing/step_{wildcards.iteration}/3.done"
    touch "testing/step_{wildcards.iteration}/3b.done"
Snakemake doesn't know that "test_2" can provide such an output, because it only lists "testing/step_{iteration}/2.done"
as output.
Also, I don't see any rule using an input function checking the output of this checkpoint, so I'm not surprised that it doesn't behave as you hoped.
If I understand the documentation (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution), one should use an official output file of the checkpoint to compute the required input of a downstream "collecting" rule.
Therefore, one suggestion would be to add an extra output to your checkpoint, containing information enabling the downstream rule to know whether it only needs "testing/step_{iteration}/2.done" or if the output(s) of "test_3_AAAAA" are also needed.
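To make that suggestion concrete, here is one possible sketch (the `decision` output name and the "continue"/"stop" convention are assumptions for illustration, not part of the original workflow): the checkpoint writes a second, declared output, and the input function of "test_4" reads it after requesting the checkpoint's outputs.

```python
checkpoint test_2:
    input:
        "testing/step_{iteration}/1.done"
    output:
        result="testing/step_{iteration}/2.done",
        decision="testing/step_{iteration}/decision.txt"
    shell:
        """
        cp {input} {output.result}
        # A real quality check would write 'continue' or 'stop' here
        echo "continue" > {output.decision}
        """

def get_checkpoint_result(wildcards):
    # Calling checkpoints.<name>.get() ensures the DAG has been re-evaluated
    decision_file = checkpoints.test_2.get(iteration=wildcards.iteration).output.decision
    with open(decision_file) as f:
        decision = f.read().strip()
    if decision == "continue":
        # Full path through test_3_AAAAA and test_3_BBBBB
        return f"testing/step_{wildcards.iteration}/3b.done"
    else:
        # Skip rules 3A/3B entirely and depend on the checkpoint output directly
        return f"testing/step_{wildcards.iteration}/2.done"
```

Because the decision file is a declared output, Snakemake knows the checkpoint provides it, and the downstream rule's dependencies are resolved from its content rather than from files touched behind Snakemake's back.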
I'm wondering whether, with this approach, it would be possible to do this with just one "recursive" checkpoint (instead of one rule for each iteration), having an iteration wildcard, and whose input would check the output of the previous iteration to decide whether it should generate the final output or not. It may be better to get things working with a fixed maximum number of iterations, though...