python snakemake

How to prevent rule execution due to missing benchmark file in Snakemake?


I'm using Snakemake version 8.23.2 and encountering unexpected behavior related to benchmark files. I want to prevent rules from being re-executed solely due to missing benchmark files.

Snakefile

from datetime import datetime
NOW = datetime.now().strftime("%Y%m%d-%H%M%S")


rule all:
    input:
        "test"

rule test:
    output:
        "test"
    benchmark:
        "benchmark/" + NOW + ".tsv"
    shell:
        """
        sleep 3
        touch {output}
        """

Unexpected Behavior

When I run snakemake test, the rule is executed on the first run as expected. However, on subsequent runs, it's triggered again due to the missing benchmark file:

❯ snakemake test
[...]
[Fri Oct 25 15:22:20 2024]
localrule test:
    output: test
    jobid: 0
    benchmark: benchmark/20241025-152220.tsv
    reason: Missing output files: benchmark/20241025-152220.tsv
    resources: tmpdir=/tmp
[...]

Expected Behavior

I expect Snakemake to consider the rule up-to-date if the main output file exists, regardless of the benchmark file's presence:

❯ snakemake all
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

What I've Tried

I've attempted to use the --rerun-triggers option, but it doesn't seem to resolve this issue.

Use Case

I often run snakemake $(snakemake --list-target-rules) to trigger all target rules. Some target rules may have already succeeded in previous runs, and I don't want them to re-execute just because of missing benchmark files.

Is there a way to configure Snakemake to ignore missing benchmark files when determining whether a rule needs to be re-run?


Solution

  • The short answer is "no".

    When Snakemake is given a target rule name, as opposed to a target file name, it adds all the outputs, logs and benchmarks of that rule as target files that need to be generated. Arguably, benchmarks should not be included, but they are. By giving the benchmark file a time-dependent name, you therefore guarantee that the rules re-run every time you run snakemake $(snakemake --list-target-rules), regardless of how you set the rerun triggers.
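    To make the time dependence concrete, here is a small plain-Python sketch, independent of Snakemake itself. The helper benchmark_path is hypothetical; it only mirrors the NOW expression from the question, with fixed timestamps standing in for two separate invocations.

```python
from datetime import datetime


def benchmark_path(now: datetime) -> str:
    """Build the benchmark path the same way the question's Snakefile does."""
    return "benchmark/" + now.strftime("%Y%m%d-%H%M%S") + ".tsv"


# Two invocations a few seconds apart (fixed timestamps for illustration).
first_run = benchmark_path(datetime(2024, 10, 25, 15, 22, 20))
second_run = benchmark_path(datetime(2024, 10, 25, 15, 22, 23))

print(first_run)
print(second_run)

# The second invocation asks for a path the first one never created,
# so that benchmark file is always reported as missing.
print(first_run != second_run)
```

    Each new invocation therefore requests a benchmark path that no earlier run could have written.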

    Note that in your sample code the output file has the same name as the rule, "test". When you run snakemake test, Snakemake therefore treats test as the name of the rule and adds the benchmark file to its list of targets. If the rule (or the file) is renamed, you can request that specific file instead, and this does not trigger a re-run of the rule.

    I think what you need to do is have your all rule explicitly list the output files of all the target rules, and then just run $ snakemake all. It's not as convenient as being able to run snakemake $(snakemake --list-target-rules), but with your current way of saving benchmarks I think that is what you'll need to do.

    Unless you do something hideous like:

    $ snakemake -n $(snakemake --list-target-rules) | grep -Po '(?<=^    output: ).+'
    

    but don't do that!
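
    As a minimal sketch of the explicit all rule suggested above, the Snakefile below is the one from the question with the output renamed to test.txt. That file name is hypothetical, chosen only so that it no longer matches the rule name. The all rule lists files only, so the time-stamped benchmark should not be pulled in as a target.

```snakemake
from datetime import datetime
NOW = datetime.now().strftime("%Y%m%d-%H%M%S")

# List the output files of every target rule here, not the rule names.
rule all:
    input:
        "test.txt"

rule test:
    output:
        "test.txt"  # hypothetical name, distinct from the rule name
    benchmark:
        "benchmark/" + NOW + ".tsv"
    shell:
        """
        sleep 3
        touch {output}
        """
```

    With this layout, snakemake all (or requesting the file directly with snakemake test.txt) should report that nothing needs to be done once test.txt exists, whereas snakemake test would still name the rule and so request a fresh benchmark file.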