In a two-step workflow, if I run the workflow to completion and then realize I must re-run step 1, I expect that running step 1 with --force will cause step 2 to be run on the next invocation of the whole workflow.
Minimal example
Given this snakefile called bad.snakefile:
rule runall:
    input: "bye.txt"
    output: "hello.txt"
    shell: "echo hello > {output}"

rule realoutput:
    output: "bye.txt"
    shell: "echo bye > {output}"
1. snakemake --snakefile bad.snakefile --cores 1
   It works successfully and runs both steps.
2. snakemake --snakefile bad.snakefile --cores 1 --force realoutput
   This successfully runs the first step.
3. snakemake --snakefile bad.snakefile --cores 1
   This should cause Step 2 to be re-run. However, it says everything is complete. But it is clear that the mtime for step 1 output is greater than for step 2 output. This is not fixed by adding --rerun-triggers=mtime.

The summary command shows the timestamps are not as expected for a workflow saying "Nothing to be done (all requested files are present and up to date).":
$> snakemake --snakefile bad.snakefile --cores 1 --summary
Building DAG of jobs...
output_file date rule log-file(s) status plan
hello.txt Thu Sep 19 11:25:20 2024 runall ok no update
bye.txt Thu Sep 19 11:30:04 2024 realoutput ok no update
It works correctly if I do snakemake --snakefile bad.snakefile --cores 1 --forcerun realoutput. However, this forces me to run all the steps at once. I wish to re-run step 1 and then have a chance to check the outcome. Then, I want to proceed with the re-run of the dependent steps.
I just tested and I think you are encountering the same issue as explained here.
Running through such steps with a pipeline, toy or otherwise, that involves only tiny files can be confusing if you expect the timestamps to be respected in the way you'd predict. So you have to change things or work around it.
To test whether the size of the output was the cause, I first ran the following two commands with your exact code in the Snakefile:
snakemake -c1
snakemake -c1 realoutput --force
Then when I ran snakemake -c1 --summary and snakemake -c1, I saw what you saw. The timestamps showed bye.txt being made later, but snakemake didn't re-run anything.
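If you want to double-check the timestamps outside of Snakemake, a small Python sketch like the following compares the modification times directly. The helper name is_newer and the timestamps used in the demo are my own illustrative choices; in a real check you would pass the paths of your actual bye.txt and hello.txt.

```python
import os
import tempfile


def is_newer(path_a: str, path_b: str) -> bool:
    """Return True if path_a was modified strictly later than path_b."""
    return os.stat(path_a).st_mtime_ns > os.stat(path_b).st_mtime_ns


# Self-contained demo: recreate the situation from the summary with two
# throwaway files whose mtimes are set explicitly (values are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    hello = os.path.join(tmp, "hello.txt")
    bye = os.path.join(tmp, "bye.txt")
    for path, text in ((hello, "hello\n"), (bye, "bye\n")):
        with open(path, "w") as handle:
            handle.write(text)

    os.utime(hello, (1_000_000_000, 1_000_000_000))  # older, like runall's output
    os.utime(bye, (1_000_000_300, 1_000_000_300))    # newer, like realoutput's output

    bye_is_newer = is_newer(bye, hello)
    print("bye.txt newer than hello.txt:", bye_is_newer)
```

This only confirms what --summary already reports; it says nothing about how Snakemake itself decides what to re-run.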
Having established that, to test whether size was the cause I changed the shell: "echo bye > {output}" line in the realoutput rule of the Snakefile to the following:
shell: "cat index.ipynb > {output}"
(index.ipynb was a good-sized file I had handy; use whatever you want that is a good size to try this yourself.)
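If you don't have a suitably large file lying around, a sketch along these lines can generate one. The function name make_test_file, the file name big_input.bin, and the 1 MB default are all arbitrary choices of mine, not anything Snakemake requires.

```python
import os


def make_test_file(path: str, size_bytes: int = 1_000_000) -> int:
    """Write size_bytes of random data to path and return the resulting size."""
    with open(path, "wb") as handle:
        handle.write(os.urandom(size_bytes))
    return os.path.getsize(path)


# Example use (writes into the current directory):
#     make_test_file("big_input.bin")
# The rule's shell line would then read from big_input.bin instead of
# index.ipynb.
```

The file only needs to be generated once, before running the workflow.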
Then I went through the same steps with snakemake, literally using the arrow keys on the command line to run the same commands, and now it wanted to update again if I just ran realoutput. I saw update pending if I included --summary, and running snakemake -c1 ran the runall rule again now.
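One way to picture the behaviour seen in this experiment is the toy model below. It is only a mental model consistent with what I observed, not Snakemake's actual code: it assumes that a newer small input counts as changed only if its content differs, while a newer large input counts as changed on timestamp alone. The names input_counts_as_changed and SMALL_FILE_LIMIT, and the cutoff value, are hypothetical.

```python
import hashlib
import os
import tempfile

# Hypothetical cutoff for this sketch only; not taken from Snakemake.
SMALL_FILE_LIMIT = 100_000


def content_hash(path: str) -> str:
    """Return the SHA-256 hex digest of the file's content."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def input_counts_as_changed(path: str, output_mtime: float, recorded_hash: str) -> bool:
    """Toy model: would this input trigger a re-run of the rule consuming it?"""
    if os.path.getmtime(path) <= output_mtime:
        return False  # input is not newer than the output
    if os.path.getsize(path) < SMALL_FILE_LIMIT:
        # Small file: a newer mtime alone is not enough, content must differ.
        return content_hash(path) != recorded_hash
    return True  # large file: a newer mtime is enough


with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.txt")
    large = os.path.join(tmp, "large.bin")
    with open(small, "w") as handle:
        handle.write("bye\n")
    with open(large, "wb") as handle:
        handle.write(b"x" * (SMALL_FILE_LIMIT + 1))

    output_mtime = 1_000_000_000.0
    # Both inputs are regenerated later than the output, with unchanged content.
    for path in (small, large):
        os.utime(path, (output_mtime + 300, output_mtime + 300))

    small_changed = input_counts_as_changed(small, output_mtime, content_hash(small))
    large_changed = input_counts_as_changed(large, output_mtime, content_hash(large))
    print("small input triggers re-run:", small_changed)
    print("large input triggers re-run:", large_changed)
```

In this model the regenerated tiny file does not trigger a re-run while the large one does, which matches what I saw with echo bye versus cat index.ipynb.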
There, Tim Booth talks about more workarounds if you insist on using tiny data for your toy pipeline, or if you have an actual task that requires tiny files.