I am wondering how to deal with empty output files in downstream rules. An assembly of a short read fastq data with SHOVILL can fail and produce a 0-byte contigs.fa
If genome annotation with PROKKA is run on a 0-byte file it returns an error:
'input.fasta' is not a readable non-empty FASTA file
Snakemake rules for assembly and annotation :
rule shovill:
input:
fw="input/{fastq}_R1.fastq.gz",
rv="input/{fastq}_R2.fastq.gz"
output:
directory("results/shovill/{sample}")
threads: 16
shell:
"shovill --cpus {threads} --R1 {input.fw} --R2 {input.rv} --outdir {output}\
2> {log}"
My current solution is to use || true and generate the result directory preliminary.
rule prokka:
input:
"results/shovill/{sample}/contigs.fa"
output:
directory("results/prokka/{sample}")
threads: 8
params:
prefix="{sample}",
gcode=config["genetic_code"],
outdir="results/prokka/{sample}"
shell:
"""
mkdir -p {params.outdir}
prokka --cpus {threads} --force {input} &> {log} || true
"""
I can think of two approaches, neither is perfect.
The first is basically what you did: use bash to work around. However, I would suggest the -s
file test operator. This way, you still get notified of a genuine error from prokka:
shell:
"""
if [ -s {input} ]; then
prokka --cpus {threads} --force {input} > {log}
else;
mkdir -p {params.outdir}
"""
Alternatively, use checkpoints. This puts all the logic in snakemake, but it's a little cumbersome. Something like this, but I'm not sure if this is 100% right, or the best version:
checkpoint gather_outputs:
input: expand("results/shovill/{sample}/contigs.fa", sample=samples)
output: 'results/shovill/non_empty.list'
shell:
with open(str(output), 'wt') as output_handle:
for input_file in input:
if os.path.exists(input_file) and os.path.getsize(input_file):
sample = os.path.basename(os.path.dirname(input_file))
output_handle.write(f"{sample}\n")
def get_prokka_outputs(wildcards):
# this makes the rest of the workflow wait for the checkpoint
chk_out = checkpoints.gather_outputs.get().output
with open(chk_out) as sample_lines:
samples = [line.strip() for line in sample_lines]
return expand("results/prokka/{sample}", sample=samples)
You might be able to do a per-sample checkpoint, but I've never tried it.
The more I think about it, checkpoints are too much overhead for your situation.