snakemakefastq

Snakemake How to design downstream rules on empty output files?


I am wondering how to deal with empty output files in downstream rules. An assembly of a short read fastq data with SHOVILL can fail and produce a 0-byte contigs.fa

If genome annotation with PROKKA is run on a 0-byte file it returns an error:

 'input.fasta' is not a readable non-empty FASTA file

Snakemake rules for assembly and annotation :

rule shovill:
    input:
        fw="input/{fastq}_R1.fastq.gz",
        rv="input/{fastq}_R2.fastq.gz"
    output:
        directory("results/shovill/{sample}")
    threads: 16
    shell:
        "shovill --cpus {threads} --R1 {input.fw} --R2 {input.rv} --outdir {output}\
            2> {log}"

My current solution is to use || true and generate the result directory preliminary.

rule prokka:
    input:
        "results/shovill/{sample}/contigs.fa"
    output:
        directory("results/prokka/{sample}")
    threads: 8
    params:
        prefix="{sample}",
        gcode=config["genetic_code"],
        outdir="results/prokka/{sample}"
    shell:
        """
        mkdir -p {params.outdir}
        prokka --cpus {threads}  --force {input} &> {log} || true
        """

Solution

  • I can think of two approaches, neither is perfect.

    The first is basically what you did: use bash to work around. However, I would suggest the -s file test operator. This way, you still get notified of a genuine error from prokka:

    shell:
        """
        if [ -s {input} ]; then
           prokka --cpus {threads}  --force {input} > {log} 
        else;
           mkdir -p {params.outdir}
        """
    

    Alternatively, use checkpoints. This puts all the logic in snakemake, but it's a little cumbersome. Something like this, but I'm not sure if this is 100% right, or the best version:

    checkpoint gather_outputs:
        input: expand("results/shovill/{sample}/contigs.fa", sample=samples)
        output: 'results/shovill/non_empty.list'
        shell: 
            with open(str(output), 'wt') as output_handle:
                for input_file in input:
                    if os.path.exists(input_file) and os.path.getsize(input_file):
                        sample = os.path.basename(os.path.dirname(input_file))
                        output_handle.write(f"{sample}\n")
    
    def get_prokka_outputs(wildcards):
        # this makes the rest of the workflow wait for the checkpoint
        chk_out = checkpoints.gather_outputs.get().output
        with open(chk_out) as sample_lines:
            samples = [line.strip() for line in sample_lines]
    
        return expand("results/prokka/{sample}", sample=samples)
    

    You might be able to do a per-sample checkpoint, but I've never tried it.

    The more I think about it, checkpoints are too much overhead for your situation.