conda bioinformatics snakemake

Snakemake rule fails after doing only 1 job - Sortmerna


I am having trouble running a Snakemake rule that uses the sortmerna package.

I have 6 samples in fastq format. I want to process each sample with sortmerna. The rule I am attempting to run is the following:

# Define the directory containing the FASTQ files
single_end_dir = "FASTQ/single_end"

# Define patterns to match specific files
sample_name = glob_wildcards(single_end_dir + "/{sample}.fastq").sample


rule all:
    input:
        expand("results/sortmerna_files/unpaired/rRNA/{sample}.log", sample = sample_name)



rule rna_filtering_not_paired:
    input:
        reads = "FASTQ/single_end/{sample}.fastq",
    output:
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}.log"
    params:
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}",
        other = "results/sortmerna_files/unpaired/rRNAf/{sample}",
        threads = 24
    conda:
        "../envs/fastqc.yaml"
    shell:
        """
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads {input.reads} --aligned {params.aligned} --other {params.other} --workdir /home/oscar/rnaseq --fastx -threads {params.threads} -v --idx-dir ./idx
        
        """

I then run the snakefile with snakemake --use-conda -c 24.

I have tried placing the ./kvdb directory that sortmerna creates in a temporary directory, set through the params of the rule rna_filtering_not_paired, but it did not change the outcome: one job completes and the rest fail.

I suspect the jobs fail because of this directory, but I am unable to think of another solution.

If I run the shell commands that Snakemake prints one after another, they produce the expected files, so the problem must lie in the parallel use of kvdb.

The snakemake log outputs the following:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 24
Rules claiming more threads will be scaled down.
Job stats:
job                         count
------------------------  -------
all                             1
rna_filtering_not_paired        6
total                           7

Select jobs to execute...
Execute 6 jobs...

[Mon Aug  5 10:53:53 2024]
localrule rna_filtering_not_paired:
    input: FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
    jobid: 2
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
    wildcards: sample=Zwt3_02162AAC_GATCAG
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_

[Mon Aug  5 10:53:53 2024]
localrule rna_filtering_not_paired:
    input: FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
    jobid: 6
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
    wildcards: sample=Zwt2_02160AAC_TTAGGC
    resources: tmpdir=/tmp


Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug  5 10:53:53 2024]
localrule rna_filtering_not_paired:
    input: FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
    jobid: 1
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
    wildcards: sample=Zwt1_02158AAC_ATCACG
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_

[Mon Aug  5 10:53:53 2024]
localrule rna_filtering_not_paired:
    input: FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
    jobid: 5
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
    wildcards: sample=Zcr2_02161AAC_CAGATC
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_

[Mon Aug  5 10:53:53 2024]
localrule rna_filtering_not_paired:
    input: FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
    jobid: 4
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
    wildcards: sample=Zcr1_02159AAC_CGATGT
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_

[Mon Aug  5 10:53:53 2024]
localrule rna_filtering_not_paired:
    input: FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
    jobid: 3
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
    wildcards: sample=Zcr3_02163AAC_AGTTCC
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug  5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
    jobid: 1
    input: FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
    conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
    shell:
        
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG --other results/sortmerna_files/unpaired/rRNAf/Zwt1_02158AAC_ATCACG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Mon Aug  5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
    jobid: 5
    input: FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
    conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
    shell:
        
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC --other results/sortmerna_files/unpaired/rRNAf/Zcr2_02161AAC_CAGATC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Mon Aug  5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
    jobid: 2
    input: FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
    conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
    shell:
        
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG --other results/sortmerna_files/unpaired/rRNAf/Zwt3_02162AAC_GATCAG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Mon Aug  5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
    jobid: 6
    input: FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
    conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
    shell:
        
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC --other results/sortmerna_files/unpaired/rRNAf/Zwt2_02160AAC_TTAGGC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Mon Aug  5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
    jobid: 3
    input: FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
    conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
    shell:
        
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC --other results/sortmerna_files/unpaired/rRNAf/Zcr3_02163AAC_AGTTCC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Mon Aug  5 10:54:12 2024]
Error in rule rna_filtering_not_paired:
    jobid: 4
    input: FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
    conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
    shell:
        
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT --other results/sortmerna_files/unpaired/rRNAf/Zcr1_02159AAC_CGATGT --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job rna_filtering_not_paired since they might be corrupted:
results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-08-05T105352.555644.snakemake.log
WorkflowError:
At least one job did not complete successfully.

The dry run outputs the following:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Job stats:
job                         count
------------------------  -------
all                             1
rna_filtering_not_paired        6
total                           7

Execute 6 jobs...

[Mon Aug  5 11:03:45 2024]
rule rna_filtering_not_paired:
    input: FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
    jobid: 5
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
    wildcards: sample=Zcr2_02161AAC_CAGATC
    resources: tmpdir=<TBD>


        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC --other results/sortmerna_files/unpaired/rRNAf/Zcr2_02161AAC_CAGATC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        

[Mon Aug  5 11:03:45 2024]
rule rna_filtering_not_paired:
    input: FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
    jobid: 3
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
    wildcards: sample=Zcr3_02163AAC_AGTTCC
    resources: tmpdir=<TBD>


        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC --other results/sortmerna_files/unpaired/rRNAf/Zcr3_02163AAC_AGTTCC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        

[Mon Aug  5 11:03:45 2024]
rule rna_filtering_not_paired:
    input: FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
    jobid: 2
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
    wildcards: sample=Zwt3_02162AAC_GATCAG
    resources: tmpdir=<TBD>


        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG --other results/sortmerna_files/unpaired/rRNAf/Zwt3_02162AAC_GATCAG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        

[Mon Aug  5 11:03:45 2024]
rule rna_filtering_not_paired:
    input: FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
    jobid: 6
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
    wildcards: sample=Zwt2_02160AAC_TTAGGC
    resources: tmpdir=<TBD>


        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC --other results/sortmerna_files/unpaired/rRNAf/Zwt2_02160AAC_TTAGGC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        

[Mon Aug  5 11:03:45 2024]
rule rna_filtering_not_paired:
    input: FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
    jobid: 1
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
    wildcards: sample=Zwt1_02158AAC_ATCACG
    resources: tmpdir=<TBD>


        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG --other results/sortmerna_files/unpaired/rRNAf/Zwt1_02158AAC_ATCACG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        

[Mon Aug  5 11:03:45 2024]
rule rna_filtering_not_paired:
    input: FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq
    output: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
    jobid: 4
    reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
    wildcards: sample=Zcr1_02159AAC_CGATGT
    resources: tmpdir=<TBD>


        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT --other results/sortmerna_files/unpaired/rRNAf/Zcr1_02159AAC_CGATGT --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
        rm -r ./kvdb
        
Execute 1 jobs...

[Mon Aug  5 11:03:45 2024]
rule all:
    input: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log, results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log, results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log, results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log, results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log, results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
    jobid: 0
    reason: Rules with a run or shell declaration but no output are always executed.
    resources: tmpdir=<TBD>

echo "I just run subrules!"
Job stats:
job                         count
------------------------  -------
all                             1
rna_filtering_not_paired        6
total                           7

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        all
    output files have to be generated:
        rna_filtering_not_paired
    run or shell but no output:
        all

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Solution

  • Assuming the problem is what you suspect - that multiple copies of sortmerna running in the same directory at the same time interfere with each other - Snakemake has a general solution for this.

    https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#shadow-rules

    Specifically, try using shadow: "minimal" for your rule. The link above explains this better than I can. Using a shadow directory does make the logs look a little more complex, but the advantages are such that I'd actually advocate making it the default for all rules unless there is a good reason not to. In Nextflow, shadow directories are the default (or maybe even mandatory, I can't recall).

    I'll also mention that there is a wrapper/helper available for sortmerna:

    https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/sortmerna.html

    If the wrapper works for you, you can use it and leave the fiddly details of running the application to it. But of course, if there are problems, you are then stuck debugging the wrapper code, and it may just be easier to fix the rule as you have it with a shell command.
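
As a rough, untested sketch of the shadow suggestion above, the rule from the question might be adapted along these lines. Several parts are assumptions rather than confirmed behaviour. The `ref` entry under `params` only relocates the path from the question. `threads: 24` replaces `params.threads` so that Snakemake's scheduler knows each job wants all 24 cores. `--workdir "$PWD"` replaces the absolute path on the assumption that sortmerna creates kvdb inside its work directory, in which case an absolute `--workdir` shared by every job would bypass the shadow directory entirely.

```snakemake
rule rna_filtering_not_paired:
    input:
        reads = "FASTQ/single_end/{sample}.fastq",
    output:
        # Only files declared here are moved out of the shadow directory.
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}.log",
    params:
        # Path copied from the question; moved into params for readability.
        ref = "/home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta",
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}",
        other = "results/sortmerna_files/unpaired/rRNAf/{sample}",
    # Declared as a directive so the scheduler accounts for the cores used.
    threads: 24
    # Each job runs in its own temporary copy of the working directory.
    shadow: "minimal"
    conda:
        "../envs/fastqc.yaml"
    shell:
        """
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        # Assumption: kvdb and idx are created under --workdir, so pointing it
        # at the shadow directory keeps them private to this job.
        sortmerna --ref {params.ref} --reads {input.reads} --aligned {params.aligned} --other {params.other} --workdir "$PWD" --fastx -threads {threads} -v
        """
```

Two consequences are worth keeping in mind. First, Snakemake discards everything in the shadow directory except declared outputs, so the read files that sortmerna writes for `--aligned` and `--other` would need to be added under `output` to survive; their exact names depend on the sortmerna version, so they are left out here. Second, with `threads: 24` and `-c 24`, Snakemake should run one job at a time, and each job would rebuild the index in its own shadow directory unless `--idx-dir` is pointed at a shared location.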