I am having trouble running a Snakemake rule that uses the sortmerna package.
I have 6 samples in FASTQ format and want to process each one with sortmerna. The rule I am attempting to run is the following:
# Define the directory containing the FASTQ files
single_end_dir = "FASTQ/single_end"

# Define patterns to match specific files
sample_name = glob_wildcards(single_end_dir + "/{sample}.fastq").sample

rule all:
    input:
        expand("results/sortmerna_files/unpaired/rRNA/{sample}.log", sample = sample_name)

rule rna_filtering_not_paired:
    input:
        reads = "FASTQ/single_end/{sample}.fastq",
    output:
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}.log"
    params:
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}",
        other = "results/sortmerna_files/unpaired/rRNAf/{sample}",
        threads = 24
    conda:
        "../envs/fastqc.yaml"
    shell:
        """
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads {input.reads} --aligned {params.aligned} --other {params.other} --workdir /home/oscar/rnaseq --fastx -threads {params.threads} -v --idx-dir ./idx
        """
I then run the Snakefile with snakemake --use-conda -c 24.
I have tried placing the ./kvdb directory that sortmerna creates in a temporary directory via the params of the rule rna_filtering_not_paired, but it did not affect the outcome: one job completes, but the rest fail.
I suspect the jobs fail because of this directory, but I am unable to think of another solution.
If I run the shell commands that Snakemake prints one after another, they produce the expected files, so the problem must lie in the parallel use of kvdb.
The snakemake log outputs the following:
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 24
Rules claiming more threads will be scaled down.
Job stats:
job                         count
------------------------  -------
all                             1
rna_filtering_not_paired        6
total                           7
Select jobs to execute...
Execute 6 jobs...
[Mon Aug 5 10:53:53 2024]
localrule rna_filtering_not_paired:
input: FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
jobid: 2
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
wildcards: sample=Zwt3_02162AAC_GATCAG
resources: tmpdir=/tmp
Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug 5 10:53:53 2024]
localrule rna_filtering_not_paired:
input: FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
jobid: 6
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
wildcards: sample=Zwt2_02160AAC_TTAGGC
resources: tmpdir=/tmp
Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug 5 10:53:53 2024]
localrule rna_filtering_not_paired:
input: FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
jobid: 1
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
wildcards: sample=Zwt1_02158AAC_ATCACG
resources: tmpdir=/tmp
Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug 5 10:53:53 2024]
localrule rna_filtering_not_paired:
input: FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
jobid: 5
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
wildcards: sample=Zcr2_02161AAC_CAGATC
resources: tmpdir=/tmp
Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug 5 10:53:53 2024]
localrule rna_filtering_not_paired:
input: FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
jobid: 4
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
wildcards: sample=Zcr1_02159AAC_CGATGT
resources: tmpdir=/tmp
Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug 5 10:53:53 2024]
localrule rna_filtering_not_paired:
input: FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
jobid: 3
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
wildcards: sample=Zcr3_02163AAC_AGTTCC
resources: tmpdir=/tmp
Activating conda environment: .snakemake/conda/172f44aa594738803b665ab48840e734_
[Mon Aug 5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
jobid: 1
input: FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
shell:
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG --other results/sortmerna_files/unpaired/rRNAf/Zwt1_02158AAC_ATCACG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[Mon Aug 5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
jobid: 5
input: FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
shell:
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC --other results/sortmerna_files/unpaired/rRNAf/Zcr2_02161AAC_CAGATC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[Mon Aug 5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
jobid: 2
input: FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
shell:
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG --other results/sortmerna_files/unpaired/rRNAf/Zwt3_02162AAC_GATCAG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[Mon Aug 5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
jobid: 6
input: FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
shell:
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC --other results/sortmerna_files/unpaired/rRNAf/Zwt2_02160AAC_TTAGGC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[Mon Aug 5 10:53:53 2024]
Error in rule rna_filtering_not_paired:
jobid: 3
input: FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
shell:
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC --other results/sortmerna_files/unpaired/rRNAf/Zcr3_02163AAC_AGTTCC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[Mon Aug 5 10:54:12 2024]
Error in rule rna_filtering_not_paired:
jobid: 4
input: FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
conda-env: /home/oscar/rnaseq/.snakemake/conda/172f44aa594738803b665ab48840e734_
shell:
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT --other results/sortmerna_files/unpaired/rRNAf/Zcr1_02159AAC_CGATGT --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job rna_filtering_not_paired since they might be corrupted:
results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-08-05T105352.555644.snakemake.log
WorkflowError:
At least one job did not complete successfully.
The dry run outputs the following:
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Job stats:
job                         count
------------------------  -------
all                             1
rna_filtering_not_paired        6
total                           7
Execute 6 jobs...
[Mon Aug 5 11:03:45 2024]
rule rna_filtering_not_paired:
input: FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
jobid: 5
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log
wildcards: sample=Zcr2_02161AAC_CAGATC
resources: tmpdir=<TBD>
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr2_02161AAC_CAGATC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC --other results/sortmerna_files/unpaired/rRNAf/Zcr2_02161AAC_CAGATC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
[Mon Aug 5 11:03:45 2024]
rule rna_filtering_not_paired:
input: FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
jobid: 3
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log
wildcards: sample=Zcr3_02163AAC_AGTTCC
resources: tmpdir=<TBD>
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr3_02163AAC_AGTTCC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC --other results/sortmerna_files/unpaired/rRNAf/Zcr3_02163AAC_AGTTCC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
[Mon Aug 5 11:03:45 2024]
rule rna_filtering_not_paired:
input: FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
jobid: 2
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log
wildcards: sample=Zwt3_02162AAC_GATCAG
resources: tmpdir=<TBD>
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt3_02162AAC_GATCAG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG --other results/sortmerna_files/unpaired/rRNAf/Zwt3_02162AAC_GATCAG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
[Mon Aug 5 11:03:45 2024]
rule rna_filtering_not_paired:
input: FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
jobid: 6
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
wildcards: sample=Zwt2_02160AAC_TTAGGC
resources: tmpdir=<TBD>
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt2_02160AAC_TTAGGC.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC --other results/sortmerna_files/unpaired/rRNAf/Zwt2_02160AAC_TTAGGC --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
[Mon Aug 5 11:03:45 2024]
rule rna_filtering_not_paired:
input: FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq
output: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
jobid: 1
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log
wildcards: sample=Zwt1_02158AAC_ATCACG
resources: tmpdir=<TBD>
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zwt1_02158AAC_ATCACG.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG --other results/sortmerna_files/unpaired/rRNAf/Zwt1_02158AAC_ATCACG --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
[Mon Aug 5 11:03:45 2024]
rule rna_filtering_not_paired:
input: FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq
output: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
jobid: 4
reason: Missing output files: results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log
wildcards: sample=Zcr1_02159AAC_CGATGT
resources: tmpdir=<TBD>
mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
sortmerna --ref /home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta --reads FASTQ/single_end/Zcr1_02159AAC_CGATGT.fastq --aligned results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT --other results/sortmerna_files/unpaired/rRNAf/Zcr1_02159AAC_CGATGT --workdir /home/oscar/rnaseq --fastx -threads 24 -v --idx-dir ./idx
rm -r ./kvdb
Execute 1 jobs...
[Mon Aug 5 11:03:45 2024]
rule all:
input: results/sortmerna_files/unpaired/rRNA/Zwt1_02158AAC_ATCACG.log, results/sortmerna_files/unpaired/rRNA/Zwt3_02162AAC_GATCAG.log, results/sortmerna_files/unpaired/rRNA/Zcr3_02163AAC_AGTTCC.log, results/sortmerna_files/unpaired/rRNA/Zcr1_02159AAC_CGATGT.log, results/sortmerna_files/unpaired/rRNA/Zcr2_02161AAC_CAGATC.log, results/sortmerna_files/unpaired/rRNA/Zwt2_02160AAC_TTAGGC.log
jobid: 0
reason: Rules with a run or shell declaration but no output are always executed.
resources: tmpdir=<TBD>
echo "I just run subrules!"
Job stats:
job                         count
------------------------  -------
all                             1
rna_filtering_not_paired        6
total                           7
Reasons:
(check individual jobs above for details)
input files updated by another job:
all
output files have to be generated:
rna_filtering_not_paired
run or shell but no output:
all
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Assuming the problem is what you suspect (that multiple copies of sortmerna running in the same directory at the same time interfere with each other), Snakemake has a general solution for this: shadow rules.
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#shadow-rules
Specifically, try using shadow: "minimal" for your rule. The link above explains this better than I can. Using a shadow directory does make the logs look a little more complex, but the advantages are such that I'd advocate making it the default for all rules unless there is a good reason not to. In Nextflow, shadow directories are the default (or maybe even mandatory, I can't recall).
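As a rough, untested sketch of how that could look for the rule in the question: a shadow directory only isolates relative paths, so the absolute --workdir /home/oscar/rnaseq in the original command would still be shared by all jobs. The version below assumes sortmerna keeps kvdb under its --workdir (which matches the ./kvdb you are seeing) and that it is acceptable to point --workdir at the job's current directory, which inside a shadow rule is the per-job shadow directory. The ref entry under params is a name I introduced for readability; everything else is taken from the question.

```snakemake
rule rna_filtering_not_paired:
    input:
        reads = "FASTQ/single_end/{sample}.fastq",
    output:
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}.log"
    params:
        aligned = "results/sortmerna_files/unpaired/rRNA/{sample}",
        other = "results/sortmerna_files/unpaired/rRNAf/{sample}",
        # Hypothetical params entry: absolute path, so it still resolves from the shadow directory
        ref = "/home/oscar/rnaseq/resources/rRNA_databases_v4/smr_v4.3_default_db.fasta",
        threads = 24
    # Run each job in its own temporary working directory
    shadow: "minimal"
    conda:
        "../envs/fastqc.yaml"
    shell:
        """
        mkdir -p results/sortmerna_files/unpaired/rRNA results/sortmerna_files/unpaired/rRNAf
        # Assumption: $PWD is the shadow directory here, so kvdb is created per job
        sortmerna --ref {params.ref} --reads {input.reads} --aligned {params.aligned} --other {params.other} --workdir "$PWD" --fastx -threads {params.threads} -v --idx-dir ./idx
        """
```

Two things to keep in mind with this approach. Only files declared under output are moved out of the shadow directory when the job finishes, so the reads written via --aligned and --other would need to be added to output if you want to keep them. Also, because ./idx is now relative to the shadow directory, each job will most likely rebuild the index; pointing --idx-dir at an absolute path with a pre-built index may avoid that, though I have not checked how sortmerna behaves when several jobs read the same index at once.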
I'll also mention that there is a wrapper/helper available for sortmerna:
https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/sortmerna.html
If the wrapper works for you, you can leave the fiddly details of running the application to it. But of course, if there are problems, you are then stuck debugging the wrapper code, and it may be easier to fix the rule as you have it with a shell command.