I'm trying to create this snakemake workflow which would evaluate raw reads quality using FastQc and create a raport using MultiQC. I use 4 input files and get expected results, however I just noticed that each rule gets run 4 times and takes all 4 inputs each time and I'm not sure how to fix that. Could anyone help me figure out how to:
I'm trying to follow the snakemake tutorial but no luck so far.
Snakefile:
configfile: "config.yaml"
rule all:
input:
expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
rule raw_fastqc:
input:
expand("data/samples/{sample}.fastq", sample=config["samples"])
output:
"outputs/fastqc_1/{sample}_fastqc.html",
"outputs/fastqc_1/{sample}_fastqc.zip"
shell:
"fastqc {input} -o outputs/fastqc_1/"
rule raw_multiqc:
input:
expand("outputs/fastqc_1/{sample}_fastqc.html", sample=config["samples"]),
expand("outputs/fastqc_1/{sample}_fastqc.zip", sample=config["samples"])
output:
"outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
shell:
"multiqc ./outputs/fastqc_1/ -n {output}"
config.yaml file:
samples:
Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq
Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq
KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq
KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
I run the snakemake using command:
snakemake -s Snakefile --core 1
Each rule is run 4 times:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
----------- ------- ------------- -------------
all 1 1 1
raw_fastqc 4 1 1
raw_multiqc 4 1 1
total 9 1 1
But each time all 4 inputs are used:
[Sun May 15 23:06:22 2022]
rule raw_fastqc:
input: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq, data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
output: outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.html, outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.zip
jobid: 3
wildcards: sample=Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001
resources: tmpdir=/tmp
Your problem is using expand()
in the input of each rule. Because expand
fills in wildcard values, you only need to do that in the all
rule since wildcard values are passed on to upstream rules.
Snakefile:
configfile: "config.yaml"
rule all:
input:
expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
rule raw_fastqc:
input:
"data/samples/{sample}.fastq"
output:
"outputs/fastqc_1/{sample}_fastqc.html",
"outputs/fastqc_1/{sample}_fastqc.zip"
shell:
"fastqc {input} -o outputs/fastqc_1/"
rule raw_multiqc:
input:
"outputs/fastqc_1/{sample}_fastqc.html",
"outputs/fastqc_1/{sample}_fastqc.zip",
output:
"outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
shell:
"multiqc ./outputs/fastqc_1/ -n {output}"