python bioinformatics snakemake transcription

Snakemake runs rule too many times using config.yaml

I'm trying to create a Snakemake workflow that evaluates raw read quality using FastQC and creates a report using MultiQC. I use 4 input files and get the expected results; however, I just noticed that each rule runs 4 times and takes all 4 inputs each time, and I'm not sure how to fix that. Could anyone help me figure out how to do either of the following:

Run the rule 4 times but use only one input from config.yaml at a time?
Run the rule 1 time but use all 4 inputs?

I'm trying to follow the snakemake tutorial but no luck so far.

Snakefile:

configfile: "config.yaml"

rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
        
rule raw_fastqc:
    input:
        expand("data/samples/{sample}.fastq", sample=config["samples"])
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"

rule raw_multiqc:
    input:
        expand("outputs/fastqc_1/{sample}_fastqc.html", sample=config["samples"]),
        expand("outputs/fastqc_1/{sample}_fastqc.zip", sample=config["samples"])
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"

config.yaml file:

samples:
    Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq
    Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq
    KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq
    KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001: data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq

I run Snakemake using this command:

snakemake -s Snakefile --core 1

Each rule is run 4 times:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count    min threads    max threads
-----------  -------  -------------  -------------
all                1              1              1
raw_fastqc         4              1              1
raw_multiqc        4              1              1
total              9              1              1

But each time all 4 inputs are used:

[Sun May 15 23:06:22 2022]
rule raw_fastqc:
    input: data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R1_001.fastq, data/samples/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R1_001.fastq, data/samples/KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R2_001.fastq
    output: outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.html, outputs/fastqc_1/Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001_fastqc.zip
    jobid: 3
    wildcards: sample=Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R2_001
    resources: tmpdir=/tmp

Solution

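The problem is that you call expand() in the input of every rule. expand() replaces {sample} with every value from the config, so each job receives all 4 files no matter which sample it is building. You only need expand() in rule all: Snakemake matches each requested target against a rule's output pattern to work out the value of {sample}, then substitutes that value into the same rule's input, so wildcard values are passed on to upstream rules. If you would rather run a rule once over all samples (your second option), do the opposite for that rule: keep expand() in its input and give it an output path that contains no {sample} wildcard.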
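To see the difference outside Snakemake, here is a minimal plain-Python sketch. Note that expand_sketch and infer_wildcards are hypothetical stand-ins written for illustration, not Snakemake's actual implementation, and the short sample names are made up to keep the output readable:

```python
import re
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Simplified stand-in for expand(): fill the pattern with every combination of values."""
    keys = list(wildcards)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*(wildcards[k] for k in keys))
    ]


def infer_wildcards(output_pattern, target):
    """Simplified stand-in for how a target is matched against an output pattern."""
    # Escape the literal parts and capture whatever sits where {sample} is.
    parts = output_pattern.split("{sample}")
    regex = "(?P<sample>.+)".join(re.escape(part) for part in parts)
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


samples = ["A_R1", "A_R2", "B_R1", "B_R2"]  # hypothetical short sample names

# expand() in a rule's input: the full list, regardless of which job is running.
expanded_input = expand_sketch("data/samples/{sample}.fastq", sample=samples)

# Bare pattern in a rule's input: the wildcard is inferred from the requested
# output and substituted into the input, so each job sees one file.
wildcards = infer_wildcards(
    "outputs/fastqc_1/{sample}_fastqc.html",
    "outputs/fastqc_1/A_R1_fastqc.html",
)
single_input = "data/samples/{sample}.fastq".format(**wildcards)

print(len(expanded_input), "inputs with expand()")
print(single_input)
```

The first print reports all four files, which is what every raw_fastqc job in the question received; the second shows the single file a job gets once the input is a plain {sample} pattern.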

Snakefile:

configfile: "config.yaml"

rule all:
    input:
        expand("outputs/multiqc_report_1/{sample}_multiqc_report_1.html", sample=config["samples"])
        
rule raw_fastqc:
    input:
        "data/samples/{sample}.fastq"
    output:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o outputs/fastqc_1/"

rule raw_multiqc:
    input:
        "outputs/fastqc_1/{sample}_fastqc.html",
        "outputs/fastqc_1/{sample}_fastqc.zip"
    output:
        "outputs/multiqc_report_1/{sample}_multiqc_report_1.html"
    shell:
        "multiqc ./outputs/fastqc_1/ -n {output}"