pythonwildcardsnakemakedirected-acyclic-graphswildcard-expansion

how to quickly identify if a rule in Snakemake needs an input function


I'm following the snakemake tutorial on their documentation page and really got stuck on the concept of input functions https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html#step-3-input-functions

Basically they define a config.yaml as follows:

samples:
  A: data/samples/A.fastq
  B: data/samples/B.fastq

and the Snakefile as follows without any input function:

configfile: "config.yaml"

rule all:
    input:
        "plots/quals.svg"

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 12
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} -O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

rule bcftools_call:
    input:
        fa = "data/genome.fa",
        bam = expand("sorted_reads/{sample}.bam",sample=config['samples']),
        bai = expand("sorted_reads/{sample}.bam.bai",sample=config['samples'])
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"

In the tutorial they mention that this expansion happens in the initialization step:

bam = expand("sorted_reads/{sample}.bam",sample=config['samples']),
bai = expand("sorted_reads/{sample}.bam.bai",sample=config['samples'])

and that the FASTQ paths cannot be determined for rule bwa_map in this phase. However the code works if we run as is, why is that ?

Then they recommend using an input function to defer bwa_map to the next phase (DAG phase) as follows:

def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

I'm really confused when an input function makes sense and when it does not ?


Solution

  • SultanOrazbayev has a good answer already. Here's another typical example.

    Often, the input and output files share the same pattern (wildcards). For example, if you want to sort a file you may do: input: {name}.txt -> output: {name}.sorted.txt.

    Sometimes however the input files are not linked to the output by a simple pattern. An example from bioinformatics is a rule that align reads to a genome:

    rule align:
        input:
            reads= '{name}.fastq',
            genome= 'human_genome.fa',
        output:
            bam= '{name}.bam',
        shell: ...
    

    here the name of the genome file is unrelated to the name of the input reads file and the name of the output bam file. The rule above works because the reference genome is a concrete filename without wildcards.

    But: What if the choice of reference genome depends on the input fastq file? For same input reads you may need the mouse genome and for others the human genome. An input function comes handy:

    def get_genome(wildcards):
        if wildcards.name in ['bob', 'alice']:
            return 'human_genome.fa',
        if wildcards.name in ['mickey', 'jerry'],
            return 'mouse_genome.fa',
    
    rule align:
        input:
            reads= '{name}.fastq',
            genome= get_genome,
        output:
            bam= '{name}.bam',
        shell: ...
    

    now the reference genome is mouse or human depending on the input reads.