snakemake expand

How to solve "Wildcards in input files cannot be determined from output files" in Snakemake


I am a new Snakemake user and I am trying to develop a pipeline on some test data so that we can later apply it to our real data. I have multiple folders (one per patient), and each folder contains multiple files for the tumour and normal samples. Here is the structure of my directories:

`A/
 A-T1.fastq
 A-N1.fastq

B/
 B-T1.fastq
 B-N1.fastq

C/
 C-T1.fastq
 C-N1.fastq`

and so on (more than 100 directories in total).

This is my Snakefile:

`#!/usr/bin/env snakemake

configfile:
    "config.json"

(DIRS,SAMPLES) = glob_wildcards(config['data']+"{dir}/{sample}.fastq")

rule all:
    input:
        expand("results/mapped/{dir}/{sample}.sorted.bam", dir=DIRS, sample=SAMPLES)

rule symlink:
    input:
         expand(config['data']+"{dir}/{{sample}}.fastq")
    output:
         "00-input/{dir}/{sample}.fastq"
    shell: 
        "ln -s {input} {output}"     
               

rule map_reads:
    input:
        "data/genome.fa",
        "00-input/{dir}/{sample}.fastq"
    output:
        "results/mapped/{dir}/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"


rule sort_alignments:
    input:
        "results/mapped/{dir}/{sample}.bam"
    output:
        "results/mapped/{dir}/{sample}.sorted.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"`

This is my config file:

`{
    "data": "/analysis/Anna/snakemake-demo/data/samples_fastq/"
}`

By running this script I get the following error message:

`WildcardError in line 13:
No values given for wildcard 'dir'.
`

I tried a different approach by modifying the input of my rule symlink:

 ` input:
         expand(config['data']+"{{dir}}/{{sample}}.fastq")
`

And this time I get a different error message:

`Missing input files for rule symlink:`

I have looked through several similar questions on Stack Overflow and tried their suggestions, but I have not been able to fix the error so far. I would appreciate it if someone could help me understand where my mistake is and how I can fix it. Thank you.


Solution

  • After trying several approaches and reading other Stack Overflow posts, I solved this using the very helpful answer to the question "Process multiple directories and all files within using snakemake" and the Snakemake FAQ entry https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-do-i-run-my-rule-on-all-files-of-a-certain-directory. By default, expand uses itertools.product to create every combination of the supplied wildcard values. It also takes an optional second positional argument that customizes how the values are combined. Passing zip there pairs each dir with its own sample instead of crossing every dir with every sample. In rule symlink I also dropped expand from the input, so {dir} and {sample} are resolved from the output file. Here is my working example, slightly simplified compared to my original question:

    #!/usr/bin/env snakemake
    
    configfile:
        "config.json"
    
    DIRS,SAMPLES = glob_wildcards(config['data']+"{dir}/{sample}.fastq")
    
    rule all:
        input:
            expand("results/mapped/{dir}/{sample}.sorted.bam", zip, dir=DIRS, sample=SAMPLES)
    
    rule symlink:
        input:
             config['data']+"{dir}/{sample}.fastq"
        output:
             "00-input/{dir}/{sample}.fastq"
        shell: 
            "ln -s {input} {output}"     
                   
    
    rule map_reads:
        input:
            fasta="data/genome.fa",
            fastq=rules.symlink.output
        output:
            "results/mapped/{dir}/{sample}.bam"
        conda:
            "envs/samtools.yaml"
        shell:
            "bwa mem {input} | samtools view -b - > {output}"
    
    
    rule sort_alignments:
        input:
            rules.map_reads.output
        output:
            "results/mapped/{dir}/{sample}.sorted.bam"
        conda:
            "envs/samtools.yaml"
        shell:
            "samtools sort -o {output} {input}"