I am a new Snakemake user and am developing a pipeline on some test data so that I can later apply it to our real data. I have multiple folders (one per patient), and each folder contains the files for that patient's tumour and normal samples. Here is the structure of my directories:
`A/
    A-T1.fastq
    A-N1.fastq
B/
    B-T1.fastq
    B-N1.fastq
C/
    C-T1.fastq
    C-N1.fastq`
and so on (more than 100 directories in total).
This is my Snakefile:
`#!/usr/bin/env snakemake
configfile:
    "config.json"
(DIRS,SAMPLES) = glob_wildcards(config['data']+"{dir}/{sample}.fastq")
rule all:
    input:
        expand("results/mapped/{dir}/{sample}.sorted.bam", dir=DIRS, sample=SAMPLES)
rule symlink:
    input:
        expand(config['data']+"{dir}/{{sample}}.fastq")
    output:
        "00-input/{dir}/{sample}.fastq"
    shell:
        "ln -s {input} {output}"
rule map_reads:
    input:
        "data/genome.fa",
        "00-input/{dir}/{sample}.fastq"
    output:
        "results/mapped/{dir}/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"
rule sort_alignments:
    input:
        "results/mapped/{dir}/{sample}.bam"
    output:
        "results/mapped/{dir}/{sample}.sorted.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"`
This is my config file:
`{
"data": "/analysis/Anna/snakemake-demo/data/samples_fastq/"
}`
By running this script I get the following error message:
`WildcardError in line 13:
No values given for wildcard 'dir'.
`
I tried a different approach by modifying the input of rule symlink:
`input:
    expand(config['data']+"{{dir}}/{{sample}}.fastq")
`
And this time I get a different error message:
`Missing input files for rule symlink:`
I have looked through several similar questions on Stack Overflow but have not been able to fix the error so far. I would appreciate it if someone could help me understand where my mistake is and how I can fix it. Thank you.
After trying different approaches and going through several Stack Overflow posts, I finally found the solution, thanks to the very useful answer to the question "Process multiple directories and all files within using snakemake" and the Snakemake FAQ entry (https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-do-i-run-my-rule-on-all-files-of-a-certain-directory). By default, the expand function uses itertools.product to create every combination of the supplied wildcard values. expand takes an optional second positional argument that customizes how the wildcard values are combined. I needed to pass zip there, so that each directory is paired only with its own samples. I also dropped expand from the input of rule symlink, so its wildcards are now resolved from the requested output. Here is my working code, slightly simplified compared to my original question:
#!/usr/bin/env snakemake
configfile:
    "config.json"
DIRS,SAMPLES = glob_wildcards(config['data']+"{dir}/{sample}.fastq")
rule all:
    input:
        expand("results/mapped/{dir}/{sample}.sorted.bam", zip, dir=DIRS, sample=SAMPLES)
rule symlink:
    input:
        config['data']+"{dir}/{sample}.fastq"
    output:
        "00-input/{dir}/{sample}.fastq"
    shell:
        "ln -s {input} {output}"
rule map_reads:
    input:
        fasta="data/genome.fa",
        fastq=rules.symlink.output
    output:
        "results/mapped/{dir}/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"
rule sort_alignments:
    input:
        rules.map_reads.output
    output:
        "results/mapped/{dir}/{sample}.sorted.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"
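To illustrate why zip made the difference, here is a minimal sketch in plain Python. It does not call Snakemake; the `combine` helper is a hypothetical stand-in that only mimics how expand fills a pattern, and the `DIRS` and `SAMPLES` values are assumed examples of the parallel lists that glob_wildcards would return for the directory layout in the question.

```python
from itertools import product

# Assumed example values: glob_wildcards returns parallel lists,
# one entry per matched file, so directory names repeat.
DIRS = ["A", "A", "B", "B"]
SAMPLES = ["A-T1", "A-N1", "B-T1", "B-N1"]

PATTERN = "results/mapped/{dir}/{sample}.sorted.bam"


def combine(pattern, combinator, **wildcards):
    """Hypothetical helper: fill the pattern once per tuple that the
    combinator (itertools.product or zip) yields from the value lists."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, values)))
        for values in combinator(*wildcards.values())
    ]


# product pairs every directory entry with every sample,
# including pairs that do not exist on disk such as A/B-T1.
all_pairs = combine(PATTERN, product, dir=DIRS, sample=SAMPLES)

# zip pairs the lists position by position, keeping only real files.
matched = combine(PATTERN, zip, dir=DIRS, sample=SAMPLES)

print(len(all_pairs))  # 16
print(len(matched))    # 4
for path in matched:
    print(path)
```

With the product, rule all asks for targets such as `results/mapped/A/B-T1.sorted.bam`, whose source fastq does not exist, which is consistent with the "Missing input files" error I saw. With zip, only the directory and sample pairs that were actually found are requested.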