I'm running snakemake pipeline but i get KeyError: when i added longphase_phase rule. i seem not to find how to debug it. The longphase_phase rule uses as input a) each bam file from the samples in yaml file, b)modcallfile vcf file from rule longphase_modcall
c)snpFile from rule call_snps_indels
.
Building DAG of jobs...
InputFunctionException in rule call_snps_indels in file /data1/greenbab/users/ahunos/apps/workflows/phased_modifications/phase_5mC.smk, line 34:
Error:
KeyError: 'D-0-2_5000/D-0-2_5000'
Wildcards:
samples=D-0-2_5000/D-0-2_5000
Traceback:
File "/data1/greenbab/users/ahunos/apps/workflows/phased_modifications/phase_5mC.smk", line 36, in <lambda> (rule call_snps_indels, line 45, /data1/greenbab/users/ahunos/apps/workflows/phased_modifications/phase_5mC.smk)
purpose of script: phasing of snps and 5mc
import config files
parent_dir = "/data1/greenbab/users/ahunos/apps/workflows/phased_modifications/"
set_species = "mouse"
configfile: parent_dir + "config/config.yaml"
configfile: parent_dir + "config/samples_5mC_5000sampleRate.yaml"
pls see rules
rule all:
input:
expand('results/call_snps_indels/{samples}/snv.vcf.gz', samples=config["samples"]),
expand('results/call_snps_indels/{samples}/indel.vcf.gz', samples=config["samples"]),
expand('results/call_snps_indels/{samples}/done.{samples}.txt', samples=config["samples"]),
expand('results/longphase_modcall/{samples}/modcall_{samples}.vcf', samples=config["samples"]),
expand('results/longphase_modcall/{samples}/done.{samples}.txt', samples=config["samples"]),
expand('results/longphase_phase/{samples}/{samples}.vcf', samples=config["samples"]),
expand('results/longphase_phase/{samples}/{samples}_mod.vcf', samples=config["samples"]),
expand('results/longphase_phase/{samples}/done.{samples}.txt', samples=config["samples"])
rule call_snps_indels:
input:
lambda wildcards: config["samples"][wildcards.samples]
params:
reference_genome=lambda wildcards: config["mm10"] if set_species == "mouse" else config["hg38"],
software_dir=config["software_dir"],
out_dir='results/call_snps_indels/{samples}/',
threads=12,
PLATFORM='ont_r10_dorado_sup_5khz'
singularity: "/data1/greenbab/users/ahunos/apps/containers/clairs-to_latest.sif"
output:
done='results/call_snps_indels/{samples}/done.{samples}.txt',
snps='results/call_snps_indels/{samples}/snv.vcf.gz',
indels='results/call_snps_indels/{samples}/indel.vcf.gz'
log:
"logs/call_snps_indels/{samples}/{samples}.log"
shell:
"""
/opt/bin/run_clairs_to \
--tumor_bam_fn {input} \
--ref_fn {params.reference_genome} \
--threads {params.threads} \
--platform {params.PLATFORM} \
--output_dir {params.out_dir} \
--ctg_name="chr19" \
--conda_prefix /opt/micromamba/envs/clairs-to && touch {output.done} 2> {log}
"""
rule longphase_modcall:
input:
bams=lambda wildcards: config["samples"][wildcards.samples]
params:
reference_genome=lambda wildcards: config["mm10"] if set_species == "mouse" else config["hg38"],
software_dir=config["software_dir"],
lphase='/data1/greenbab/users/ahunos/apps/longphase_linux-x64',
threads=12,
out_modcall_prefix='results/longphase_modcall/{samples}/modcall_{samples}'
output:
done_modcall='results/longphase_modcall/{samples}/done.{samples}.txt',
out_modcall='results/longphase_modcall/{samples}/modcall_{samples}.vcf'
log:
"logs/longphase_modcall/{samples}/{samples}.log"
shell:
"""
/data1/greenbab/users/ahunos/apps/longphase_linux-x64 modcall -b {input.bams} -t 8 -o {output.out_modcall} -r {params.reference_genome} && touch {output.done_modcall} 2> {log}
mv results/longphase_modcall/{wildcards.samples}/modcall_{wildcards.samples}.vcf.vcf results/longphase_modcall/{wildcards.samples}/modcall_{wildcards.samples}.vcf
"""
rule longphase_phase:
input:
bamfile=lambda wildcards: config["samples"][wildcards.samples],
modcallfile='results/longphase_modcall/{samples}/modcall_{samples}.vcf',
snpFile='results/call_snps_indels/{samples}/{samples}/snv.vcf.gz'
params:
reference_genome=lambda wildcards: config["mm10"] if set_species == "mouse" else config["hg38"],
lphase='/data1/greenbab/users/ahunos/apps/longphase_linux-x64',
threads=12,
out_lphase_prefix='results/longphase_phase/{samples}/{samples}'
output:
co_phased_mod_vcf='results/longphase_phase/{samples}/{samples}_mod.vcf',
co_phased_vcf='results/longphase_phase/{samples}/{samples}.vcf',
done_longphase_phase='results/longphase_phase/{samples}/done.{samples}.txt'
log:
"logs/longphase_phase/{samples}/{samples}.log"
shell:
"""
/data1/greenbab/users/ahunos/apps/longphase_linux-x64 phase \
-s {input.snpFile} \
--mod-file {input.modcallfile} \
-b {input.bamfile} \
-r {params.reference_genome} \
-t {params.threads} \
-o {params.out_lphase_prefix} \
--ont && touch {output.done_longphase_phase}
"""
please see sample yaml
$ cat /config/samples_5mC_5000sampleRate.yaml
samples:
D-0-2_5000: /data1/greenbab/projects/triplicates_epigenetics_diyva/DNA/preprocessed/results/mark_duplicates/D-0-2_5000/D-0-2_5000_modBaseCalls_sorted_dup.bam
The error message is stemming from the line:
lambda wildcards: config["samples"][wildcards.samples]
This is an input function (hence the InputFunctionException
) expressed as an anonymous lambda expression. This part itself looks fine.
config["samples"]
is a dict containing a single key: 'D-0-2_5000'
but the lambda expression is trying to look up the key 'D-0-2_5000/D-0-2_5000'
which is an error and raises the exception. Snakemake clarifies:
Wildcards:
samples=D-0-2_5000/D-0-2_5000
So, why has Snakemake apparently put the sample value into the wildcard twice like this? Well, it's because the final input to your longphase_phase
rule is:
snpFile='results/call_snps_indels/{samples}/{samples}/snv.vcf.gz'
And Snakemake matches this to the second output of your call_snps_indels
rule:
snps='results/call_snps_indels/{samples}/snv.vcf.gz'
You can see that {samples}/{samples}
accounts for the unexpected 'D-0-2_5000/D-0-2_5000'
value which is being substituted into the whole {samples}
wildcard.
My suggestion is that, first off, you should add this to your Snakefile, before the rules:
wildcard_constraints:
samples = r"[^/]+",
This will prevent the {samples}
wildcard from matching across multiple path elements. It won't completely fix the workflow but it will make debugging easier as you will get a much more useful error. Setting reasonable wildcard_contraints like this is a good idea even if you don't strictly need them for the workflow to operate.