pythonsnakemake

snakemake InputFunctionException in rule KeyError


I'm running snakemake pipeline but i get KeyError: when i added longphase_phase rule. i seem not to find how to debug it. The longphase_phase rule uses as input a) each bam file from the samples in yaml file, b)modcallfile vcf file from rule longphase_modcall c)snpFile from rule call_snps_indels.

Building DAG of jobs...
InputFunctionException in rule call_snps_indels in file /data1/greenbab/users/ahunos/apps/workflows/phased_modifications/phase_5mC.smk, line 34:
Error:
  KeyError: 'D-0-2_5000/D-0-2_5000'
Wildcards:
  samples=D-0-2_5000/D-0-2_5000
Traceback:
  File "/data1/greenbab/users/ahunos/apps/workflows/phased_modifications/phase_5mC.smk", line 36, in <lambda> (rule call_snps_indels, line 45, /data1/greenbab/users/ahunos/apps/workflows/phased_modifications/phase_5mC.smk)

purpose of script: phasing of snps and 5mc

import config files

parent_dir = "/data1/greenbab/users/ahunos/apps/workflows/phased_modifications/"
set_species = "mouse"
configfile: parent_dir + "config/config.yaml"
configfile: parent_dir + "config/samples_5mC_5000sampleRate.yaml" 

pls see rules

rule all:
    input:
        expand('results/call_snps_indels/{samples}/snv.vcf.gz', samples=config["samples"]),
        expand('results/call_snps_indels/{samples}/indel.vcf.gz', samples=config["samples"]),
        expand('results/call_snps_indels/{samples}/done.{samples}.txt', samples=config["samples"]),
        expand('results/longphase_modcall/{samples}/modcall_{samples}.vcf', samples=config["samples"]),
        expand('results/longphase_modcall/{samples}/done.{samples}.txt', samples=config["samples"]),
        expand('results/longphase_phase/{samples}/{samples}.vcf', samples=config["samples"]),
        expand('results/longphase_phase/{samples}/{samples}_mod.vcf', samples=config["samples"]),
        expand('results/longphase_phase/{samples}/done.{samples}.txt', samples=config["samples"])

rule call_snps_indels:
    input:
        lambda wildcards: config["samples"][wildcards.samples]
    params:
        reference_genome=lambda wildcards: config["mm10"] if set_species == "mouse" else config["hg38"],
        software_dir=config["software_dir"],
        out_dir='results/call_snps_indels/{samples}/',
        threads=12,
        PLATFORM='ont_r10_dorado_sup_5khz'
    singularity: "/data1/greenbab/users/ahunos/apps/containers/clairs-to_latest.sif"
    output:
        done='results/call_snps_indels/{samples}/done.{samples}.txt',
        snps='results/call_snps_indels/{samples}/snv.vcf.gz',
        indels='results/call_snps_indels/{samples}/indel.vcf.gz'
    log:
      "logs/call_snps_indels/{samples}/{samples}.log"
    shell:
        """
/opt/bin/run_clairs_to \
  --tumor_bam_fn {input} \
  --ref_fn {params.reference_genome} \
  --threads {params.threads} \
  --platform {params.PLATFORM} \
  --output_dir {params.out_dir} \
  --ctg_name="chr19" \
  --conda_prefix /opt/micromamba/envs/clairs-to && touch {output.done} 2> {log}
        """

rule longphase_modcall:
    input:
        bams=lambda wildcards: config["samples"][wildcards.samples]
    params:
        reference_genome=lambda wildcards: config["mm10"] if set_species == "mouse" else config["hg38"],
        software_dir=config["software_dir"],
        lphase='/data1/greenbab/users/ahunos/apps/longphase_linux-x64',
        threads=12,
        out_modcall_prefix='results/longphase_modcall/{samples}/modcall_{samples}'
    output:
        done_modcall='results/longphase_modcall/{samples}/done.{samples}.txt',
        out_modcall='results/longphase_modcall/{samples}/modcall_{samples}.vcf'
    log:
      "logs/longphase_modcall/{samples}/{samples}.log"
    shell:
        """
/data1/greenbab/users/ahunos/apps/longphase_linux-x64 modcall -b {input.bams} -t 8 -o {output.out_modcall} -r {params.reference_genome} && touch {output.done_modcall} 2> {log}
mv results/longphase_modcall/{wildcards.samples}/modcall_{wildcards.samples}.vcf.vcf results/longphase_modcall/{wildcards.samples}/modcall_{wildcards.samples}.vcf
        """

rule longphase_phase:
    input:
        bamfile=lambda wildcards: config["samples"][wildcards.samples],
        modcallfile='results/longphase_modcall/{samples}/modcall_{samples}.vcf',
        snpFile='results/call_snps_indels/{samples}/{samples}/snv.vcf.gz'
    params:
        reference_genome=lambda wildcards: config["mm10"] if set_species == "mouse" else config["hg38"],
        lphase='/data1/greenbab/users/ahunos/apps/longphase_linux-x64',
        threads=12,
        out_lphase_prefix='results/longphase_phase/{samples}/{samples}'
    output:
        co_phased_mod_vcf='results/longphase_phase/{samples}/{samples}_mod.vcf',
        co_phased_vcf='results/longphase_phase/{samples}/{samples}.vcf',
        done_longphase_phase='results/longphase_phase/{samples}/done.{samples}.txt'
    log:
      "logs/longphase_phase/{samples}/{samples}.log"
    shell:
        """
/data1/greenbab/users/ahunos/apps/longphase_linux-x64 phase \
-s {input.snpFile} \
--mod-file {input.modcallfile} \
-b {input.bamfile} \
-r {params.reference_genome} \
-t {params.threads} \
-o {params.out_lphase_prefix} \
--ont && touch {output.done_longphase_phase}
       """

please see sample yaml

$ cat /config/samples_5mC_5000sampleRate.yaml 
samples:
    D-0-2_5000: /data1/greenbab/projects/triplicates_epigenetics_diyva/DNA/preprocessed/results/mark_duplicates/D-0-2_5000/D-0-2_5000_modBaseCalls_sorted_dup.bam

Solution

  • The error message is stemming from the line:

    lambda wildcards: config["samples"][wildcards.samples]
    

    This is an input function (hence the InputFunctionException) expressed as an anonymous lambda expression. This part itself looks fine.

    config["samples"] is a dict containing a single key: 'D-0-2_5000' but the lambda expression is trying to look up the key 'D-0-2_5000/D-0-2_5000' which is an error and raises the exception. Snakemake clarifies:

    Wildcards:
      samples=D-0-2_5000/D-0-2_5000
    

    So, why has Snakemake apparently put the sample value into the wildcard twice like this? Well, it's because the final input to your longphase_phase rule is:

    snpFile='results/call_snps_indels/{samples}/{samples}/snv.vcf.gz'
    

    And Snakemake matches this to the second output of your call_snps_indels rule:

    snps='results/call_snps_indels/{samples}/snv.vcf.gz'
    

    You can see that {samples}/{samples} accounts for the unexpected 'D-0-2_5000/D-0-2_5000' value which is being substituted into the whole {samples} wildcard.

    My suggestion is that, first off, you should add this to your Snakefile, before the rules:

    wildcard_constraints:
        samples = r"[^/]+",
    

    This will prevent the {samples} wildcard from matching across multiple path elements. It won't completely fix the workflow but it will make debugging easier as you will get a much more useful error. Setting reasonable wildcard_contraints like this is a good idea even if you don't strictly need them for the workflow to operate.