pythonsnakemakegatk

Snakemake first genotype of a vcf file as wildcard in output


In the second rule I would like to select from the vcf file containing bob, clara and tim, only the first genotype of dictionary (i.e. bob) in roder to get as output in the second rule bob.dn.vcf. Is this possible in snakemake?

    d = {"FAM1": ["bob.bam", "clara.bam", "tim.bam"]}
    FAMILIES = list(d)
    
    rule all:
        input:
            expand some outputs
            
    wildcard_constraints:
        family = "|".join(FAMILIES)
    
    rule somerulename:
        input:
            lambda w: d[w.family]
        output:
            vcf="{family}/{family}.vcf"
        shell:
            """
            some python command line which produces a single vcf file with bob, clara and tim
            """
    
    rule somerulename:
        input:
            invcf="{family}/{family}.vcf"
        params:
            ref="someref.fasta"
        output:
            out="{family}/{bob}.dn.vcf"
        shell:
            """
            gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -O {output.out}
            """

Solution

  • There are at least two options:

    1. explicitly specify output:
    rule somerulename:
        output:
            out="FAM1/bob.dn.vcf"
    
    1. impose constraints on wildcard values:
    rule somerulename:
        output:
            out="{family}/{bob}.dn.vcf"
        wildcard_constraints:
            family="FAM1",
            bob="bob",
    
    1. control what is produced by specifying appropriate inputs to rule all:
    rule all:
        input: "FAM1/bob.dn.vcf", "FAM2/alice.dn.vcf"