mergesnakemakevcf-vcard

Merging several vcf files using snakemake


I am trying to merge several vcf files by chromosome using snakemake. My files are like this, and as you can see has various coordinates. What is the best way to merge all chr1A and all chr1B?

chr1A:0-2096.filtered.vcf
chr1A:2096-7896.filtered.vcf
chr1B:0-3456.filtered.vcf
chr1B:3456-8796.filtered.vcf

My pseudocode:

chromosomes=["chr1A","chr1B"]

rule all:
    input: 
        expand("{sample}.vcf", sample=chromosomes)

rule merge:
    input:
        I1="path/to/file/{sample}.xxx.filtered.vcf",
        I2="path/to/file/{sample}.xxx.filtered.vcf",
    output:
        outf ="{sample}.vcf"
    shell:
        """
        java -jar picard.jar GatherVcfs I={input.I1} I={input.I2} O={output.outf}
        """

EDIT:

workdir: "/media/prova/Maxtor2/vcf2/merged/"

import subprocess

d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
     "chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}


rule all:
    input:
        expand("{sample}.vcf", sample=d)

def f(w):
    return d.get(w.chromosome, "")


rule merge:
    input:
        f
    output:
        outf ="{chromosome}.vcf"
    params:
        lambda w: "I=" + " I=".join(d[w.chromosome])
    shell:
        "java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"

Solution

  • I was able to reproduce your bug. When constraining the wildcards, it works:

    d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
         "chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}
    
    chromosomes = list(d)
    
    rule all:
        input:
            expand("{sample}.vcf", sample=chromosomes)
    
    # these tell Snakemake exactly what values the wildcards may take
    # we use "|" to create the regex chr1A|chr1B
    wildcard_constraints:
        chromosome = "|".join(chromosomes)
    
    rule merge:
        input:
            # a lambda is an unnamed function
            # the first argument is the wildcards
            # we merely use it to look up the appropriate files in the dict d
            lambda w: d[w.chromosome]
        output:
            outf = "{chromosome}.vcf"
        params:
            # here we create the string 
            # "I=chr1A:0-2096.flanking.view.filtered.vcf I=chr1A:2096-7896.flanking.view.filtered.vcf"
            # for use in our command
            lambda w: "I=" + " I=".join(d[w.chromosome])
        shell:
            "java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"
    

    It should have worked without the constraints too; this seems like a bug in Snakemake.