linuxwildcardsnakemakegatk

Snakemake: create multiple wildcards for the same argument


I am trying to run a GenotypeGVCFs on many vcf files. The command line wants every single vcf files be listed as:

java-jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
-R my.fasta \
-V bob.vcf \
-V smith.vcf \
-V kelly.vcf \
-o {output.out}

How to do this in snakemake? This is my code, but I do not know how to create a wildcards for -V.

workdir: "/path/to/workdir/"

SAMPLES=["bob","smith","kelly]
print (SAMPLES)

rule all:
    input:
      "all_genotyped.vcf"

rule genotype_GVCFs:
    input:
        lambda w: "-V" + expand("{sample}.vcf", sample=SAMPLES)
    params:
        ref="my.fasta"
    output:
        out="all_genotyped.vcf"
    shell:
        """
        java-jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R {params.ref} {input} -o {output.out}
        """

Solution

  • You are putting the cart before the horse. Wildcards are needed for rule generalization: you can define a pattern for a rule where wildcards are used to define generic parts. In your example there are no patterns: everything is defined by the value of SAMPLES. This is not a recommended way to use Snakemake; the pipeline should be defined by the filesystem: which files are present on your disk.

    By the way, your code will not work, as the input shall define the list of filenames, while in your example you are (incorrectly) trying to define the strings like "-V filename".

    So, you have the output: "all_genotyped.vcf". You have the input: ["bob.vcf", "smith.vcf", "kelly.vcf"]. You don't even need to use a lambda here, as the input doesn't depend on any wildcard. So, you have:

    rule genotype_GVCFs:
        input:
            expand("{sample}.vcf", sample=SAMPLES)
        output:
            "all_genotyped.vcf"
        ...
    

    Actually you don't even need input section. If you know for sure that the files from SAMPLES list exist, you may skip it.

    The values for -V can be defined in params:

    rule genotype_GVCFs:
        #input:
        #    expand("{sample}.vcf", sample=SAMPLES)
        output:
            "all_genotyped.vcf"
        params:
            ref = "my.fasta",
            vcf = expand("-V {sample}", sample=SAMPLES)
        shell:
            """
            java-jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R {params.ref} {params.vcf} -o {output}
            """
    

    This should solve your issue, but I would advise you to rethink your solution. The use of SAMPLE list smells. Alternatively: do you really need Snakemake if you have all dependencies defined already?