I'm using bcftools consensus
to extract haplotypes from a vcf file.
Given the input files:
A.sorted.bam
B.sorted.bam
The following output files are created:
A.hap1.fna
A.hap2.fna
B.hap1.fna
B.hap2.fna
I currently have two rules to do this. They differ only by the numbers 1 and 2 in the output files and shell command. Code:
rule consensus1:
input:
vcf="variants/phased.vcf.gz",
tbi="variants/phased.vcf.gz.tbi",
bam="alignments/{sample}.sorted.bam"
output:
"haplotypes/{sample}.hap1.fna"
params:
sample="{sample}"
shell:
"bcftools consensus -i -s {params.sample} -H 1 -f {reference_file} {input.vcf} > {output}"
rule consensus2:
input:
vcf="variants/phased.vcf.gz",
tbi="variants/phased.vcf.gz.tbi",
bam="alignments/{sample}.sorted.bam"
output:
"haplotypes/{sample}.hap2.fna"
params:
sample="{sample}"
shell:
"bcftools consensus -i -s {params.sample} -H 2 -f {reference_file} {input.vcf} > {output}"
While this code works, it seems that there should be a better, more pythonic way to do this using only one rule. Is it possible to collapse this into one rule, or is my current method the best way?
Use wildcards for haplotypes 1 and 2 in rule all
. See here to learn more about adding targets via rule all
reference_file = "ref.txt"
rule all:
input:
expand("haplotypes/{sample}.hap{hap_no}.fna",
sample=["A", "B"], hap_no=["1", "2"])
rule consensus1:
input:
vcf="variants/phased.vcf.gz",
tbi="variants/phased.vcf.gz.tbi",
bam="alignments/{sample}.sorted.bam"
output:
"haplotypes/{sample}.hap{hap_no}.fna"
params:
sample="{sample}",
hap_no="{hap_no}"
shell:
"bcftools consensus -i -s {params.sample} -H {params.hap_no} \
-f {reference_file} {input.vcf} > {output}"