I'm trying to optimize an awk code to generate a new columns into my VCF file. This is the input (minimal working example):
#CHROM POS GENE REF ALT QUAL FILTER INFO FORMAT 1 2 3 4 5 6
chr14 33925904 EGLN3 A T . . CSQ=T| GT:AQ:DP:AD:AB:GQ:PL 0/1:.:0:.:0,0:.:. 1/1:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:.
chr14 33925905 EGLN3 G C . . CSQ=C| GT:AQ:DP:AD:AB:GQ:PL ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. 0/0:.:0:.:0,0:.:. 1/0:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:.
chr14 33925907 EGLN3 G A . . CSQ=A| GT:AQ:DP:AD:AB:GQ:PL ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. 0/0:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:.
chr14 33925918 EGLN3 T G . . CSQ=G| GT:AQ:DP:AD:AB:GQ:PL ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. 1/0:.:0:.:0,0:.:.
chr14 33927014 EGLN3 A G . . CSQ=G| GT:AQ:DP:AD:AB:GQ:PL ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. 1/0:.:0:.:0,0:.:.
chr14 33927025 EGLN3 AT A . . CSQ=-| GT:AQ:DP:AD:AB:GQ:PL ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:.
chr14 33929079 EGLN3 G A . . CSQ=A| GT:AQ:DP:AD:AB:GQ:PL ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. ./.:.:0:.:0,0:.:. 0/0:.:0:.:0,0:.:.
I use this code:
paste <(bcftools view $MYVCF | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') \ #prints first columns from vcf - Important if file has a vcf header
<(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" |\
awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') \ #counts samples (1-6) that have 0/1 or 1/0
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" |\
awk 'BEGIN {print "nHomAlt"} {print gsub(/1\|1|1\/1/, "")}') \
\ #counts samples with 1/1
<(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" |\
awk 'BEGIN {print "nHomRef"} {print gsub(/0\|0|0\/0/, "")}') >> example3.txt
Which does the following:
First, it prints the columns CHR POS ID REF ALT and INFO, then searches for GT values in samples (1-6) with specific 0/0 and prints the counts of samples which have this specific GT in a new column named nHomRef. Does the same to 1/0 (and combinations) and output to column named nHet and then finally does the same to 1/1 combinations and outputs to a new column named hHomRef
This is outputting:
#CHROM POS ID REF ALT QUAL FILTER INFO nHet nHomAlt nHomRef
chr14 33925904 EGLN3 A T . . CSQ=T| 1 1 0
chr14 33925905 EGLN3 G C . . CSQ=C| 1 0 1
chr14 33925907 EGLN3 G A . . CSQ=A| 0 0 1
chr14 33925918 EGLN3 T G . . CSQ=G| 1 0 0
chr14 33927014 EGLN3 A G . . CSQ=G| 1 0 0
chr14 33927025 EGLN3 AT A . . CSQ=-| 0 0 0
chr14 33929079 EGLN3 G A . . CSQ=A| 0 0 1
I'm trying to optimize the code so not many subprocesses are created. To simplify I used only two columns
paste <(bcftools view "$MYVCF" | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" | awk 'BEGIN {print "nHomAlt\tnHet"} {print gsub(/1\|1|1\/1/, "")} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') >> example3_test.txt
How can I output the gsub, inside awk, to the correct column? So, this snippet
awk 'BEGIN {print "nHomAlt\tnHet"} {print gsub(/1\|1|1\/1/, "")} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}'
To output the correct columns? I`m getting:
#CHROM POS ID REF ALT QUAL FILTER INFO nHomAlt nHet
chr14 33925904 EGLN3 A T . . CSQ=T| 1
chr14 33925905 EGLN3 G C . . CSQ=C| 1
chr14 33925907 EGLN3 G A . . CSQ=A| 0
chr14 33925918 EGLN3 T G . . CSQ=G| 1
chr14 33927014 EGLN3 A G . . CSQ=G| 0
chr14 33927025 EGLN3 AT A . . CSQ=-| 0
chr14 33929079 EGLN3 G A . . CSQ=A| 0
1
0
1
0
0
0
0
PS: Using
paste <(bcftools view $MYVCF | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" | awk 'BEGIN {OFS="\t"; print "nHomAlt\tnHet"} {
nHet=gsub(/0\|1|1\|0|0\/1|1\/0/, "");
nHomAlt=gsub(/1\|1|1\/1/, "");
print nHomAlt,nHet;
}')
Seems to give the correct output!
First, any time your goal includes "to optimize", your first effort should be to eliminate repeating any work... such as the repeats of
bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" | awk
Part of your problem is trying to push your data through paste
when you already have it in awk
. I think what you really need to do is remember that print
is going to append an output record separator, or ORS, which by default is a newline character. You could use printf
to output each field, but I think it's easier to just pass your fields as arguments. You can also set the OFS (output field separator) to delimit them with a tab by default.
Working from your input example, which I stored in a file, you probably just need to manage your data collection and formatting. Here's one possible example:
$: awk 'BEGIN {OFS="\t"; print "nHomAlt\tnHet"} {
nHet=gsub(/0\|1|1\|0|0\/1|1\/0/, "");
nHomAlt=gsub(/1\|1|1\/1/, "");
print nHomAlt,nHet;
}' input
nHomAlt nHet
1 3
0 3
3 2
0 1
0 3
0 0
0 7
0 0
You said you wanted counts, and that's what gsub
returns, so I assume that's why you used it.
What you want probably looks something like this:
awk 'BEGIN { OFS="\t"; print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tnHet\tnHomAlt\tnHomRef" } {
nHet=gsub(/0[|/]1|1[|/]0/, "");
nHomAlt=gsub(/1[|/]1/, "");
nHomRef=gsub(/0[|/]0/, "");
print $1,$2,$3,$4,$5,$6,$7,$8,nHet,nHomAlt,nHomRef;
}' input # maybe input should be < <( <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" )
...which shouldn't need the paste
or the repeated calls to awk
.
You obviously already know how to substitute your command in for the input. Hopefully that will get you close enough.