I have a VCF file like this:
##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG005
chr1 817186 rs3094315 G A 50 PASS platforms=2;platformnames=Illumina,CG;datasets=3;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair;callsets=5;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:809:0,363:78,428:237
chr1 817341 rs3131972 A G 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=4;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair,SolidSE75bp;callsets=6;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:732:1,330:99,391:302
I need to extract the ID column and the AN value from the INFO column to get:
ID INFO
rs3094315 2
rs3131972 2
I'm trying something like this: awk '/^[^#]/ { print $3, gsub(/^[^AN=])/,"",$8)}' file.vcf, but I'm still not getting the desired result.
Usual reminder: there are dedicated tools for this kind of thing, and there's little reason to use something like awk or sed, especially if you are a beginner and don't really understand what the command is doing, as there are many pitfalls in parsing VCF files. There's every reason to think an awk script will silently fail on a newer or older version of the VCF format in some weird and wonderful way.
Simple bcftools solution:
bcftools query -f '%ID %AN\n' in.vcf > out.txt
bcftools query extracts fields from the VCF. Within the -f format flag, use %FIELD for INFO fields and [%FIELD] for FORMAT fields; the brackets loop over the samples.
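For completeness, here is why the awk attempt in the question misbehaves, plus a minimal sketch of an awk fallback for when bcftools really isn't available. gsub() returns the number of substitutions it made rather than the modified string, and [^AN=] is a character class ("any character other than A, N or =") rather than the literal text AN=. The sketch below assumes a tab-separated VCF with INFO in column 8; sample.vcf and its trimmed INFO values are made up for illustration, and the caveats above about hand-parsing VCF still apply.

```shell
# Build a tiny tab-separated VCF (hypothetical file name, shortened INFO column).
printf '##fileformat=VCFv4.2\n' > sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> sample.vcf
printf 'chr1\t817186\trs3094315\tG\tA\t50\tPASS\tplatforms=2;AN=2;AF=1;AC=2\n' >> sample.vcf
printf 'chr1\t817341\trs3131972\tA\tG\t50\tPASS\tplatforms=3;AN=2;AF=1;AC=2\n' >> sample.vcf

# Print ID and the value of the AN key from INFO for every record.
result=$(awk -F'\t' '
  /^#/ { next }                      # skip meta and header lines
  {
    an = "."                         # VCF placeholder for a missing value
    n = split($8, kv, ";")           # INFO is a ;-separated list of KEY=VALUE
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^AN=/) an = substr(kv[i], 4)   # keep the text after AN=
    print $3, an
  }
' sample.vcf)

printf '%s\n' "$result"
```

Splitting INFO on ; and matching ^AN= on each piece avoids accidentally matching a longer key that merely ends in AN, which a plain regex over the whole column would do.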