bashawkbioinformaticsvcf-variant-call-format

Extracting vcf columns substring with awk


I have vcf file like this:

##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG005
chr1    817186  rs3094315   G   A   50  PASS    platforms=2;platformnames=Illumina,CG;datasets=3;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair;callsets=5;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ    1/1:.:809:0,363:78,428:237
chr1    817341  rs3131972   A   G   50  PASS    platforms=3;platformnames=Illumina,CG,Solid;datasets=4;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair,SolidSE75bp;callsets=6;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2   GT:PS:DP:ADALL:AD:GQ    1/1:.:732:1,330:99,391:302

I need to extract ID column and AN from INFO column to get:

ID         INFO
rs3094315   2
rs3131972   2

I'm trying something like this awk '/^[^#]/ { print $3, gsub(/^[^AN=])/,"",$8)}' file.vcf, but still not getting the desired result.


Solution

  • Usual reminder, there are dedicated tools for this kind of thing and there's no reason to use something like awk or sed, especially if you are a beginner and you don't really understand what the command is doing, as there are many pitfalls of parsing vcf files. There's every reason to think an awk script will silenty fail on a new/previous version of the vcf format in some weird and wonderful.

    Simple bcftools solution:

    bcftools query -f'%ID [%AN]\n' in.vcf > out.txt
    

    bcftools format extracts fields from the vcf. Within the -f formqt flag, for INFO fields, use the format %FIELD and for FORMAT fields, use the format [%FIELD].