I'm looking for to add a new column of 'GENE_ID' to the vcf file based on an intersect between bed coordinate and vcf positions. The 'GENE_ID' is already present as the 4th column in the bed file. Basically I need to know how to keep that gene name column (GENE_ID) in the bed file into the new vcf file!
Moreover, I was wondering how to get the result from multiple vcf and bed file in one run.
Let's suppose bed file as follow:
> coord.bed
chr7 71335 73335 ENSG00000232325
chr7 75538 77538 ENSG00000242611
chr7 144930 146930 ENSG00000242474
chr7 148097 150278 ENSG00000240859
chr7 148862 150966 ENSG00000242474
chr7 151179 153179 ENSG00000240859
chr7 164472 166472 ENSG00000261795
chr7 173420 175420 ENSG00000239715
and the first three columns of vcf file:
#CHROM POS ID
7 72339 7_31439_T_A_b37
7 75999 7_31504_G_A_b37
7 146125 7_34713_A_C_b37
7 149978 7_34918_C_T_b37
7 174401 7_35119_G_A_b37
the desired output:
#CHROM POS ID GENE_ID
7 72339 7_31439_T_A_b37 ENSG00000232325
7 75999 7_31504_G_A_b37 ENSG00000242611
7 146125 7_34713_A_C_b37 ENSG00000242474
7 150478 7_34918_C_T_b37 ENSG00000242474
7 174401 7_35119_G_A_b37 ENSG00000239715
I'd appreciate any suggestion!
Here's one way with bedtools
. The awk
s temporarily present a correct format to bedtools
and print the desired end result.
% bedtools intersect\
-a <(awk -v OFS="\t" '{print "chr"$1,$2,$2,$3}' vcf)\
-b <(awk -v OFS="\t" '{print $1,$2,$3,$4}' bed)\
-wb | awk -v OFS="\t" '{print substr($1,4,length($1)-3),$2,$4,$8}'
7 72339 7_31439_T_A_b37 ENSG00000232325
7 75999 7_31504_G_A_b37 ENSG00000242611
7 146125 7_34713_A_C_b37 ENSG00000242474
7 149978 7_34918_C_T_b37 ENSG00000240859
7 149978 7_34918_C_T_b37 ENSG00000242474
7 174401 7_35119_G_A_b37 ENSG00000239715
See bedtools intersect --help
for multiple files:
Note: -b may be followed with multiple databases and/or wildcard (*) character(s).
bedtools v2.28.0
% cat bed
chr7 71335 73335 ENSG00000232325
chr7 75538 77538 ENSG00000242611
chr7 144930 146930 ENSG00000242474
chr7 148097 150278 ENSG00000240859
chr7 148862 150966 ENSG00000242474
chr7 151179 153179 ENSG00000240859
chr7 164472 166472 ENSG00000261795
chr7 173420 175420 ENSG00000239715
% cat vcf
7 72339 7_31439_T_A_b37
7 75999 7_31504_G_A_b37
7 146125 7_34713_A_C_b37
7 149978 7_34918_C_T_b37
7 174401 7_35119_G_A_b37