rbioinformaticsbedtools

Parse out gene name after bed and vcf intersect


I'm looking for to add a new column of 'GENE_ID' to the vcf file based on an intersect between bed coordinate and vcf positions. The 'GENE_ID' is already present as the 4th column in the bed file. Basically I need to know how to keep that gene name column (GENE_ID) in the bed file into the new vcf file!

Moreover, I was wondering how to get the result from multiple vcf and bed file in one run.

Let's suppose bed file as follow:

> coord.bed
chr7    71335   73335   ENSG00000232325
chr7    75538   77538   ENSG00000242611
chr7    144930  146930  ENSG00000242474
chr7    148097  150278  ENSG00000240859
chr7    148862  150966  ENSG00000242474
chr7    151179  153179  ENSG00000240859
chr7    164472  166472  ENSG00000261795
chr7    173420  175420  ENSG00000239715

and the first three columns of vcf file:

#CHROM   POS          ID 
7       72339   7_31439_T_A_b37
7       75999   7_31504_G_A_b37
7       146125   7_34713_A_C_b37
7       149978   7_34918_C_T_b37
7       174401   7_35119_G_A_b37

the desired output:

#CHROM   POS           ID               GENE_ID
7       72339   7_31439_T_A_b37     ENSG00000232325  
7       75999   7_31504_G_A_b37     ENSG00000242611   
7       146125   7_34713_A_C_b37    ENSG00000242474
7       150478   7_34918_C_T_b37    ENSG00000242474 
7       174401   7_35119_G_A_b37    ENSG00000239715

I'd appreciate any suggestion!


Solution

  • Here's one way with bedtools. The awks temporarily present a correct format to bedtools and print the desired end result.

    % bedtools intersect\
       -a <(awk -v OFS="\t" '{print "chr"$1,$2,$2,$3}' vcf)\
       -b <(awk -v OFS="\t" '{print $1,$2,$3,$4}' bed)\
       -wb | awk -v OFS="\t" '{print substr($1,4,length($1)-3),$2,$4,$8}'
    7   72339   7_31439_T_A_b37 ENSG00000232325
    7   75999   7_31504_G_A_b37 ENSG00000242611
    7   146125  7_34713_A_C_b37 ENSG00000242474
    7   149978  7_34918_C_T_b37 ENSG00000240859
    7   149978  7_34918_C_T_b37 ENSG00000242474
    7   174401  7_35119_G_A_b37 ENSG00000239715
    

    See bedtools intersect --help for multiple files:

    Note: -b may be followed with multiple databases and/or wildcard (*) character(s).

    bedtools v2.28.0

    Data

    % cat bed
    chr7 71335   73335   ENSG00000232325
    chr7 75538   77538   ENSG00000242611
    chr7 144930  146930  ENSG00000242474
    chr7 148097  150278  ENSG00000240859
    chr7 148862  150966  ENSG00000242474
    chr7 151179  153179  ENSG00000240859
    chr7 164472  166472  ENSG00000261795
    chr7 173420  175420  ENSG00000239715
    
    % cat vcf
    7  72339   7_31439_T_A_b37
    7  75999   7_31504_G_A_b37
    7  146125  7_34713_A_C_b37
    7  149978  7_34918_C_T_b37
    7  174401  7_35119_G_A_b37