awk

Compare column 2 of file 1 with columns 4 and 5 of file 2


I have a tab-delimited file, file_1:

NC_025       4569   .   KX838946.2      
NC_025       16546  .   KJ641660.1      
NC_025       11996  .   KX932454.2

file_2

NC_025.1     RefSeq  gene    5690    7513    .       +       .       ID=gene-NZ82_gp4;Dbxref=GeneID:20964334;Name=NZ82_gp4;gbkey=Gene;gene_biotype=protein_coding;locus_tag=NZ82_gp4
NC_025.1     RefSeq  gene    4612    10046   .       +       .       ID=gene-NZ82_gp5;Dbxref=GeneID:20964335;Name=NZ82_gp5;gbkey=Gene;gene_biotype=protein_coding;locus_tag=NZ82_gp5
NC_025.1     RefSeq  gene    10337   16933   .       +       .       ID=gene-NZ82_gp6;Dbxref=GeneID:20964336;Name=NZ82_gp6;gbkey=Gene;gene_biotype=protein_coding;locus_tag=NZ82_gp6
NC_025.1     RefSeq  gene    9000    12000    .      +       .       ID=gene-AL82_gp5;Dbxref=GeneID:109647334;Name=AL82_gp5;gbkey=Gene;gene_biotype=protein_coding;locus_tag=AL82_gp5

I want to compare column 2 of file_1 with columns 4 and 5 of file_2. If column 2 of file_1 is >= column 4 and <= column 5 of the same row of file_2, I want to combine the whole line of file_1 with that line of file_2. Expected output:

NC_025       16546   .   KJ641660.1     NC_025.1     RefSeq  gene    10337   16933   .       +       .       ID=gene-NZ82_gp6;Dbxref=GeneID:20964336;Name=NZ82_gp6;gbkey=Gene;gene_biotype=protein_coding;locus_tag=NZ82_gp6
NC_025       11996   .   KX932454.2     NC_025.1     RefSeq  gene    10337   16933   .       +       .       ID=gene-NZ82_gp6;Dbxref=GeneID:20964336;Name=NZ82_gp6;gbkey=Gene;gene_biotype=protein_coding;locus_tag=NZ82_gp6
NC_025       11996   .   KX932454.2     NC_025.1     RefSeq  gene    9000    12000    .       +       .       ID=gene-AL82_gp5;Dbxref=GeneID:109647334;Name=AL82_gp5;gbkey=Gene;gene_biotype=protein_coding;locus_tag=AL82_gp5

I have tried:

awk '{
  if (NR==FNR) {
  l[NR]=$0
  a[NR]=$2
 }
  else if (a[FNR]>=$4 && a[FNR]<=$5) {
  print l[FNR],$0
 }
}' file_1 file_2 > File_3

But it prints nothing.


Solution

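  • So you basically want to join the two files on a range criterion: every line of file_1 has to be tested against every line of file_2. Your attempt only compares line N of file_1 with line N of file_2 (a[FNR]), so most combinations are never tested. After storing the first file, you need to iterate over all of its lines for each line in the second file.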

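    awk '
      # First file: store each whole line and its position (column 2).
      NR==FNR {a[NR]=$0; p[NR]=$2; next}
      # Second file: print every stored line whose position lies in [$4, $5].
              {for (n in a) if ($4<=p[n] && p[n]<=$5) print a[n] "\t" $0}
    ' file_1.txt file_2.txt > file_3.txt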
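
To check the approach against the sample data from the question, here is a minimal sketch. It makes a few assumptions that are not in the original: the helper name range_join is hypothetical, the input really is tab-delimited (hence -F '\t'), and the long attribute column of file_2 is shortened to the ID for readability. It also walks the stored lines by index rather than with for (n in a), because awk does not guarantee the traversal order of for ... in, and an index loop keeps the matches in file_1 order.

```shell
#!/bin/sh
# range_join is a hypothetical helper name, not part of awk.
# For each line of file $2, print every line of file $1 whose column 2
# lies within [column 4, column 5] of that line, followed by the line itself.
range_join() {
  awk -F '\t' '
    # First file: remember whole lines and positions in input order.
    NR==FNR { line[++cnt] = $0; pos[cnt] = $2; next }
    # Second file: test every stored position against this range.
    # Adding 0 forces a numeric comparison.
    {
      for (i = 1; i <= cnt; i++)
        if ($4+0 <= pos[i]+0 && pos[i]+0 <= $5+0)
          print line[i] "\t" $0
    }
  ' "$1" "$2"
}

# Recreate the sample files in a temporary directory.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# printf reuses the format string for each group of four arguments.
printf '%s\t%s\t%s\t%s\n' \
  NC_025 4569  . KX838946.2 \
  NC_025 16546 . KJ641660.1 \
  NC_025 11996 . KX932454.2 > "$tmp/file_1"

# Nine columns per line; the attribute column is shortened (assumption).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  NC_025.1 RefSeq gene 5690  7513  . + . ID=gene-NZ82_gp4 \
  NC_025.1 RefSeq gene 4612  10046 . + . ID=gene-NZ82_gp5 \
  NC_025.1 RefSeq gene 10337 16933 . + . ID=gene-NZ82_gp6 \
  NC_025.1 RefSeq gene 9000  12000 . + . ID=gene-AL82_gp5 > "$tmp/file_2"

range_join "$tmp/file_1" "$tmp/file_2"
```

On this data it should print three lines, in the same order as the expected output in the question: 16546 and 11996 joined with NZ82_gp6, then 11996 joined with AL82_gp5. Note that it only compares positions; if your real files contain more than one sequence, you would also have to compare the sequence names in column 1.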