shellawkcompareblast

How to compare data in different row AND column?


Hope you're well. I have found questions that come close to what I'm asking but none exactly. I've been struggling on this for the last 2 weeks and have finally managed to make a bit of progress enough that I feel I can justify asking on here!

I have downloaded a HitTable from a sequence I BLASTED

accno    percent     seqstart seqend        
AC020656|33 84.713  116580  116735  
AC020656|33 90.303  118279  118443  
AC020656|33 87.654  120390  121470  
AC020656|33 82.609  121323  121390      
AC123694|11 77.622  158333  158474  
AC123694|11 84.559  158238  160142  

What I am trying to do is find out is which of these hits for each entry (accession number) are overlapping and which are not so I can carry on with the rest of my pipeline. I'm aiming to put the IDs (or whole row, I'm not fussed at this point!) into separate files and then use them to extract the corresponding FASTA files so I can apply the more appropriate programme depending on if it's overlapping or not.

I have felt the best way to go about it is using awk to:

  1. CHECK that the IDs ($1) are the same (no use checking together AC020656|33 line 4 and AC123694|11 line 4!)
  2. COMPARE if the seq end ($4) is smaller than the seq start ($3) of the next row. If so, print it in a file called "nonover.txt", else print to "overlap.txt"

I began an attempt at this answer by modifying code found HERE:

awk '($1==c1 && $3==c3 && $4==c4){print line RS $0}{line=$0;c1=$1;c4<$3}' mydata.txt

But, no surprise, it doesn't work as I am clearly missing something. The OP in the answer I linked very kindly explained it in such a way that I felt confident tweaking it but well, thats my incompetence!

I've also tried to apply my problem using code found HERE and HERE which I've felt have come close to what I'm trying to do. I've also checked out the manual for awk and while it did help a bit (I tried using the getline function but just kept getting smacked with errors), I just honestly don't think I am well versed enough to tackle this issue at the minute.

And as pointed out by markp-fuso, my intended output for the above would ideally be two files with the following data inside:

noverlap.txt (as each rows seqend is smaller than the next rows seqstart, therefore it's not overlapped)
accno    percent     seqstart seqend        
AC020656|33 84.713  116580  116735  
AC020656|33 90.303  118279  118443  
AC020656|33 87.654  120390  121470

overlap.txt (as each rows seqend is larger than the next rows seqstart, and it is overlapping)
accno    percent     seqstart seqend
AC020656|33 87.654  120390  121470  
AC020656|33 82.609  121323  121390  

AC123694|11 77.622  158333  158474  
AC123694|11 84.559  158238  160142 

As pointed out by Ed Morton, how should acc.no be treated if some entries overlap but others don't - it's fine for them to be split up with some entries in both noverlap.txt and overlap.txt. I will be checking for any identical acc.no between the two folders, will treat the overlap first to then add to the nonoverlapping entries and will carry on from there. Duplicates are fine here (see AC020656|33 87.654 120390 121470 in both txt files), I know how I'm treating these, it's just so I can confirm the method to use on my real data.

TL;DR: Using grouping based on id (Acc. no), can I compare data from a column with data in a different column and the row below? Happy with either loop, script or one/two line answers suitable for a OS user

Thank you kindly in advance, any advice is very welcome and I appreciate you taking the time out to read/answer my question.

UPDATE: Thanks to the wonderful Ed Morton for his solution that has done the trick perfectly. I am just adding what I am doing to remove the single, duplicated entries found in the non-overlapping txt file (but found where they should be in overlap) which is modifying the code found at this answer HERE


Solution

  • This will produce the provided expected output from the provided sample input:

    $ cat tst.awk
    { sub(/\r$/,"") }
    NR == 1 { hdr = $0 }
    NR  > 2 { prt() }
    { prev = $0 }
    
    function prt(   over, noover, p, out) {
        over   = "overlap.txt"
        noover = "noover.txt"
    
        if ( !doneHdr++ ) {
            print hdr > over
            print hdr > noover
        }
    
        split(prev,p)
        if ( ($1 == p[1]) && ($3 <= p[4]) ) {
            print prev > over
            print $0   > over
            print ""   > over
        }
        else {
            print prev > noover
        }
    }
    

    $ awk -f tst.awk file
    $ head *over*
    ==> noover.txt <==
    accno    percent     seqstart seqend
    AC020656|33 84.713  116580  116735
    AC020656|33 90.303  118279  118443
    AC020656|33 82.609  121323  121390
    
    ==> overlap.txt <==
    accno    percent     seqstart seqend
    AC020656|33 87.654  120390  121470
    AC020656|33 82.609  121323  121390
    
    AC123694|11 77.622  158333  158474
    AC123694|11 84.559  158238  160142
    

    If that's not all you need then edit your question to provide more truly representative sample input/output that includes cases that the above doesn't work for.

    Note that the above requires at least 2 data lines in addition to the header to be present in the input. If there was only 1 data line it wouldn't be printed. If that's an issue add some logic to print the hdr and prev in an END section if NR is less than 3 or similar.