linuxbashawkbioinformatics

How to compare two tab-separated files and mark 'Fail' if col6 of file1 matches col2 of file2, else 'Pass'?


I want to compare two tab-separated files (with headers) and check whether column 6 of the first file (peptide.tsv) matches exactly with column 2 of the second file (proteome.tsv). If there is an exact match, print Fail, otherwise print Pass. Also col6 may not be unique. There can be identical sequences with different locus.

==\> test_peptide.tsv \<==
FusionID        Peptide HLA_Allele      EL-score        EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNOVEL*PEPTIDE

==\> test_proteome.tsv \<==
Name    AA_seq
sp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=2  MDCTWRILLLVAAATGTHAEVQLVQSGAEVKKPGATVKISCKVSGYTFTDYYMHWVQQAPGKGLEWMGLVDPEDGETIYAEKFQGRVTITADTSTDTAYMELSSLRSEDTAVYYCA
rp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=6  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPPT

What I am getting:

FusionIDPass    Peptide HLA_Allele      EL-score        EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRPassAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PPass1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNOVEL*PEPTIDE

Tried these

awk -F\\t 'NR==FNR {a\[$2\];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv

awk 'NR==FNR {proteome\[$2\]; next} {status = ($6 in proteome ? "Fail" : "Pass"); print $0 "\\t" status}' test_proteome.tsv test_peptide.tsv

expected_output.tsv

FusionID        Peptide HLA_Allele      EL-score        EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT      GSDGCWGLY       HLA-A01:01      0.79990000000000006     0.1002  MNOVEL\*PEPTIDE Pass

Solution

  • An example using a bash script in a didactic way, not compressed in a one liner awk command

    #!/bin/bash
    
    # Input files
    file1="test_peptide.tsv"
    file2="test_proteome.tsv"
    
    # Output file
    output_file="output.tsv"
    
    # Column numbers to compare
    column1=6
    column2=2
    
    #write the headers to the output file
    printf '%s\tResult\n' "$(head -n 1 $file1)" > $output_file
    
    #for row 2 to the number of lines in the file1
    for i in $(seq 2 "$(cat $file1 | wc -l)")
    do
        if [ "$(awk -F '\t' -v i="$i" -v col1="$column1" 'NR==i {print $col1}' $file1)" \
        == "$(awk -F '\t' -v i="$i" -v col2="$column2" 'NR==i {print $col2}' $file2)" ]
        then
            printf '%s\tFail\n' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
        else
            printf '%s\tPass\n' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
        fi
    done