I want to compare two tab-separated files (with headers) and check whether column 6 of the first file (peptide.tsv) matches exactly with column 2 of the second file (proteome.tsv). If there is an exact match, print Fail, otherwise print Pass. Also col6 may not be unique. There can be identical sequences with different locus.
==\> test_peptide.tsv \<==
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE
==\> test_proteome.tsv \<==
Name AA_seq
sp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=2 MDCTWRILLLVAAATGTHAEVQLVQSGAEVKKPGATVKISCKVSGYTFTDYYMHWVQQAPGKGLEWMGLVDPEDGETIYAEKFQGRVTITADTSTDTAYMELSSLRSEDTAVYYCA
rp|A0A0G2JMI3|HV692_HUMAN Immunoglobulin heavy variable 1-69-2 OS=Homo sapiens OX=9606 GN=IGHV1-69-2 PE=1 SV=6 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPPT
What I am getting:
FusionIDPass Peptide HLA_Allele EL-score EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRPassAPWKVMFVVWSLLAQETQ*SRETCRRPP
TUFT1--PPass1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL*PEPTIDE
Tried these
awk -F\\t 'NR==FNR {a\[$2\];next} {print $0 FS (($6) in a?"Fail":"Pass")}' test_proteome.tsv test_peptide.tsv
awk 'NR==FNR {proteome\[$2\]; next} {status = ($6 in proteome ? "Fail" : "Pass"); print $0 "\\t" status}' test_proteome.tsv test_peptide.tsv
expected_output.tsv
FusionID Peptide HLA_Allele EL-score EL_Rank AA_seq_full
TUFT1--PBXIP1|chr1:151540426:+|chr1:154951535:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNGTRNWCTLVDVHPEDQAAGRSSRLKALSLAAF*QRRLRSRAPWKVMFVVWSLLAQETQ*SRETCRRPP Fail
TUFT1--PBXIP1|chr1:151540426:+|chr1:154948366:-|FRAMESHIFT GSDGCWGLY HLA-A01:01 0.79990000000000006 0.1002 MNOVEL\*PEPTIDE Pass
An example using a bash script in a didactic way, not compressed in a one liner awk command
#!/bin/bash
# Input files
file1="test_peptide.tsv"
file2="test_proteome.tsv"
# Output file
output_file="output.tsv"
# Column numbers to compare
column1=6
column2=2
#write the headers to the output file
printf '%s\tResult\n' "$(head -n 1 $file1)" > $output_file
#for row 2 to the number of lines in the file1
for i in $(seq 2 "$(cat $file1 | wc -l)")
do
if [ "$(awk -F '\t' -v i="$i" -v col1="$column1" 'NR==i {print $col1}' $file1)" \
== "$(awk -F '\t' -v i="$i" -v col2="$column2" 'NR==i {print $col2}' $file2)" ]
then
printf '%s\tFail\n' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
else
printf '%s\tPass\n' "$(awk -v i="$i" 'NR==i {print $0}' $file1)" >> $output_file
fi
done