I have a large BLAST (outfmt 6) output file. I am interested in finding reciprocal homologs within this file but I want to exclude hits with multiple HSPs, eg.
Seq1 Seq2 (alignment 1: evalue bitscore etc)
Seq1 Seq2 (alignment 2: evalue bitscore etc)
Seq3 Seq4 (alignment 1: evalue bitscore etc)
Seq4 Seq5 (alignment 1: evalue bitscore etc)
Seq2 Seq1 (alignment 1: evalue bitscore etc)
Seq2 Seq1 (alignment 2: evalue bitscore etc)
Seq4 Seq3 (alignment 1: evalue bitscore etc)
In this instance only the alignments between sequences 3 and 4 would be returned as the alignments between 1 and 2 share multiple HSPs and the alignments between 4 and 5 only have a one-directional hit. I am hoping to do this in python so I can plug it in with the rest of my program.
Can anybody advise on any threads (etc) that might lead me in the right direction?
Thanks!
Python would be just fine for this.
from collections import defaultdict
hsp_count = defaultdict(int)
for line in open("file"):
seq1, seq2, _ = line.split(maxsplit=2)
hsp_count[seq1, seq2] += 1
already_checked = set()
for (seq1, seq2), val1 in hsp_count.items():
if (seq2, seq1) in already_checked:
continue
val2 = hsp_count.get((seq2, seq1))
if not val2:
continue
already_checked.add((seq1, seq2))
if val1 == 1 and val2 == 1:
print(seq1, seq2)