Is there a way I can get the position number of the mismatch from the following BLAT result using Python?
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348
As we can see, there are two mismatches in the above output. Can we get the position number of the mismatch/mutation using Python. This is how it appears in the source code also. So I'm a little confused on how to proceed. Thank you.
You can find the mismatches using the .find
method of a string. Mismatches are indicated by a space (' '), so we look for that in the middle line of the blat output. I don't know blat personally, so I'm not sure if the output always comes in triplet lines, but assuming it does, the following function will return a list of positions mismatching, each position represented as a tuple of the mismatching position in the top sequence, and the same in the bottom sequence.
blat_src = """00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348"""
def find_mismatch(blat):
#break the blat input into lines
lines = blat.split("\n")
#give some firendly names to the different lines
seq_a = lines[0]
seq_b = lines[2]
#We're not interested in the '<' and '>' so we strip them out with a slice
matchstr = lines[1][9:-9]
#Get the integer values of the starts of each sequence segment
pos_a = int(seq_a[:8])
pos_b = int(seq_b[:8])
results = []
#find the index of first space character, mmpos = mismatch position
mmpos = matchstr.find(" ")
#if a space exists (-1 if none found)
while mmpos != -1:
#the position of the mismatch is the start position of the
#sequence plus the index within the segment
results.append((posa+mmpos, posb+mmpos))
#search the rest of the string (from mmpos+1 onwards)
mmpos = matchstr.find(" ", mmpos+1)
return results
print find_mismatch(blat_src)
Which produces
[(28, 41629419), (29, 41629420)]
Telling us positions 28 and 29 (indexed according to the top sequence) or positions 41629419 and 41629420 (indexed according to the bottom sequence) are mismatched.