pythonblat

UCSC BLAT output python


Is there a way I can get the position number of the mismatch from the following BLAT result using Python?

00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< |||||||||||||||||||||||||||  |||||||||||||||| <<<<<<<<
41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348

As we can see, there are two mismatches in the above output. Can we get the position number of the mismatch/mutation using Python. This is how it appears in the source code also. So I'm a little confused on how to proceed. Thank you.


Solution

  • You can find the mismatches using the .find method of a string. Mismatches are indicated by a space (' '), so we look for that in the middle line of the blat output. I don't know blat personally, so I'm not sure if the output always comes in triplet lines, but assuming it does, the following function will return a list of positions mismatching, each position represented as a tuple of the mismatching position in the top sequence, and the same in the bottom sequence.

    blat_src = """00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
    <<<<<<<< |||||||||||||||||||||||||||  |||||||||||||||| <<<<<<<<
    41629392 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41629348"""
    
    def find_mismatch(blat):
        #break the blat input into lines
        lines = blat.split("\n")
    
        #give some firendly names to the different lines
        seq_a = lines[0]
        seq_b = lines[2]
    
        #We're not interested in the '<' and '>' so we strip them out with a slice
        matchstr = lines[1][9:-9]
    
        #Get the integer values of the starts of each sequence segment
        pos_a = int(seq_a[:8])
        pos_b = int(seq_b[:8])
    
        results = []
    
        #find the index of first space character, mmpos = mismatch position
        mmpos = matchstr.find(" ")
        #if a space exists (-1 if none found)
        while mmpos != -1:
            #the position of the mismatch is the start position of the
            #sequence plus the index within the segment
            results.append((posa+mmpos, posb+mmpos))
            #search the rest of the string (from mmpos+1 onwards)
            mmpos = matchstr.find(" ", mmpos+1)
        return results
    
    print find_mismatch(blat_src)
    

    Which produces

    [(28, 41629419), (29, 41629420)]
    

    Telling us positions 28 and 29 (indexed according to the top sequence) or positions 41629419 and 41629420 (indexed according to the bottom sequence) are mismatched.