pythonalgorithmdna-sequencesequence-alignment

Implementing Smith-Waterman algorithm for local alignment in python


I have created a sequence alignment tool to compare two strands of DNA (X and Y) to find the best alignment of substrings from X and Y. The algorithm is summarized here (https://en.wikipedia.org/wiki/Smith–Waterman_algorithm). I have been able to generate a lists of lists, filling them all with zeros, to represent my matrix. I created a scoring algorithm to return a numerical score for each kind of alignment between bases (eg. plus 4 for a match). Then I created an alignment algorithm that should put a score in each coordinate of my "matrix". However, when I go to print the matrix, it only returns the original with all zeros (rather than actual scores).

I know there are other methods of implementing this method (with numpy for example), so could you please tell me why this specific code (below) does not work? Is there a way to modify it, so that it does work?

code:

def zeros(X: int, Y: int):
    lenX = len(X) + 1
    lenY = len(Y) + 1
    matrix = []
    for i in range(lenX):
        matrix.append([0] * lenY)
    
    def score(X, Y):
        if X[n] == Y[m]: return 4
        if X[n] == '-' or Y[m] == '-': return -4
        else: return -2

    def SmithWaterman(X, Y, score):
        for n in range(1, len(X) + 1):
            for m in range(1, len(Y) + 1):
                align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
        matrix[n, m] = max(align, indelX, indelY, 0)
    print(matrix)

zeros("ACGT", "ACGT")

output:

[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

Solution

  • The reason it's just printing out the zeroed out matrix is that the SmithWaterman function is never called, so the matrix is never updated.

    You would need to do something like

    # ...
    SmithWaterman(X, Y, score)
    print(matrix)
    # ...
    

    However, If you do this, you will find that this code is actually quite broken in many other ways. I've gone through and annotated some of the syntax errors and other issues with the code:

    def zeros(X: int, Y: int):
        #        ^       ^  incorrect type annotations. should be str
        lenX = len(X) + 1
        lenY = len(Y) + 1
        matrix = []
        for i in range(lenX):
            matrix.append([0] * lenY)
        # A more "pythonic" way of expressing the above would be:
        # matrix = [[0] * len(Y) + 1 for _ in range(len(x) + 1)]
        
        def score(X, Y):
            #     ^  ^ shadowing variables from outer scope. this is not a bug per se but it's considered bad practice
            if X[n] == Y[m]: return 4
            #    ^       ^  variables not defined in scope
            if X[n] == '-' or Y[m] == '-': return -4
            #    ^              ^  variables not defined in scope
            else: return -2
    
        def SmithWaterman(X, Y, score): # this function is never called
            #                   ^ unnecessary function passed as parameter. function is defined in scope
            for n in range(1, len(X) + 1):
                for m in range(1, len(Y) + 1):
                    align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                    #                 ^ invalid list lookup. should be: matrix[n-1][m-1]
                    indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                    #                                          ^ out of bounds error when m == len(Y)
                    indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
                    #                                  ^ out of bounds error when n == len(X)
            matrix[n, m] = max(align, indelX, indelY, 0)
            # this should be nested in the inner for-loop. m, n, indelX, and indelY are not defined in scope here
        print(matrix)
    
    zeros("ACGT", "ACGT")