c++needleman-wunsch

Needleman algorithm not working when matrix values are the same


I am trying to solve the "Longest Common Subsequence" question using needleman.

Example: Input: text1 = "abcde", text2 = "ace" Output: 3
Explanation: The longest common subsequence is "ace" and its length is 3.

I am very confused on how the algorithm should work for the case where text1 ="ezu" and text2= "ubm".

The Needleman matrix is (assuming the cost of mismatch is -3, cost of gap is -4 and match is 1):

x - u b m
- 0 -4 -8 -12
e -4 -3 -7 -11
z -8 -7 -6 -10
u -12 -7 -10 -9

Now the algorithm states to traceback from the bottom corner of the matrix and:

Thus starting in -9, i have to choose between -10 and -10 (should i move up or down?) and regardless of the decision taken i will never hit the case [i,j] = [3,1]

My Code:

int longestCommonSubsequence(string text1, string text2) {
        
        int m = text1.length();
        int n = text2.length();
        
        vector<vector<int>> vec(m+1, vector(n+1,0));
        int mismatch  = -3;
        int gap = -4;
        int match = 1;
        
        for (int i = 1; i <= m; i++) {
            vec[i][0] = vec[i-1][0] + gap;
        }
        
        for (int i = 1; i <= n; i++) {
            vec[0][i] = vec[0][i-1] + gap;
        }
        
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = 0;
                if (text1[i-1] != text2[j-1]) {
                    cost = max(vec[i-1][j-1]+ mismatch, min(vec[i-1][j]+gap, vec[i][j-1]+gap));
                }  else {
                    cost = match  + vec[i-1][j-1];
                }
                vec[i][j] = cost;
            }
        }
        
        int matchLen = 0;
        
        int i = m;
        int j = n;
        
        while (i >0 && j >0) {
            
            if (text1[i-1] == text2[j-1]) {
                matchLen++;
                i--;
                j--;
            } else if (i > 0 && j > 0) {
                
                if (vec[i-1][j] > vec[i][j-1]) {
                    i--;
                } else {
                    
                    j--;
                }
            } else {
                break;
            }
            
        }
        
        return matchLen;
        
    }

Thanks!


Solution

  • Each cell of the Needle matrix is actually showing which action is considered best when aligning two sequence:

    based on penalty you assigned to mismatch(-3) and gap (-4) and the score you gave to match (1),your only way to get a match (length =1):

    e z u _ _
    
    _ _ u b m    
    

    will get: gap(-4) + gap(-4) + match(1) + gap(-4) + gap(-4) = -15

    and the algorithm will select the highest score alignment (if your code was right which is not right now because you didn't implement mismatch selection and already assigning scores considering it) which is a total mismatch:

    e z u
    
    u b m
    

    will get: mismatch(-3) + mismatch(-3) + mismatch(-3) = -9

    If you want to have that match to happen in the selected sequence you should consider balancing your score system.