javaalgorithmlcs

Longest Common Subsequence with adjacent elements in k-window


Given two Strings and a window size k: Find the longest common subsequence with a constraint. The constraint is: adjacent elements in the common subsequence are within a k-window

(The constraint ideally applies to both the input strings. But it's fine if the second string can be satisfied)

Eg: A = "carpani"

B = "blarpan sharlie paneaui"

and k = 3

output: arpan (not arpani).

Can someone tell me how to solve this problem? It would be great if someone can post the pseudocode.


Solution

  • This is a dynamic programming problem. They can be solved in two basic ways. One is to write a recursive function, then memoize. And the other is to build a data structure from the bottom up.

    The top down approach is usually easier to write. The bottom up approach is often more efficient. That is why it is good to learn both.

    I will demonstrate the top down approach in Python.

    Consider the following function:

    def best_k_match_ending_at(string1, string2, k, i, j):
        if string1[i] != string2[j]:
            return (0, None, None)
        else:
            best = (0, None, None)
            for i_old in range(max(i-k, 0), i):
                for j_old in range(max(j-k, 0), j):
                    this = best_k_match_ending_at(string1, string2, k, i_old, j_old)
                    best = max(best, this)
            return (best[0] + 1, (i, best[1]), (j, best[2]))
    
    def best_k_match(string1, string2, k):
        best = (0, None, None)
        for i in range(len(string1)):
            for j in range(len(string2)):
                best = max(best, best_k_match_ending_at(string1, string2, k, i, j))
        return best
    
    # prints (5, (5, (4, (3, (2, (1, None))))), (6, (5, (4, (3, (2, None)))))
    print(best_k_match('carpani', 'blarpan sharlie paneaui', 3))
    

    This is horribly inefficient. But correct. Now one step before memoizing it. I like to refactor to move the helper function inside the main one. The logic is the same, but when I memoize, it lets me know when I'm done with the data.

    def best_k_match(string1, string2, k):
    
        def best_ending_at(i, j):
            if string1[i] != string2[j]:
                return (0, None, None)
            else:
                best = (0, None, None)
                for i_old in range(max(i-k, 0), i):
                    for j_old in range(max(j-k, 0), j):
                        this = best_ending_at(i_old, j_old)
                        best = max(best, this)
                return (best[0] + 1, (i, best[1]), (j, best[2]))
    
        best = (0, None, None)
        for i in range(len(string1)):
            for j in range(len(string2)):
                best = max(best, best_ending_at(i, j))
        return best
    
    print(best_k_match('carpani', 'blarpan sharlie paneaui', 3))
    

    And now I memoize

    def best_k_match(string1, string2, k):
        memoized = {}
    
        def best_ending_at(i, j):
            if string1[i] != string2[j]:
                return (0, None, None)
            elif (i, j) not in memoized:
                best = (0, None, None)
                for i_old in range(max(i-k, 0), i):
                    for j_old in range(max(j-k, 0), j):
                        this = best_ending_at(i_old, j_old)
                        best = max(best, this)
                memoized[(i, j)] = (best[0] + 1, (i, best[1]), (j, best[2]))
            return memoized[(i, j)]
    
        best = (0, None, None)
        for i in range(len(string1)):
            for j in range(len(string2)):
                best = max(best, best_ending_at(i, j))
        return best
    
    print(best_k_match('carpani', 'blarpan sharlie paneaui', 3))
    

    This is now efficient, but you might not like the output very much. Because it is a linked list in reverse order. Here is a nicer output.

    def best_k_match(string1, string2, k):
        memoized = {}
    
        def best_ending_at(i, j):
            if string1[i] != string2[j]:
                return (0, None, None)
            elif (i, j) not in memoized:
                best = (0, None, None)
                for i_old in range(max(i-k, 0), i):
                    for j_old in range(max(j-k, 0), j):
                        this = best_ending_at(i_old, j_old)
                        best = max(best, this)
                memoized[(i, j)] = (best[0] + 1, (i, best[1]), (j, best[2]))
            return memoized[(i, j)]
    
        best = (0, None, None)
        for i in range(len(string1)):
            for j in range(len(string2)):
                best = max(best, best_ending_at(i, j))
    
        # Turn linked lists to something nicer.
        best_seq_rev = []
        best_match_rev = []
        best_link_1 = best[1]
        best_link_2 = best[2]
        while best_link_1 is not None:
            best_seq_rev.append(string1[best_link_1[0]])
            best_match_rev.append((best_link_1[0], best_link_2[0]))
            best_link_1 = best_link_1[1]
            best_link_2 = best_link_2[1]
        best_seq = "".join(reversed(best_seq_rev))
        best_match = list(reversed(best_match_rev))
        return (best[0], best_seq, best_match)
    
    # prints (5, 'arpan', [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)])
    print(best_k_match('carpani', 'blarpan sharlie paneaui', 3))
    

    If the strings are of length n and m, this will be O(n*m*k^2).