Tags: python, algorithm, diff, subsequence, string-algorithm

Can someone explain this "Longest Common Subsequence" algorithm?


The Longest Common Subsequence (LCS) problem is: given two sequences A and B, find the longest sequence that appears as a subsequence of both A and B. For example, given A = "peterparker" and B = "spiderman", the longest common subsequence is "pera".

Can someone explain this Longest Common Subsequence algorithm?

import bisect
import collections
from typing import List

def longestCommonSubsequence(A: List, B: List) -> int:
    # n = len(A)
    # m = len(B)

    # map each value in A to the list of positions where it occurs
    indeces_A = collections.defaultdict(list)

    # O(n)
    for i, a in enumerate(A):
        indeces_A[a].append(i)

    # O(n) total: put each position list into descending order
    for indeces_a in indeces_A.values():
        indeces_a.reverse()

    # O(m): for each element of B, append the (descending) positions where it occurs in A
    indeces_A_filtered = []
    for b in B:
        indeces_A_filtered.extend(indeces_A[b])

    # The length of indeces_A_filtered is at most n*m, but in practice it's more like O(m) or O(n) as far as I can tell.
    iAs = []
    # O(m log m) in practice as far as I can tell.
    for iA in indeces_A_filtered:
        j = bisect.bisect_left(iAs, iA)
        if j == len(iAs):
            iAs.append(iA)
        else:
            iAs[j] = iA
    return len(iAs)
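
For reference, calling it on the strings from the example above (this matches the walkthrough further down):

print(longestCommonSubsequence("peterparker", "spiderman"))  # -> 4, the length of "pera"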

The algorithm as written finds the length of the longest common subsequence, but can be modified to find the longest common subsequence outright.
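
For what it's worth, here is a rough sketch of one way it could be modified to return the subsequence itself, by remembering a predecessor for every element as it enters iAs (the usual reconstruction trick for longest increasing subsequence; the name longestCommonSubsequenceStr and the extra tails/prev arrays are my own, not part of the original solution):

import bisect
import collections

def longestCommonSubsequenceStr(A, B) -> str:
    indeces_A = collections.defaultdict(list)
    for i, a in enumerate(A):
        indeces_A[a].append(i)
    for indeces_a in indeces_A.values():
        indeces_a.reverse()

    indeces_A_filtered = []
    for b in B:
        indeces_A_filtered.extend(indeces_A[b])

    iAs = []    # iAs[k]   = smallest index of A that can end an increasing run of length k+1
    tails = []  # tails[k] = position in indeces_A_filtered of that run's last element
    prev = []   # prev[t]  = position of the element that precedes indeces_A_filtered[t] in its run
    for t, iA in enumerate(indeces_A_filtered):
        j = bisect.bisect_left(iAs, iA)
        prev.append(tails[j - 1] if j > 0 else -1)
        if j == len(iAs):
            iAs.append(iA)
            tails.append(t)
        else:
            iAs[j] = iA
            tails[j] = t

    # walk the predecessor links back from the tail of the longest run
    out = []
    t = tails[-1] if tails else -1
    while t != -1:
        out.append(A[indeces_A_filtered[t]])
        t = prev[t]
    return "".join(out[::-1])

On the example above this returns "pera".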

I found this algorithm while looking at the fastest Python solutions to an equivalent problem on LeetCode (link). It was the fastest Python solution (40 ms) for that problem, and it also seems to have O(m log m) time complexity, which is much better than the O(m*n) time complexity of most other solutions.
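
For comparison, here is a minimal sketch of the standard O(n*m) dynamic-programming solution that most submissions use (the name lcs_dp is just my own label):

def lcs_dp(A, B) -> int:
    n, m = len(A), len(B)
    # dp[i][j] = length of the LCS of A[:i] and B[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

# lcs_dp("peterparker", "spiderman") -> 4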

I do not fully understand why it works, and I have looked all over the known algorithms for the Longest Common Subsequence problem to find other mentions of it, but couldn't find anything like it. The closest thing I could find was the Hunt–Szymanski algorithm (link), which is also said to run in O(m log m) in practice, but it does not seem to be the same algorithm.

What I kind of understand:

  1. The indeces_a lists are reversed so that, in the iAs for loop, the smaller index is kept (this is more apparent in the walkthrough below; see also the small example after this list.)
  2. As far as I can tell, the iAs for loop finds the longest increasing subsequence of indeces_A_filtered.
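
On point 1, here is a tiny case (my own example) where skipping the reverse step would give the wrong answer; lis_length is just the iAs loop on its own:

import bisect

def lis_length(xs):
    # length of the longest increasing subsequence, same bisect trick as the iAs loop
    tails = []
    for x in xs:
        j = bisect.bisect_left(tails, x)
        if j == len(tails):
            tails.append(x)
        else:
            tails[j] = x
    return len(tails)

# A = "aa", B = "a": the LCS is "a", so the answer should be 1. indeces_A = {'a': [0, 1]}.
print(lis_length([0, 1]))  # 2 -- without the reverse, both copies of 'a' in A get
                           #      matched against the single 'a' in B, which is invalid
print(lis_length([1, 0]))  # 1 -- with the reverse, at most one index per character of B
                           #      can appear in an increasing run, which is what we want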

Thanks!


Here's a walkthrough of the algorithm for the example A = "peterparker" and B = "spiderman":

     01234567890
A = "peterparker"
B = "spiderman"

indeces_A = {'p':[0,5], 'e':[1,3,9], 't':[2], 'r':[4,7,10], 'a':[6], 'k':[8]}

# after reverse
indeces_A = {'p':[5,0], 'e':[9,3,1], 't':[2], 'r':[10,7,4], 'a':[6], 'k':[8]}

#                     -p-  --e--  ---r--  a
indeces_A_filtered = [5,0, 9,3,1, 10,7,4, 6]

# the `iAs` loop

iA = 5
j = 0
iAs = [5]

iA = 0
j = 0
iAs = [0]

iA = 9
j = 1
iAs = [0,9]

iA = 3
j = 1
iAs = [0,3]

iA = 1
j = 1
iAs = [0,1]

iA = 10
j = 2
iAs = [0,1,10]

iA = 7
j = 2
iAs = [0,1,7]

iA = 4
j = 2
iAs = [0,1,4]

iA = 6
j = 3
iAs = [0,1,4,6] # corresponds to indices of A that spell out "pera", the LCS

return len(iAs) # 4, the length of the LCS

Solution

  • The missing bit here is "patience sorting", whose connection to the longest increasing subsequence (LIS) problem is a bit subtle but well known. The final loop in the code is a bare-bones implementation of patience sorting with "the greedy strategy". It does not, in general, compute a LIS directly, but rather the length of a LIS (a tiny example of this is sketched after the reference below).

    An easy-enough correctness proof, which includes a sketch of what's needed to reliably compute a LIS too (not just its length), can be found as Lemma 1 early in

    "Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem" David Aldous and Persi Diaconis