The Longest Common Subsequence (LCS) problem is: given two sequences A and B, find the longest subsequence that is found both in A and in B. For example, given A = "peterparker" and B = "spiderman", the longest common subsequence is "pera".
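(A subsequence keeps the relative order of characters but does not have to be contiguous, so "pera" really does appear in both strings. A quick way to check, with a helper name that is just mine, for illustration:

def is_subsequence(s, t):
    # True if s can be obtained from t by deleting characters (order kept).
    it = iter(t)
    return all(c in it for c in s)

print(is_subsequence("pera", "peterparker"))  # True
print(is_subsequence("pera", "spiderman"))    # True
)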
Can someone explain this Longest Common Subsequence algorithm?
from typing import List
import bisect
import collections

def longestCommonSubsequence(A: List, B: List) -> int:
    # n = len(A)
    # m = len(B)
    indeces_A = collections.defaultdict(list)
    # O(n)
    for i, a in enumerate(A):
        indeces_A[a].append(i)
    # O(n)
    for indeces_a in indeces_A.values():
        indeces_a.reverse()
    # O(m)
    indeces_A_filtered = []
    for b in B:
        indeces_A_filtered.extend(indeces_A[b])
    # The length of indeces_A_filtered is at most n*m, but in practice it's
    # more like O(m) or O(n) as far as I can tell.
    iAs = []
    # O(m log m) in practice as far as I can tell.
    for iA in indeces_A_filtered:
        j = bisect.bisect_left(iAs, iA)
        if j == len(iAs):
            iAs.append(iA)
        else:
            iAs[j] = iA
    return len(iAs)
The algorithm as written finds the length of the longest common subsequence, but it can be modified to find the longest common subsequence outright.
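For example, here is one way it could be modified (just a sketch I put together, not the code from the linked solution): for every value placed into iAs, remember which value sat one slot to its left at that moment, then walk those back-pointers from the last slot and map the recovered indices of A back to characters.

from typing import List
import bisect
import collections

def longestCommonSubsequenceString(A: List, B: List) -> str:
    # Same reduction as above, plus predecessor bookkeeping so the actual
    # common subsequence can be reconstructed at the end.
    indeces_A = collections.defaultdict(list)
    for i, a in enumerate(A):
        indeces_A[a].append(i)
    for indeces_a in indeces_A.values():
        indeces_a.reverse()
    indeces_A_filtered = []
    for b in B:
        indeces_A_filtered.extend(indeces_A[b])

    iAs = []                                 # iAs[j] = smallest A-index that can end a run of length j+1
    pos = []                                 # pos[j] = position in indeces_A_filtered of iAs[j]
    prev = [None] * len(indeces_A_filtered)  # prev[k] = position of the element before k in its run
    for k, iA in enumerate(indeces_A_filtered):
        j = bisect.bisect_left(iAs, iA)
        prev[k] = pos[j - 1] if j > 0 else None
        if j == len(iAs):
            iAs.append(iA)
            pos.append(k)
        else:
            iAs[j] = iA
            pos[j] = k
    if not iAs:
        return ""
    # Walk the predecessor chain back from the last slot and read off characters of A.
    out = []
    k = pos[-1]
    while k is not None:
        out.append(A[indeces_A_filtered[k]])
        k = prev[k]
    return "".join(reversed(out))

print(longestCommonSubsequenceString("peterparker", "spiderman"))  # pera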
I found this algorithm as I was looking at the fastest Python solutions to an equivalent problem on leetcode (link). It was the fastest Python solution (40 ms) for that problem, and it also seems to have O(m log m) time complexity, which is much better than the O(m*n) time complexity of most other solutions.
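For comparison, the usual O(n*m) dynamic-programming solution that this is being measured against looks roughly like this (a standard textbook-style sketch, not taken from that leetcode thread):

def lcs_dp(A: str, B: str) -> int:
    # dp[i][j] = length of the LCS of A[:i] and B[:j]
    n, m = len(A), len(B)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs_dp("peterparker", "spiderman"))  # 4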
I do not fully understand why it works. I tried looking all over for known algorithms for the Longest Common Subsequence problem to find other mentions of it, but couldn't find anything like it. The closest thing I could find was the Hunt–Szymanski algorithm (link), which is also said to run in O(m log m) time in practice, but it does not seem to be the same algorithm.
What I kind of understand:
- indeces_a are reversed so that in the iAs for loop, the smaller index is kept (this is more apparent when doing the walkthrough below; see also the small example right after this list).
- the iAs for loop finds the longest increasing subsequence of indeces_A_filtered.

Thanks!
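The small example mentioned above, showing why the per-character index lists need to be reversed (a toy case of my own, not part of the walkthrough): take A = "aa" and B = "a", whose LCS is just "a".

# Indices of 'a' in A = "aa" are [0, 1].
# Without the reverse step, indeces_A_filtered would be [0, 1] and the greedy
# loop would grow iAs to [0, 1], reporting length 2 -- the single 'a' in B
# would effectively be matched against two characters of A, which is wrong.
# With the reverse step, indeces_A_filtered is [1, 0]:
#   iA = 1 -> iAs = [1]
#   iA = 0 -> iAs = [0]
# so the reported length is 1, as it should be.
print(longestCommonSubsequence("aa", "a"))  # 1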
Here's a walkthrough of the algorithm for the example A = "peterparker" and B = "spiderman":
#    01234567890
A = "peterparker"
B = "spiderman"
indeces_A = {'p':[0,5], 'e':[1,3,9], 't':[2], 'r':[4,7,10], 'a':[6], 'k':[8]}
# after reverse
indeces_A = {'p':[5,0], 'e':[9,3,1], 't':[2], 'r':[10,7,4], 'a':[6], 'k':[8]}
# -p- --e-- ---r-- a
indeces_A_filtered = [5,0, 9,3,1, 10,7,4, 6]
# the `iAs` loop
iA = 5
j = 0
iAs = [5]

iA = 0
j = 0
iAs = [0]

iA = 9
j = 1
iAs = [0,9]

iA = 3
j = 1
iAs = [0,3]

iA = 1
j = 1
iAs = [0,1]

iA = 10
j = 2
iAs = [0,1,10]

iA = 7
j = 2
iAs = [0,1,7]

iA = 4
j = 2
iAs = [0,1,4]

iA = 6
j = 3
iAs = [0,1,4,6] # corresponds to indices of A that spell out "pera", the LCS

return len(iAs) # 4, the length of the LCS
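A quick end-to-end check that the function really returns this (assuming the imports from the code above):

assert longestCommonSubsequence("peterparker", "spiderman") == 4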
The missing bit here is "patience sorting", whose connection to longest increasing subsequence (LIS) is a bit subtle but well known. The final loop in the code is a bare bones implementation of patience sorting with "the greedy strategy". It does not, in general, compute a LIS directly, but rather the length of a LIS.
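A small input makes this concrete (the helper name here is mine): run the same greedy loop, in isolation, on [5, 6, 1].

import bisect

def lis_length_greedy(xs):
    # The same greedy/patience loop as in the question's final for loop.
    tails = []
    for x in xs:
        j = bisect.bisect_left(tails, x)
        if j == len(tails):
            tails.append(x)
        else:
            tails[j] = x
    return len(tails), tails

print(lis_length_greedy([5, 6, 1]))
# (2, [1, 6]): the length 2 is right (e.g. [5, 6]), but [1, 6] itself is not
# an increasing subsequence of [5, 6, 1] in order -- the 1 comes after the 6.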
An easy-enough correctness proof, which includes a sketch of what's needed to reliably compute a LIS too (not just its length), can be found as Lemma 1 early in "Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem" by David Aldous and Persi Diaconis.