I'm looking for an algorithm that addresses the LCS problem for two strings with the following conditions:
Each string consists of English characters and each character has a weight. For example:
sequence 1 (S1): "ABBCD" with weights [1, 2, 4, 1, 3]
sequence 2 (S2): "TBDC" with weights [7, 5, 1, 2]
Suppose that MW(s, S)
is defined as the maximum weight of the sub-sequence s in string S
with respect to the associated weights. The heaviest common sub-sequence (HCS) is defined as:
HCS = argmin(MW(s, S1), MW(s, S2))
The algorithm output should be the indexes of HCS in both strings and the weight. In this case, the indexes will be:
I_S1 = [2, 4] --> MW("BD", "ABBCD") = 7
I_S2 = [1, 2] --> MW("BD", "TBDC") = 6
Therefore HCS = "BD", and weight = min(MW(s, S1), MW(s, S2)) = 6.
The table that you need to build will have this.
for each position in sequence 1
for each position in sequence 2
for each extreme pair of (weight1, weight2)
(last_position1, last_position2)
Where an extreme pair is one where it is not possible to find a subsequence to that point whose weights in sequence 1 and weights in sequence 2 are both >= and at least one is >.
There may be multiple extreme pairs, where one sequence is higher than the other.
The rule is that at the (i, -1)
or (-1, j)
positions, the only extreme pair is the empty set with weight 0. At any other we merge the extreme pairs for (i-1, j)
and (i, j-1)
. And then if seq1[i] = seq2[j]
, then add the options where you went to (i-1, j-1)
and then included the i
and j
in the respective subsequences. (So add weight1[i]
and weight2[j]
to the weights then do a merge.)
For that merge you can sort by weight1 ascending, all of the extreme values for both previous points, then throw away all of the ones whose weight2 is less than or equal to the best weight2 that was already posted earlier in the sequence.
When you reach the end you can find the extreme pair with the highest min, and that is your answer. You can then walk the data structure back to find the subsequences in question.