You are given a string and can change at most Q letters in it. You are also given a list of substrings (each two characters long), each with a corresponding score. Each occurrence of a substring within the string adds its score to your total. What is the maximum attainable score?
String length <= 150, Q <= 100, Number of Substrings <= 700
Example:
String = bpdcg
Q = 2
Substrings:
bz - score: 2
zd - score: 5
dm - score: 7
ng - score: 10
In this example, you can achieve the maximum score by changing the "p" in the string to a "z" and the "c" to an "n". Thus, your new string is "bzdng", which has a score of 2+5+10 = 17.
I know that given a string which already has the letters changed, the score can be checked in linear time using a dictionary matching algorithm such as Aho-Corasick (or, with a slightly worse complexity, Rabin-Karp). However, trying every combination of up to Q substitutions and then scoring each resulting string will take far too long.
Another possible method I thought of was to work backwards: construct the ideal string from the given substrings and then check whether it differs from the original string by at most Q characters (two in the example). However, I am not sure how to do this, and even if it could be done, I think that it would also take too long.
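Since every scoring substring here is exactly two characters long, the linear-time check does not even need Aho-Corasick: a dictionary lookup on each adjacent pair is enough. A minimal sketch, where `score_string` is just a name picked for illustration and `scores` is assumed to map each two-letter substring to its score:

```python
def score_string(s, scores):
    """Sum the scores of all adjacent letter pairs of s that appear in scores."""
    # s[i:i + 2] is the pair starting at position i; unknown pairs score 0.
    return sum(scores.get(s[i:i + 2], 0) for i in range(len(s) - 1))


example_scores = {'bz': 2, 'zd': 5, 'dm': 7, 'ng': 10}
print(score_string('bzdng', example_scores))
```

Overlapping occurrences are counted separately, which matches "each occurrence adds to your total score".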
What is the best way to go about this?
An efficient way to solve this is to use dynamic programming.
Let L be the set of letters that start any of the length-2 scoring substrings, together with a special letter "*" that stands for any letter not in this set.
Let S(i, j, c) be the maximum score possible in the string (up to index i) using at most j substitutions, where the string ends with character c (where c is in L).
The recurrence relations are a bit messy (or at least, I didn't find a particularly beautiful formulation of them), but here's some code that computes the largest score possible:
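    infinity = 100000000

    def S1(L1, L2, s, i, j, c, scores, cache):
        # Best score for s[0..i] with a budget of j substitutions, where
        # position i holds c ('*' = a letter that starts no scoring substring).
        key = (i, j, c)
        if key not in cache:
            if i == 0:
                # Base case: a single letter scores 0 if the budget covers it.
                if c != '*' and s[0] != c:
                    v = 0 if j >= 1 else -infinity
                else:
                    v = 0 if j >= 0 else -infinity
            else:
                v = -infinity
                # When c is '*', try the original letter and every letter
                # that ends a scoring substring.
                choices = [c] if c != '*' else L2 + s[i]
                for d in L1:  # d is the letter at position i-1
                    for c2 in choices:
                        jdiff = 1 if s[i] != c2 else 0
                        score = S1(L1, L2, s, i-1, j-jdiff, d, scores, cache)
                        score += scores.get(d + c2, 0)
                        v = max(v, score)
            cache[key] = v
        return cache[key]

    def S(s, Q, scores):
        # L1: letters that start a scoring substring, plus the wildcard '*'.
        L1 = ''.join(sorted(set(w[0] for w in scores))) + '*'
        # L2: letters that end a scoring substring.
        L2 = ''.join(sorted(set(w[1] for w in scores)))
        # A sentinel '.' is appended so the last real letter is handled like any other.
        return S1(L1, L2, s + '.', len(s), Q, '.', scores, {})

    print(S('bpdcg', 2, {'bz': 2, 'zd': 5, 'dm': 7, 'ng': 10}))  # prints 17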
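For small inputs, a result like this can be cross-checked against an exhaustive search. The sketch below is not part of the dynamic programme; `brute_force` is a hypothetical helper, and it assumes all scores are non-negative, so that replacing a letter with one that appears in no scoring substring never helps.

```python
from itertools import combinations, product


def brute_force(s, q, scores):
    """Try every way of changing at most q positions (tiny inputs only)."""
    # Only letters that occur in some scoring substring are worth substituting in
    # (assumption: scores are non-negative).
    letters = sorted({ch for pair in scores for ch in pair})

    def score(t):
        return sum(scores.get(t[i:i + 2], 0) for i in range(len(t) - 1))

    best = score(s)  # zero changes
    for k in range(1, min(q, len(s)) + 1):
        for positions in combinations(range(len(s)), k):
            for replacement in product(letters, repeat=k):
                t = list(s)
                for p, ch in zip(positions, replacement):
                    t[p] = ch
                best = max(best, score(''.join(t)))
    return best


print(brute_force('bpdcg', 2, {'bz': 2, 'zd': 5, 'dm': 7, 'ng': 10}))
```

The running time grows exponentially with Q, so this is only useful as a test oracle on short strings.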
There's some room for optimisation:
Overall, if there are k different letters in the scoring words and N is the length of the string, the algorithm runs in time O(QN*k^2). With the second optimisation above, this can be reduced to O(QNw), where w is the number of scoring words.
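One way to reach a bound of roughly O(QNw) is to go bottom-up and, for each candidate letter, look only at the scoring words that end in it instead of at every possible previous letter. The following is a sketch of that idea rather than the code above rewritten: the names `max_score`, `ends_with` and `OTHER` are made up for illustration, and it assumes every score is non-negative (so a non-scoring pair can be treated as "best previous state plus 0").

```python
def max_score(s, q, scores):
    """Bottom-up DP over positions; assumes all scores are non-negative."""
    if not s:
        return 0
    NEG = float('-inf')
    OTHER = None  # the original letter, when it appears in no scoring word
    letters = sorted({ch for pair in scores for ch in pair})
    # ends_with[c] lists (d, value) for every scoring word "dc".
    ends_with = {}
    for (d, c), value in scores.items():
        ends_with.setdefault(c, []).append((d, value))

    def candidates(i):
        # Letters worth placing at position i, each with its substitution cost.
        cand = [(c, 0 if s[i] == c else 1) for c in letters]
        if s[i] not in letters:
            cand.append((OTHER, 0))  # keep the original, non-scoring letter
        return cand

    # dp[j][c]: best score so far using exactly j changes, last letter c.
    dp = [dict() for _ in range(q + 1)]
    for c, cost in candidates(0):
        if cost <= q:
            dp[cost][c] = 0
    for i in range(1, len(s)):
        # Best score per change count regardless of the last letter.
        best_any = [max(row.values(), default=NEG) for row in dp]
        new = [dict() for _ in range(q + 1)]
        for c, cost in candidates(i):
            for j in range(cost, q + 1):
                prev = dp[j - cost]
                best = best_any[j - cost]  # previous letter forms no scoring word with c
                for d, value in ends_with.get(c, ()):
                    if d in prev:
                        best = max(best, prev[d] + value)
                if best > NEG:
                    new[j][c] = best
        dp = new
    return max(max(row.values(), default=NEG) for row in dp)


print(max_score('bpdcg', 2, {'bz': 2, 'zd': 5, 'dm': 7, 'ng': 10}))
```

Per position and per change count, the work is one pass over the candidate letters plus one pass over the scoring words, which is where the factor w comes from (the number of distinct letters is at most 2w).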