stringalgorithmsearchstring-hashing

Changing letters of a string to obtain maximum score


You are given a string and can change at most Q letters in the string. You are also given a list of substrings (each two characters long), with a corresponding score. Each occurance of the substring within the string adds to your total score. What is the maximum possible attainable score?

String length <= 150, Q <= 100, Number of Substrings <= 700


Example:

String = bpdcg

Q = 2

Substrings:

bz - score: 2

zd - score: 5

dm - score: 7

ng - score: 10

In this example, you can achieve the maximum score b changing the "p" in the string to a "z" and the "c" to an "n". Thus, your new string is "bzdng" which has a score of 2+5+10 = 17.

I know that given a string which already has the letters changed, the score can be checked in linear time using a dictionary matching algorithm such as aho-corasick (or with a slightly worse complexity, Rabin Karp). However, trying each two letter substitution will take too long and then checking will take too long.

Another possible method I thought was to work backwards, to construct the ideal string from the given substrings and then check whether it differs by at most two characters from the original string. However, I am not sure how to do this, and even if it could be done, I think that it would also take too long.

What is the best way to go about this?


Solution

  • An efficient way to solve this is to use dynamic programming.

    Let L be the set of letters that start any of the length-2 scoring substrings, and a special letter "*" which stands for any other letter than these.

    Let S(i, j, c) be the maximum score possible in the string (up to index i) using j substitutions, where the string ends with character c (where c in L).

    The recurrence relations are a bit messy (or at least, I didn't find a particularly beautiful formulation of them), but here's some code that computes the largest score possible:

    infinity = 100000000
    
    def S1(L1, L2, s, i, j, c, scores, cache):
        key = (i, j, c)
        if key not in cache:
            if i == 0:
                if c != '*' and s[0] != c:
                    v = 0 if j >= 1 else -infinity
                else:
                    v = 0 if j >= 0 else -infinity
            else:
                v = -infinity
                for d in L1:
                    for c2 in [c] if c != '*' else L2 + s[i]:
                        jdiff = 1 if s[i] != c2 else 0
                        score = S1(L1, L2, s, i-1, j-jdiff, d, scores, cache)
                        score += scores.get(d+c2 , 0)
                        v = max(v, score)
            cache[key] = v
        return cache[key]
    
    def S(s, Q, scores):
        L1 = ''.join(sorted(set(w[0] for w in scores))) + '*'
        L2 = ''.join(sorted(set(w[1] for w in scores)))
        return S1(L1, L2, s + '.', len(s), Q, '.', scores, {})
    
    print S('bpdcg', 2, {'bz': 2, 'zd': 5, 'dm': 7, 'ng': 10})
    

    There's some room for optimisation:

    Overall, if there's k different letters in the scoring words, the algorithm runs in time O(QN*k^2). With the second optimisation above, this can be reduced to O(QNw) where w is the number of scoring words.