python string merge string-comparison sequence-alignment

How to merge strings with overlapping characters in python?

I'm working on a python project which reads in an URL encoded overlapping list of strings. Each string is 15 characters long and overlaps with its sequential string by at least 3 characters and at most 15 characters (identical).

The goal of the program is to go from a list of overlapping strings - either ordered or unordered - to a compressed URL encoded string.

My current method fails at duplicate segments in the overlapping strings. For example, my program is incorrectly combining:

StrList1 = [ 'd+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']

to output:

output = ['ublic+class+HelloWorld+%7B%0A++++public+', '%2F%2F+Sample+program%0Apublic+static+v`]

when correct output is:

output = ['%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v']

I am using simple python, not biopython or sequence aligners, though perhaps I should be?

Would greatly appreciate any advice on the matter or suggestions of a nice way to do this in python!

Thanks!

Solution

You can start with one of the strings in the list (stored as string), and for each of the remaining strings in the list (stored as candidate) where:

candidate is part of string,
candidate contains string,
candidate's tail matches the head of string,
or, candidate's head matches the tail of string,

assemble the two strings according to how they overlap, and then recursively repeat the procedure with the overlapping string removed from the remaining strings and the assembled string appended, until there is only one string left in the list, at which point it is a valid fully assembled string that can be added to the final output.

Since there can potentially be multiple ways several strings can overlap with each other, some of which can result in the same assembled strings, you should make output a set of strings instead:

def assemble(str_list, min=3, max=15):
    if len(str_list) < 2:
        return set(str_list)
    output = set()
    string = str_list.pop()
    for i, candidate in enumerate(str_list):
        matches = set()
        if candidate in string:
            matches.add(string)
        elif string in candidate:
            matches.add(candidate)
        for n in range(min, max + 1):
            if candidate[:n] == string[-n:]:
                matches.add(string + candidate[n:])
            if candidate[-n:] == string[:n]:
                matches.add(candidate[:-n] + string)
        for match in matches:
            output.update(assemble(str_list[:i] + str_list[i + 1:] + [match]))
    return output

so that with your sample input:

StrList1 = ['d+%7B%0A++++public+', 'public+static+v','program%0Apublic+', 'ublic+class+Hel', 'lass+HelloWorld', 'elloWorld+%7B%0A+++', '%2F%2F+Sample+progr', 'program%0Apublic+']

assemble(StrList1) would return:

{'%2F%2F+Sample+program%0Apublic+class+HelloWorld+%7B%0A++++public+static+v'}

or as an example of an input with various overlapping possibilities (that the second string can match the first by being inside, having tail matching the head, and having head matching the tail):

assemble(['abcggggabcgggg', 'ggggabc'])

would return:

{'abcggggabcgggg', 'abcggggabcggggabc', 'abcggggabcgggggabc', 'ggggabcggggabcgggg'}