I'm trying to find the longest common substring of a set of DNA strands, but for testing I'm using strings of digits.
Here is the function I wrote:
    def lcsm(string_list):
        strands = sorted(string_list, key = len, reverse = False)
        longest_motif = ''
        seq1 = strands[0]
        seq2 = strands[1]
        motif = ''
        for ind1 in range(len(seq1)): # iterate over the shortest string
            i = 0
            for ind2 in range(len(seq2)):
                if ind1+i < len(seq1) and seq1[ind1+i] == seq2[ind2]:
                    motif += seq1[ind1+i]
                    i += 1
                    if len(motif) >= len(longest_motif) and all(motif in x for x in strands):
                        longest_motif = motif
                else:
                    motif = ''
                    i = 0
        return longest_motif

    print('right: ', lcsm(['123456789034357890',
                           '123456789034357890890357890',
                           '4612345678901234567890343578904654734357890',
                           '12356734121234567890343578903456789035789012345']))
    print('wrong: ', lcsm(['123456789034357890',
                           '123123456789034357890890357890',
                           '4612345678901234567890343578904654734357890',
                           '12356734121234567890343578903456789035789012345']))
My input is a list of strings, and the output should be the longest common substring. In this case the result should be '123456789034357890'.
My problem is that when the expected sequence is preceded, in one of the strings, by the same digits it begins with, the first digit of the right answer is skipped.
The first print shows the right answer and the second one shows the mistake I've described.
Pay attention to the second string in the list in the 'wrong' print statement.
As you can see below, the first digit '1' is missing.
right: 123456789034357890
wrong: 23456789034357890
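The same symptom shows up on a much smaller input, which may make it easier to trace. The sketch below reuses the function from the question with its logic unchanged; the name lcsm_original and the two-string input are assumptions made only for this example.

```python
def lcsm_original(string_list):
    # The function from the question, unchanged apart from its name.
    strands = sorted(string_list, key=len, reverse=False)
    longest_motif = ''
    seq1 = strands[0]
    seq2 = strands[1]
    motif = ''
    for ind1 in range(len(seq1)):
        i = 0
        for ind2 in range(len(seq2)):
            if ind1 + i < len(seq1) and seq1[ind1 + i] == seq2[ind2]:
                motif += seq1[ind1 + i]
                i += 1
                if len(motif) >= len(longest_motif) and all(motif in x for x in strands):
                    longest_motif = motif
            else:
                # The mismatching seq2[ind2] is dropped here and is never
                # retried as the first character of a new match.
                motif = ''
                i = 0
    return longest_motif


small_result = lcsm_original(['12', '112'])
print(small_result)  # prints 2, although '12' occurs in both strings
```

With '112', the first '1' matches, the second '1' fails against '2' and is thrown away, so a match can no longer start at that second '1'.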
The problem is in how the inner loop handles a mismatch. When a partial match breaks, you reset motif and i, but the character of seq2 that caused the mismatch is never compared against the start of the motif again. In your second test, '123' matches, then the second '1' in '123123...' fails against '4' and is skipped, so the full match can only begin at '2'. In addition, the >= comparison lets a candidate of equal length overwrite the previous longest common substring.
Instead, you should only update the longest_motif variable if the current motif substring is longer than the previous longest common substring and is present in all the strands. A simpler way to do that is to slice candidates directly from the shortest strand, like this:
    def lcsm(string_list):
        strands = sorted(string_list, key=len)
        shortest_strand = strands[0]
        longest_motif = ''
        # Every common substring must be a substring of the shortest strand.
        for i in range(len(shortest_strand)):
            # Only try candidates longer than the best one found so far.
            for j in range(i + len(longest_motif) + 1, len(shortest_strand) + 1):
                motif = shortest_strand[i:j]
                if all(motif in strand for strand in strands[1:]):
                    longest_motif = motif
        return longest_motif

    print('right: ', lcsm(['123456789034357890',
                           '123456789034357890890357890',
                           '4612345678901234567890343578904654734357890',
                           '12356734121234567890343578903456789035789012345']))
    print('right: ', lcsm(['123456789034357890',
                           '123123456789034357890890357890',
                           '4612345678901234567890343578904654734357890',
                           '12356734121234567890343578903456789035789012345']))
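One possible refinement: if a substring starting at position i is missing from some strand, every longer substring starting at i is missing too, so the inner loop can stop at the first failure. The sketch below assumes that change; the name lcsm_early_exit and the short DNA strands are made up for this example, and the empty-list guard is an added assumption about the input.

```python
def lcsm_early_exit(string_list):
    # Hypothetical variant of lcsm: same result, but it stops extending a
    # candidate as soon as that candidate is missing from any strand.
    if not string_list:
        return ''
    strands = sorted(string_list, key=len)
    shortest, others = strands[0], strands[1:]
    longest_motif = ''
    for i in range(len(shortest)):
        # Only candidates longer than the current best are worth testing.
        for j in range(i + len(longest_motif) + 1, len(shortest) + 1):
            motif = shortest[i:j]
            if all(motif in strand for strand in others):
                longest_motif = motif
            else:
                # Longer substrings starting at i cannot be common either.
                break
    return longest_motif


dna_result = lcsm_early_exit(['GATTACA', 'TAGACCA', 'ATACA'])
print(dna_result)  # prints TA
```

When several common substrings share the maximum length ('TA', 'AC' and 'CA' here), this returns the one that starts earliest in the shortest strand.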