I wrote a solution to this challenge based on this answer. It successfully handles the example case given, but not the actual case.
Given two strings
s
andt
,t
is a substring ofs
ift
is contained as a contiguous collection of symbols ins
(as a result,t
must be no longer thans
).The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position
i
ofs
is denoted bys[i]
.A substring of
s
can be represented ass[j:k]
, wherej
andk
represent the starting and ending positions of the substring ins
; for example, ifs = "AUGCUUCAGAAAGGUCUUACG"
, thens[2:5] = "UGCU"
.The location of a substring
s[j:k]
is its beginning positionj
; note thatt
will have multiple locations ins
if it occurs more than once as a substring ofs
(see the Sample below).
Two DNA strings
s
andt
(each of length at most 1 kbp).
All locations of
t
as a substring ofs
.
GATATATGCATATACTT
ATAT
2 4 10
For the sample, it works. Sure, you need to trim the formatting by hand, but that's only a couple of seconds of work.
The actual data and my generated output are not accepted.
CAAATAGTCACACAATAGTCGGCTAAATAGTCAATAGTCAAATAGTCAGAGCTAATAGTCTAAATAGTCGAAAAATAGTCATCAATAGTCTAAATAGTCAATAGTCGGAATAGTCAAAATAGTCAATAGTCAATAGTCAATAGTCGACTAAATAGTCCCAATAGTCTCAGAAATAGTCAATAGTCGTAATAGTCAATAGTCTAATAGTCTAATAGTCCAATAGTCTGTCAAATAGTCAATAGTCCAATAGTCGTTTAATAGTCCCCTTTACCAATAGTCAATAGTCCGAATAGTCAGGAATAGTCAGCACTAATAGTCAATAGTCCTAATAGTCCCAATAGTCAAAATAGTCAATAGTCTAAATAATAGTCCTAGCAGAAGAATAGTCTAATAGTCGGCAATAGTCAATAGTCAAATAGTCAGAATAGTCAAATAGTCGAAATAGTCAATAGTCAATAGTCAAATAGTCAAATAGTCAATAGTCAAATAGTCAAATAGTCAAATAGTCGAATAGTCTGTAATAGTCAATAGTCCTTCAATAGTCTAATAGTCATTCAATAGTCAAGAAATAGTCGGGGGAATAGTCCGAATAGTCAAATAGTCAATAGTCGAATAGTCTAATAGTCAATAGTCTAATAGTCTGATAATAGTCAAATAGTCAATAGTCTAAATAGTCGCCTATGCCAATAGTCTTATCAAATAGTCTCTTAATAGTCTAATAGTCAATAGTCAATAGTCTAATAGTCATAATAGTCAATAGTCAAGGAATAGTCCCATAATAGTCAATAGTCTTAATAGTCCAAACGAAATAGTCTTAATAGTCCCTAATAGTCACTAATAGTCGTAATAGTCATAATAGTCCAATAGTCTAAATAGTCTGCAATAGTCAAATAGTCAAATAGTCCGTACAATAGTCTTAATAGTCTTTGCGGCTCAATAGTCTCATAATAGTC
AATAGTCAA
26 33 93 109 118 125 132
def find_substring_locations(long, short)
mpos = []
re = Regexp.new(short)
m = i = 0
m = re.match( long, i ) { |k| j = k.begin(0); i = j + 1; mpos << j } while m
return mpos
end
def plus_one(input)
arr = []
for i in input
arr << (i += 1)
end
arr
end
main_string = gets.chomp
sub_string = gets.chomp
plus_one(find_substring_locations(main_string, sub_string))
Where did I go wrong? It appears to be in order.
EDIT: Turns out the problem was a hiccup in the environment. After a reboot the problem could not be reproduced.
I think your code is correct besides an off by one error. The plus_one
solution can be simplified.
m = re.match( long, i ) { |k| j = k.begin(0); i = j + 1; mpos << j + 1} while m
But there is an easier way to implement. You don't need the Regexp, there is an easier way to search for the index of a substring match:
String#index
takes an additional argument, the starting index.
input = "CAAATAGTCACACAATAGTCGGCTAAATAGTCAATAGTCAAATAGTCAGAGCTAATAGTCTAAATAGTCGAAAAATAGTCATCAATAGTCTAAATAGTCAATAGTCGGAATAGTCAAAATAGTCAATAGTCAATAGTCAATAGTCGACTAAATAGTCCCAATAGTCTCAGAAATAGTCAATAGTCGTAATAGTCAATAGTCTAATAGTCTAATAGTCCAATAGTCTGTCAAATAGTCAATAGTCCAATAGTCGTTTAATAGTCCCCTTTACCAATAGTCAATAGTCCGAATAGTCAGGAATAGTCAGCACTAATAGTCAATAGTCCTAATAGTCCCAATAGTCAAAATAGTCAATAGTCTAAATAATAGTCCTAGCAGAAGAATAGTCTAATAGTCGGCAATAGTCAATAGTCAAATAGTCAGAATAGTCAAATAGTCGAAATAGTCAATAGTCAATAGTCAAATAGTCAAATAGTCAATAGTCAAATAGTCAAATAGTCAAATAGTCGAATAGTCTGTAATAGTCAATAGTCCTTCAATAGTCTAATAGTCATTCAATAGTCAAGAAATAGTCGGGGGAATAGTCCGAATAGTCAAATAGTCAATAGTCGAATAGTCTAATAGTCAATAGTCTAATAGTCTGATAATAGTCAAATAGTCAATAGTCTAAATAGTCGCCTATGCCAATAGTCTTATCAAATAGTCTCTTAATAGTCTAATAGTCAATAGTCAATAGTCTAATAGTCATAATAGTCAATAGTCAAGGAATAGTCCCATAATAGTCAATAGTCTTAATAGTCCAAACGAAATAGTCTTAATAGTCCCTAATAGTCACTAATAGTCGTAATAGTCATAATAGTCCAATAGTCTAAATAGTCTGCAATAGTCAAATAGTCAAATAGTCCGTACAATAGTCTTAATAGTCTTTGCGGCTCAATAGTCTCATAATAGTC"
query = "AATAGTCAA"
def substring_index(input, query)
last = 0
matches = []
while index = input.index(query, last) do
matches << index += 1
last = index
end
matches
end
p substring_index(input, query)
# => [26, 33, 93, 109, 118, 125, 132, 172, 188, 231, 273, 312, 337, 346, 400, 407, 424, 441, 448, 455, 463, 471, 478, 486, 494, 520, 557, 589, 597, 620, 646, 654, 718, 725, 749, 756, 778, 882, 890]