[SOLVED] Match and index all substrings, including overlapping ones

I'm trying to get the positions of the matches returned by findall in the newer regex module, which can report overlapping matches. I can find the matches, but I can't get the correct locations for them.

My code:

import regex as re  # third-party regex module, not the standard-library re
seq = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring = "GTTTGCAG"
xx = re.findall(substring, seq, overlapped=True)
print(xx)

xx would look like

['GTTTGCAG', 'GTTTGCAG']

because there are two matches, at positions 10-17 and 25-32 (counting from 1).

How can I obtain these numbers? Checking dir(xx) shows no start/end/pos that I could use with this new function. (I tried xx.index(substring), but this seems to only give the index within the resulting list, e.g. 0 and 1 in this case.)

Thank you.

Solution

re.finditer yields match objects rather than strings, so you can read the start location of each match with start():

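import re
seq = "blahblahblahLALALAblahblahLALA"
substring = "LALA"
lenss = len(substring)
# Wrap the escaped substring in a lookahead so that matches may overlap.
overlapsearch = "(?=(" + re.escape(substring) + "))"
# Each entry is [start, end] with 0-based indices and an exclusive end.
xx = [[x.start(), x.start() + lenss] for x in re.finditer(overlapsearch, seq)]
check = [seq[x[0]:x[1]] for x in xx]
print(xx)
print(check)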

Results:

[[12, 16], [14, 18], [26, 30]]
['LALA', 'LALA', 'LALA']

And the results using your original example (0-based start, exclusive end):

[[9, 17], [24, 32]]
['GTTTGCAG', 'GTTTGCAG']
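
The question quotes the positions as 10-17 and 25-32, which are the same two matches counted from 1 with an inclusive end. A minimal sketch of the conversion follows; the helper name overlapping_spans is made up for this illustration and is not part of re or regex:

```python
import re

def overlapping_spans(substring, seq):
    """Return 0-based [start, end] pairs (end exclusive) for every occurrence."""
    # Hypothetical helper: escape the literal substring, then wrap it in a lookahead.
    lookahead = "(?=" + re.escape(substring) + ")"
    width = len(substring)
    return [[m.start(), m.start() + width] for m in re.finditer(lookahead, seq)]

seq = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
spans = overlapping_spans("GTTTGCAG", seq)

# 1-based inclusive positions: shift the start by one, keep the end as is.
one_based = [[start + 1, end] for start, end in spans]

print(spans)      # [[9, 17], [24, 32]]
print(one_based)  # [[10, 17], [25, 32]]
```

Only the start needs shifting: an exclusive 0-based end and an inclusive 1-based end are the same number.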

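Wrapping the substring in a lookahead, "(?=...)", makes each match zero-width: the regex engine checks that the substring starts at the current position without consuming any characters, so the search resumes one character later and can find occurrences that overlap the previous one. Because the match itself is empty, the end position has to be computed as the start plus the length of the substring.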
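
To see the effect of the lookahead in isolation, here is a small comparison on a string where two occurrences overlap. It is only a sketch using the standard-library re module; the variable names are chosen for the illustration:

```python
import re

seq = "LALALA"
pattern = re.escape("LALA")  # escape in case the substring contains regex metacharacters

# Plain pattern: the first match consumes indices 0-3, so scanning resumes at index 4.
plain_starts = [m.start() for m in re.finditer(pattern, seq)]

# Lookahead: every match is empty, so scanning resumes at the very next character.
lookahead_starts = [m.start() for m in re.finditer("(?=" + pattern + ")", seq)]

print(plain_starts)      # [0]
print(lookahead_starts)  # [0, 2]
```

The plain search misses the occurrence at index 2 because its first two characters were already consumed by the match at index 0.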