I'm trying to get the positions of matches found with the new regex module's findall, so that overlapping matches are included. I can find the matches, but I can't get their locations.
My code:
import regex as re
seq = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring="GTTTGCAG"
xx=re.findall(substring,seq,overlapped=True)
print(xx)
xx would look like
['GTTTGCAG', 'GTTTGCAG']
because there are two matches, at positions 10-17 and 25-32 (counting from 1).
However, how can I obtain these numbers? Checking dir(xx) shows no start/end/pos that I could use with this new function. (I tried xx.index(substring), but that only gives the index within the resulting list, e.g. 0 and 1 in this case.)
Thank you.
Using re.finditer, you can obtain start locations:
import re
seq = "blahblahblahLALALAblahblahLALA"
substring="LALA"
lenss=len(substring)
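# re.escape makes the substring match literally; a bare backslash in front of it is rejected as a bad escape by newer versions of re
overlapsearch="(?=("+re.escape(substring)+"))"
xx=[[x.start(),x.start()+lenss] for x in re.finditer(overlapsearch,seq)]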
check=[seq[x[0]:x[1]] for x in xx]
print(xx)
print(check)
Results:
[[12, 16], [14, 18], [26, 30]]
['LALA', 'LALA', 'LALA']
And results using your original example:
[[9, 17], [24, 32]]
['GTTTGCAG', 'GTTTGCAG']
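Wrapping the substring in a lookahead, "(?=...)", makes each match zero-width: the regex only checks that the substring starts at the current position and consumes no characters, so the next match can reuse characters from the previous one. The inner parentheses capture the matched text.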
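If you need this more than once, the same lookahead idea can be wrapped in a small helper. This is only a sketch: the function name find_overlapping is made up for illustration, and it assumes you want 0-based, end-exclusive positions (the same convention as Python slices). The last lines convert those to the 1-based positions quoted in the question.

```python
import re

def find_overlapping(substring, seq):
    """Return (start, end) pairs for every occurrence of substring in seq,
    including overlapping ones. Positions are 0-based, end-exclusive."""
    # re.escape makes the substring match literally; the lookahead (?=...)
    # is zero-width, so matching can restart inside a previous occurrence.
    pattern = re.compile("(?=(" + re.escape(substring) + "))")
    # Group 1 is the text captured inside the lookahead, so its span
    # gives the real start and end of each occurrence.
    return [(m.start(1), m.end(1)) for m in pattern.finditer(seq)]

seq = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
spans = find_overlapping("GTTTGCAG", seq)
print(spans)      # [(9, 17), (24, 32)]

# Convert to 1-based, end-inclusive positions as used in the question
one_based = [(start + 1, end) for start, end in spans]
print(one_based)  # [(10, 17), (25, 32)]
```

Reading the span from group 1 avoids having to add len(substring) by hand, which also keeps the code correct if the pattern is later changed to something of variable length.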