python regex indexing match overlapping-matches

Match and index all substrings, including overlapping ones


I'm trying to index the matches returned by findall from the new regex module, so that overlapping matches are included. However, I can only get the matched strings, not their locations in the sequence.

My code:

import regex as re
seq = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring = "GTTTGCAG"
xx = re.findall(substring, seq, overlapped=True)
print(xx)

xx would look like

['GTTTGCAG', 'GTTTGCAG']

because there are two matches, at positions 10-17 and 25-32 (counting from 1).

How can I obtain these numbers? According to dir(xx), the list returned by findall has no start/end/pos attribute I could use. (I tried xx.index(substring), but that only gives an index into the resulting list, e.g. 0 and 1 in this case.)

Thank you.


Solution

  • With re.finditer, each match object has a start() method, so you can recover the locations:

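    import re
    seq = "blahblahblahLALALAblahblahLALA"
    substring = "LALA"
    lenss = len(substring)
    # Wrap the escaped substring in a lookahead so that matches may overlap
    overlapsearch = "(?=(" + re.escape(substring) + "))"
    # The lookahead is zero-width, so the end is computed as start + length
    xx = [[x.start(), x.start() + lenss] for x in re.finditer(overlapsearch, seq)]
    # Slice the sequence back out to confirm each span is a real occurrence
    check = [seq[x[0]:x[1]] for x in xx]
    print(xx)
    print(check)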
    

    Results:

    [[12, 16], [14, 18], [26, 30]]
    ['LALA', 'LALA', 'LALA']
    

    And results using your original example:

    [[9, 17], [24, 32]]
    ['GTTTGCAG', 'GTTTGCAG']
    

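    Wrapping the substring in a lookahead, "(?=...)", makes each match zero-width: the regex engine only checks that the substring starts at the current position and does not consume it. The search therefore resumes one character later, and the next match can reuse characters from the previous one. The positions above are 0-based with an exclusive end, which is why the end is computed as start plus the substring length.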
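
To make the effect of the lookahead concrete, here is a minimal sketch that compares a plain search with the lookahead version on the same string. The helper name find_overlapping is hypothetical (it is not part of re or regex), and the code assumes Python 3 and only the standard re module.

```python
import re


def find_overlapping(substring, seq):
    """Return (start, end) spans of every occurrence of substring in seq,
    including overlapping ones. Hypothetical helper for illustration."""
    # re.escape treats the substring as literal text; the lookahead
    # asserts a match at the current position without consuming it.
    pattern = re.compile("(?=" + re.escape(substring) + ")")
    return [(m.start(), m.start() + len(substring)) for m in pattern.finditer(seq)]


seq = "blahblahblahLALALAblahblahLALA"

# A plain pattern consumes each match, so the occurrence at 14 is skipped.
plain = [m.span() for m in re.finditer("LALA", seq)]
overlapping = find_overlapping("LALA", seq)

print(plain)        # [(12, 16), (26, 30)]
print(overlapping)  # [(12, 16), (14, 18), (26, 30)]
```

If you prefer to stay with the third-party regex module, its finditer is documented to accept the same overlapped=True flag as findall, which should give match objects with start() and end() directly. That variant is not exercised in the sketch above.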