Python Regex: Finding a Match That Starts Inside a Previous Match


I'm looking to find the indices of all occurrences of a substring in a string in Python. My current regex code can't find a match that has its start in a previous match.

I have a string s = r'GATATATGCATATACTT' and a substring t = r'ATAT'. There should be matches at indices 1, 3, and 9. The following code only shows matches at indices 1 and 9, because index 3 falls within the first match. How do I get all matches to appear?

Thanks so much!

import re

s = 'GATATATGCATATACTT'
t = r'ATAT'

pattern = re.compile(t)

# finditer yields only non-overlapping matches, so index 3 is skipped
for m in pattern.finditer(s):
    print(m)

Solution

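  • Since the matches overlap, put a capturing group inside a lookahead: (?=(YOUR_EXPR)). A lookahead is zero-width, so each match consumes no characters and the search resumes at the very next position. The matched text itself is available in group 1.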

    import re
    
    s = 'GATATATGCATATACTT'
    t = r'(?=(ATAT))'
    
    pattern = re.compile(t)
    
    # Each match is empty (zero-width); only its position matters
    for m in pattern.finditer(s):
        print(m)
    

    Output:

    <re.Match object; span=(1, 1), match=''>
    <re.Match object; span=(3, 3), match=''>
    <re.Match object; span=(9, 9), match=''>
    

    Or:

    for m in pattern.finditer(s):
        print(m.start())
    

    Output:

    1
    3
    9
    

    Or:

    import re
    
    s = 'GATATATGCATATACTT'
    t = 'ATAT'
    
    pattern = re.compile(f'(?=({t}))')
    
    # group(1) holds the text captured inside the lookahead
    print([(m.start(), m.group(1)) for m in pattern.finditer(s)])
    

    Output:

    [(1, 'ATAT'), (3, 'ATAT'), (9, 'ATAT')]
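
If it helps, the lookahead approach can be wrapped in a small helper, shown here next to a regex-free alternative for comparison. This is only a sketch: the function names overlapping_starts_regex and overlapping_starts_find are made up for illustration, and it assumes t is meant as a literal substring (hence re.escape) rather than as a regex pattern.

```python
import re


def overlapping_starts_regex(s, t):
    """Return 0-based start indices of all occurrences of t in s, overlaps included."""
    # re.escape makes t match literally (assumption: t is plain text, not a pattern)
    pattern = re.compile(f'(?=({re.escape(t)}))')
    return [m.start() for m in pattern.finditer(s)]


def overlapping_starts_find(s, t):
    """Same result without regex: restart the search one character after each hit."""
    starts = []
    i = s.find(t)
    while i != -1:
        starts.append(i)
        i = s.find(t, i + 1)  # i + 1 rather than i + len(t), so overlaps are kept
    return starts


s = 'GATATATGCATATACTT'
t = 'ATAT'

print(overlapping_starts_regex(s, t))  # [1, 3, 9]
print(overlapping_starts_find(s, t))   # [1, 3, 9]
```

Both versions advance by a single position after each hit, which is what lets the match at index 3 show up even though it starts inside the match at index 1.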