pythonregexoverlapping-matches

Overlapping regex matches


I'm trying to create the following regular expression: return a string between AUG and (UAG or UGA or UAA) from a following RNA string: AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG, so that all matches would be found, including the overlapping ones.

I've tried several regexes, ending up with something like that:

matches = re.findall('(?=AUG)(\w+)(?=UAG|UGA|UAA)',"AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG")

Could you show me the errors in my regex pattern?


Solution

  • Doing this with one regular expression is actually pretty difficult, as most uses specifically don't want overlapping matches. You could, however, do this with some simple iteration:

    regex = re.compile('(?=AUG)(\w+)(?=UAG|UGA|UAA)');
    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    matches = []
    tmp = RNA
    while (match = regex.search(tmp)):
        matches.append(match)
        tmp = tmp[match.start()-2:]  #Back up two to get the UG portion.  Shouldn't matter, but safer.
    
    for m in matches:
        print m.group(0)
    

    Though, this has some problems. What would you expect the return to be in the case of AUGUAGUGAUAA? Are there two strings to be returned? Or just one? Right now, your regular expression isn't even capable of capturing UAG, as it continues on through to match UAGUGA and get cut off at UAA. To combat this, then, you might wish to use the ? operator to make your operator lazy - an approach that then will be unable to capture the longer substring.

    Maybe iteration over the string twice is the answer, but then what if your RNA Sequence contains AUGAUGUAGUGAUAA? What's the correct behaviour there?

    I might favor a regular expression free approach, by iterating over the string and its substrings:

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    candidates = []
    start = 0
    
    while (RNA.find('AUG', start) > -1):
        start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
        candidates.append(RNA[start+3:])
        start += 1
    
    matches = []
    
    for candidate in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            end = 1;
            while(candidate.find(terminator, end) > -1):
                end = candidate.find(terminator, end)
                matches.append(candidate[:end])
                end += 1
    
    for match in matches:
        print match
    

    This way, you're sure to get all matches, no matter what.

    If you need to keep track of the position of each match, you can modify your candidates data structure to use tuples which maintain the starting position:

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    candidates = []
    start = 0
    
    while (RNA.find('AUG', start) > -1):
        start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
        candidates.append((RNA[start+3:], start+3))
        start += 1
    
    matches = []
    
    for candidate in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            end = 1;
            while(candidate[0].find(terminator, end) > -1):
                end = candidate[0].find(terminator, end)
                matches.append((candidate[1], candidate[1] + end, candidate[0][:end]))
                end += 1
    
    for match in matches:
        print "%d - %d: %s" % match
    

    which prints:

    7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    7 - 31: UAGCUAACUCAGGUUACAUGGGGA
    7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    7 - 11: UAGC
    7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    27 - 49: GGGAUGACCCCGCGACUUGGAU
    27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    27 - 31: GGGA
    27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    33 - 49: ACCCCGCGACUUGGAU
    33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    78 - 85: AUCCGAG
    

    Hell, with literally three more lines, you can even sort the matches according to where they fall in the RNA sequence:

    from operator import itemgetter
    matches.sort(key=itemgetter(1))
    matches.sort(key=itemgetter(0)) 
    

    That placed before the final print nets you:

    007 - 011: UAGC
    007 - 031: UAGCUAACUCAGGUUACAUGGGGA
    007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    027 - 031: GGGA
    027 - 049: GGGAUGACCCCGCGACUUGGAU
    027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    033 - 049: ACCCCGCGACUUGGAU
    033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    078 - 085: AUCCGAG