python regex non-latin

How to return whole non-latin strings matching a reduplication pattern, such as AAB or ABB


I am working with strings of non-latin characters. I want to match strings with reduplication patterns, such as AAB, ABB, ABAB, etc. I tried out the following code:

import re

patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.findall(rawtext)
print(match) 

However, it returns only the first character of each matched string. I know this happens because of the capturing parentheses around the first \w.

I tried to add capturing parentheses around the whole matched block, but Python raises

error: cannot refer to an open group at position 7

I also found this method, but it didn't work for me either:

patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(1))

How could I match the pattern and return the whole matching string?

# Ex. 哈哈笑 
# string matches AAB pattern so my code returns 哈 
# but not the entire string

Solution

  • The message:

    error: cannot refer to an open group at position 7
    

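    is telling you that \1 now refers to the outer group, the one whose parentheses wrap the whole pattern. Groups are numbered by the position of their opening parenthesis, so the outer group is number 1 and is still open when \1 appears inside it. The group you want to backreference is number 2, so this code works: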

    import re
    
    rawtext = 'abc 哈哈笑 def'
    
    patternAAB = re.compile(r'\b((\w)\2\w)\b')
    match = patternAAB.findall(rawtext)
    print(match)
    

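    Because the pattern has two groups, findall returns one tuple per match, holding the whole string (group 1) followed by the repeated character (group 2):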

    [('哈哈笑', '哈')]
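
If only the whole strings are wanted, another option is to keep a single group per repeated character and read group(0) from finditer, which always holds the full match regardless of capture groups. The following is a minimal sketch of that idea, extended to the ABB and ABAB patterns mentioned in the question. The names PATTERNS and whole_matches and the extra sample words are assumptions made for illustration, not part of the original code. It also assumes, like the sample text, that words are separated by spaces or punctuation, since \b does not fall between adjacent CJK characters.

```python
import re

# Sample text extended with hypothetical ABB and ABAB words for illustration.
rawtext = 'abc 哈哈笑 def 笑哈哈 研究研究'

# One numbered group per character that has to be repeated later on.
PATTERNS = {
    'AAB': re.compile(r'\b(\w)\1\w\b'),
    'ABB': re.compile(r'\b\w(\w)\1\b'),
    'ABAB': re.compile(r'\b(\w)(\w)\1\2\b'),
}


def whole_matches(pattern, text):
    """Return the full matched substrings, ignoring capture groups."""
    # group(0) is the entire match, unlike findall, which returns the groups.
    return [m.group(0) for m in pattern.finditer(text)]


results = {name: whole_matches(p, rawtext) for name, p in PATTERNS.items()}
print(results)
# {'AAB': ['哈哈笑'], 'ABB': ['笑哈哈'], 'ABAB': ['研究研究']}
```

Note that these patterns do not require A and B to differ, so a string such as 哈哈哈 would match both AAB and ABB.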