I am working with strings of non-Latin characters. I want to match strings with reduplication patterns, such as AAB, ABB, ABAB, etc. I tried out the following code:
import re
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.findall(rawtext)
print(match)
However, it returns only the first character of the matched string. I know this happens because of the capturing parentheses around the first \w.
I tried to add capturing parentheses around the whole matched block, but Python gives
error: cannot refer to an open group at position 7
I also found this method, but it didn't work for me:
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(1))
How could I match the pattern and return the whole matching string?
# Ex. 哈哈笑
# string matches AAB pattern so my code returns 哈
# but not the entire string
The message:
error: cannot refer to an open group at position 7
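is telling you that, once the whole pattern is wrapped in parentheses, \1
refers to that outer group, because groups are numbered by the position of their opening parenthesis and the outer one opens first. That group is still open when \1 appears, so it cannot be referenced yet. The group you want to backreference is number 2, so this code works: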
import re
rawtext = 'abc 哈哈笑 def'
patternAAB = re.compile(r'\b((\w)\2\w)\b')
match = patternAAB.findall(rawtext)
print(match)
Each item in match
has both groups:
[('哈哈笑', '哈')]
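If you only want the whole matched string and not a tuple of groups, another option is to keep the original pattern and read group 0 from each match object, since group 0 is always the entire match. The sketch below assumes that approach; the `PATTERNS` table and the `find_whole_matches` helper are names made up for illustration, and the ABB and ABAB patterns are my guess at what those labels mean in the question.

```python
import re

# Hypothetical pattern table; the labels come from the question,
# the ABB and ABAB regexes are assumptions about what they mean.
PATTERNS = {
    'AAB': re.compile(r'\b(\w)\1\w\b'),
    'ABB': re.compile(r'\b(\w)(\w)\2\b'),
    'ABAB': re.compile(r'\b(\w)(\w)\1\2\b'),
}


def find_whole_matches(pattern, text):
    """Return the full text of every match, ignoring inner groups."""
    # finditer yields match objects; group(0) is the entire match.
    return [m.group(0) for m in pattern.finditer(text)]


rawtext = 'abc 哈哈笑 def'
whole_matches = find_whole_matches(PATTERNS['AAB'], rawtext)
print(whole_matches)  # ['哈哈笑']
```

The same idea fixes the search() attempt from the question: match.group(0), or just match.group(), gives the whole string, while match.group(1) gives only the first captured character. Note that the \w in the B position can match the same character as A, so these patterns also accept a string like AAA.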