python regex backreference capture-group

Numerical reference for backreference not working out in Python


I am trying to deal with difflib matches that return two-word place names when only one of the words was used to make the match. That is, when I do the regex substitution on the difflib result, the second word ends up doubled.

Approach:

I don't understand the output I am getting using Python backreferences.

# removeDupeWords.py    --- test to remove double words eg "The sun shines in,_Days_Bay Bay some of the time"

import re

testString = "The sun shines in,_Days_Bay Bay some of the time"

# regex to capture comma to space of testString e.g ',_Days_Bay'
refRegex = r'(,\S+)'

# regex to capture the above plus everything after it, e.g. ' Bay some of the time'
afterRegex = r'(,\S+)(.*)'

refString = re.search(refRegex, testString).group(0)
# print(refString)

afterString = re.sub(afterRegex, r'\2', testString)
print(afterString)

The output for r'\0', r'\1' & r'\2' is as follows:

The sun shines in
The sun shines in,_Days_Bay
The sun shines in Bay some of the time

I just want ' Bay some of the time'. The docs (Regular Expression HOWTO) don't go into backreferences in much detail, and I couldn't find enough information to explain why I would get any output at all for r'\0'.


Solution

  • Let's try this again.

    You are using re.sub, which only messes with the part of the string that actually matches your regex. Your regex divides your original string into three parts: 'The sun shines in', which does not match your regex at all and will not be replaced by anything; ',_Days_Bay', which matches the first parenthesized group (,\S+) and goes into \1; and the rest of the string, ' Bay some of the time' (leading space included), which matches the second parenthesized group (.*) and goes into \2.

    So, the entire regex match is ',_Days_Bay Bay some of the time', and all of that will be removed from the result and replaced with whatever you pass as the second argument to re.sub.
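The split described above can be checked with re.search, which exposes the same groups that re.sub works with. This is only a small sketch using the string and pattern from the question; the variable names are mine.

```python
import re

test_string = "The sun shines in,_Days_Bay Bay some of the time"
pattern = r'(,\S+)(.*)'

m = re.search(pattern, test_string)

# Everything before the match is what re.sub leaves alone.
print(repr(test_string[:m.start()]))  # 'The sun shines in'

# The whole match is what re.sub throws away and replaces.
print(repr(m.group(0)))  # ',_Days_Bay Bay some of the time'

# The two parenthesized groups, available as \1 and \2 in the replacement.
print(repr(m.group(1)))  # ',_Days_Bay'
print(repr(m.group(2)))  # ' Bay some of the time'
```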

    The part that did not match at all was 'The sun shines in', so it goes into your result string without modification.

    Once again, re.sub only modifies the part of the string that matches your regex.
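If the goal is only ' Bay some of the time', one option is to skip re.sub and read group 2 from the match object. Another is to let the pattern consume the leading text too, so that the whole string is replaced. As for r'\0': in a replacement string Python treats \0 as an octal escape for the NUL character, not as "group 0", so the match is replaced by a single invisible character. The way to refer to the whole match in a replacement is \g<0>. The sketch below illustrates both points; the variable names and the widened pattern are my own, not from the question.

```python
import re

test_string = "The sun shines in,_Days_Bay Bay some of the time"
pattern = r'(,\S+)(.*)'

# Option 1: no substitution at all, just read the second group.
tail = re.search(pattern, test_string).group(2)
print(repr(tail))

# Option 2: a lazy prefix makes the pattern match from the start of the
# string, so re.sub replaces everything with the contents of group 2.
tail_via_sub = re.sub(r'.*?(,\S+)(.*)', r'\2', test_string)
print(repr(tail_via_sub))

# r'\0' in the replacement is the NUL character, not the whole match.
nul_result = re.sub(pattern, r'\0', test_string)
print(repr(nul_result))

# r'\g<0>' is the whole match, so the string comes back unchanged.
whole_match_result = re.sub(pattern, r'\g<0>', test_string)
print(repr(whole_match_result))
```

Printing with repr makes the trailing '\x00' visible, which is why the r'\0' output in the question looks like plain 'The sun shines in'.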