
Python Regex Engine - "look-behind requires fixed-width pattern" Error


I am trying to handle unmatched double quotes within a string in CSV format.

To be precise,

"It "does "not "make "sense", Well, "Does "it"

should be corrected as

"It" "does" "not" "make" "sense", Well, "Does" "it"

So basically, I want to replace every '"' that is

  1. not preceded by the beginning of a line or a comma, and
  2. not followed by a comma or the end of a line

with '" "'.

For that I use the regex below:

(?<!^|,)"(?!,|$)

The problem is that while Ruby regex engines (http://www.rubular.com/) are able to parse the regex, Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error:

Invalid regular expression: look-behind requires fixed-width pattern

And Python 2.7.3 throws:

sre_constants.error: look-behind requires fixed-width pattern

Can anyone tell me what vexes Python here?


Edit:

Following Tim's response, I got the output below for a multi-line string:

>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '

At the end of each line, two double quotes appear after 'it'.

So I made a very small change to the regex to handle newlines:

re.sub(r'\b\s*"(?!,|$)', '" "', str, flags=re.MULTILINE)

But this gives the output

>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '

Now only the last 'it' is followed by two double quotes.

But I wonder why the '$' end-of-line anchor does not recognize that the line has ended there.


The final answer, which also allows trailing spaces or tabs before the end of the line, is

re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str, flags=re.MULTILINE)
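
A minimal sketch of why '$' alone was not enough, assuming Python 3. The two-line sample is a shortened, hypothetical stand-in for the string above, with a trailing space after the last quote (as in the triple-quoted string in the session). Without re.MULTILINE, '$' matches only at the very end of the string (or just before a final newline). With it, '$' also matches before each newline, but the last quote is still followed by a space rather than the end of the line, so the lookahead has to allow trailing blanks.

```python
import re

# Hypothetical sample: two CSV-like lines; the last one ends with a trailing space.
text = '"It "does "it"\n"It "does "it" '

pattern_plain = r'\b\s*"(?!,|$)'
pattern_final = r'\b\s*"(?!,|[ \t]*$)'   # also tolerates trailing spaces/tabs

# '$' only matches at the end of the whole string here.
without_flag = re.sub(pattern_plain, '" "', text)

# '$' now matches before each newline, but not before the trailing space.
with_flag = re.sub(pattern_plain, '" "', text, flags=re.MULTILINE)

# Trailing blanks before the end of a line are skipped by the lookahead.
final = re.sub(pattern_final, '" "', text, flags=re.MULTILINE)

for label, value in [("no flag", without_flag),
                     ("MULTILINE", with_flag),
                     ("final", final)]:
    print(label, repr(value))
```

Only the last variant leaves every line-ending quote untouched in this sample.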

Solution

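  • Python's re module requires lookbehind assertions to have a fixed width. In (?<!^|,) the alternative ^ matches zero characters and , matches one, so the pattern is rejected. You can try this instead: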

    >>> import re
    >>> s = '"It "does "not "make "sense", Well, "Does "it"'
    >>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
    '"It" "does" "not" "make" "sense", Well, "Does" "it"'
    

    Explanation:

    \b      # Start the match at the end of a "word"
    \s*     # Match optional whitespace
    "       # Match a quote
    (?!,|$) # unless it's followed by a comma or end of string
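
As a side note on the fixed-width restriction, here is a minimal sketch, assuming Python 3 and the standard re module (variable names are illustrative). It shows that the alternation from the question fails to compile, while splitting it into two separate lookbehinds, each of a single fixed width, compiles. This only removes the compile error: the chained pattern inserts a quote without consuming the preceding space, so on its own it does not produce the output requested in the question.

```python
import re

original = r'(?<!^|,)"(?!,|$)'       # alternatives of width 0 (^) and width 1 (,)
chained = r'(?<!^)(?<!,)"(?!,|$)'    # one fixed-width lookbehind per condition

# The mixed-width lookbehind is rejected at compile time.
try:
    re.compile(original)
    original_error = None
except re.error as exc:
    original_error = str(exc)

s = '"It "does "not "make "sense", Well, "Does "it"'

# subn returns the new string together with the number of replacements made.
result, count = re.subn(chained, '" "', s)

print(original_error)
print(count)
print(result)
```

Compare the printed result with the output of the \b\s*" version above, which also swallows the whitespace before each quote.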