Tags: python, python-3.x, regex, regex-lookarounds, regex-group

Regex: Find matches only outside of single quotes


I currently have a regex that selects all occurrences of " (", ", ", ")," and ");":

(\s\()|(,\s)|(\),)|(\);)

However, I've been trying to find a way to make it ignore any of the matches listed above when they fall between single quotes, 'like this, for example'. I tried many different solutions, but none of them worked for me.

Does anyone know of ways I could make this work?


Solution

  • Expressing "do not match here" is tricky, because regex engines are designed to work in a positive, greedy way: find as much as possible, wherever possible.

    The easiest and most likely fastest thing you could do is to remove the parts you want to exclude before applying your search, assuming quotes always appear in pairs:

    "'[^']*'" => ""
    

    and then apply your search to the remaining string. If the string needs to be modified in place, you could instead replace the quoted parts with arbitrary, non-colliding placeholders that do not appear naturally, run your search, and then swap the original parts back in afterwards. (I quite often use something like ###Placeholder1### for that purpose: easy to match and replace again, and almost guaranteed not to appear elsewhere naturally.)
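The simpler "remove, then search" variant could be sketched as follows. This is only an illustration under the same assumption as above (quotes are balanced and never escaped); the names `TARGET`, `QUOTED`, `find_outside_quotes` and the sample string are made up for the example, while the target pattern is the one from the question.

```python
import re

# The pattern from the question: " (", ", ", ")," and ");"
TARGET = re.compile(r"(\s\()|(,\s)|(\),)|(\);)")
# One single-quoted section; assumes balanced, unescaped quotes
QUOTED = re.compile(r"'[^']*'")

def find_outside_quotes(text):
    """Return the target matches that remain after dropping quoted sections."""
    stripped = QUOTED.sub("", text)
    return [m.group(0) for m in TARGET.finditer(stripped)]

sample = "call (a, 'b, (c);', d);"
print(find_outside_quotes(sample))  # [' (', ', ', ', ', ');']
```

Note that deleting the quoted text joins its left and right neighbours, which can in principle create a match that was not there before (for example `)'x',` becomes `),`). Replacing with a placeholder instead of an empty string avoids that.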

    Python example of the placeholder approach:

    import re
    
    text = "this is a , and this a ( whith a ) while 'this ( is in quotes,therefore excluded' unlike these: ( ) , but 'these () are again'. period."
    print(text)
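    placeholders = []
    def repl(m):
        # Stash the quoted section and leave a numbered placeholder behind
        placeholders.append(m.group(1))
        return "###Placeholder{0}###".format(len(placeholders) - 1)
    
    # 1. Hide everything between single quotes
    temp = re.sub(r"('[^']*')", repl, text)
    print(temp)
    
    # 2. Run the actual search/replace on the remaining text
    temp = re.sub(r"([,()])", r"`\1`", temp)
    print(temp)
    
    # 3. Restore the quoted sections; str.replace inserts them literally,
    #    so backslashes in the quoted text are not treated as regex escapes
    for k, contents in enumerate(placeholders):
        temp = temp.replace("###Placeholder{0}###".format(k), contents)
    
    print(temp)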
    

    (Note that the ### also ensures that Placeholder1 and Placeholder13 won't collide later on.)

    this is a , and this a ( whith a ) while 'this ( is in quotes,therefore excluded' unlike these: ( ) , but 'these () are again'. period.

    this is a , and this a ( whith a ) while ###Placeholder0### unlike these: ( ) , but ###Placeholder1###. period.

    this is a `,` and this a `(` whith a `)` while ###Placeholder0### unlike these: `(` `)` `,` but ###Placeholder1###. period.

    this is a `,` and this a `(` whith a `)` while 'this ( is in quotes,therefore excluded' unlike these: `(` `)` `,` but 'these () are again'. period.
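To see why the closing ### matters, here is a small sketch of the Placeholder1 / Placeholder13 collision mentioned in the note. The stored values and both strings are hypothetical, chosen only to make the effect visible.

```python
# Hypothetical stored values for placeholder 1 and placeholder 13
values = {1: "'first'", 13: "'thirteenth'"}

bare = "Placeholder1 and Placeholder13"
delimited = "###Placeholder1### and ###Placeholder13###"

# Restoring index 1: the bare token is a prefix of "Placeholder13",
# so the second placeholder gets damaged as well.
bare_restored = bare.replace("Placeholder1", values[1])
# With a closing delimiter, only the exact placeholder is replaced.
delimited_restored = delimited.replace("###Placeholder1###", values[1])

print(bare_restored)       # 'first' and 'first'3
print(delimited_restored)  # 'first' and ###Placeholder13###
```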


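    Alternatively, with str.format fields ({0}, {1}, ...) as placeholders and Python's * unpacking operator, the final round of re-replacing collapses into a single format call. (This will, however, cause problems if braces such as {0} appear naturally in the text):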

    import re
    
    text = "this is a , and this a ( whith a ) while 'this ( is in quotes,therefore excluded' unlike these: ( ) , but 'these () are again'. period."
    print(text)
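    placeholders = []
    def repl(m):
        placeholders.append(m.group(1))
        # Emit a str.format field such as {0}, {1}, ...
        return "{" + "{0}".format(len(placeholders) - 1) + "}"
    
    # Same pattern as above ('*' rather than '+', so an empty '' is also hidden)
    temp = re.sub(r"('[^']*')", repl, text)
    print(temp)
    
    temp = re.sub(r"([,()])", r"`\1`", temp)
    print(temp)
    
    # Fill every {k} with placeholders[k] in one call
    temp = temp.format(*placeholders)
    print(temp)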
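
If literal braces can occur in the input, one possible workaround is to double them before inserting the format fields, since str.format turns {{ and }} back into single braces. The sketch below is an assumption-laden variant of the code above, not part of the original answer; `wrap_outside_quotes` and `stash` are names invented for the example.

```python
import re

def wrap_outside_quotes(text):
    """Wrap , ( ) in backticks unless they sit inside single quotes."""
    placeholders = []

    def stash(m):
        # The match comes from the brace-doubled text, so undo the doubling
        # here; format() inserts argument values verbatim.
        placeholders.append(m.group(0).replace("{{", "{").replace("}}", "}"))
        return "{%d}" % (len(placeholders) - 1)

    # Double literal braces first so format() treats them as plain text
    escaped = text.replace("{", "{{").replace("}", "}}")
    temp = re.sub(r"'[^']*'", stash, escaped)
    temp = re.sub(r"([,()])", r"`\1`", temp)
    return temp.format(*placeholders)

result = wrap_outside_quotes("f(x) = {0}, 'g(y), {1}'")
print(result)  # f`(`x`)` = {0}`,` 'g(y), {1}'
```

As before, this assumes single quotes are balanced and never escaped.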