python regex replace ttl

Replacing Overlapping Regex Patterns in Python


I am trying to make a .ttl file I was handed digestible. One of the issues is that the rdfs:seeAlso values are not sanitized, which breaks downstream programs. Specifically, there are links of the form:

rdfs:seeAlso prefix:value_(discipline)

In order to fix this, I need to precede particular characters with a \, per the RDF 1.1 Turtle documentation. Of the characters present, I need to escape the following:

_, ~, -, !, $, &, (, ), *, +, =, ?, #, %

At first I thought this would be easy and I began constructing a re.sub() pattern. I tried a number of potential solutions, but the closest I could get was with:

re.sub(pattern=r"(rdfs\:seeAlso)(.{0,}?)([\_\~\-\!\$\&\(\)\*\+\=\?\#\%]{1})(.{0,})", repl='\\1\\2\\\\\\3\\4', string=str_of_ttl_file)

The (rdfs\:seeAlso) component was added in order to prevent accidentally changing characters within strings that are instances of rdfs:label and rdfs:comment (i.e. any of the above characters in between '' or "").

However, this has the drawback of only working for the first occurrence and results in:

rdfs:seeAlso prefix:value\_(discipline)

Where it should be

rdfs:seeAlso prefix:value\_\(discipline\)

Any help or guidance with this would be much appreciated!

EDIT 1: Instances of rdfs:label and rdfs:comment are strings that are between single (') or double (") quotes, such as:

rdfs:label "example-label"@en

Or

rdfs:comment "This_ is+ an $example$ comment where n&thing should be replaced."@en

The special characters there do not need to be replaced for Turtle to function and should therefore be left alone by the regular expression.


Solution

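  • First, you don't have to escape these characters inside [...] in your pattern (the - should come last, however, otherwise it will be read as a range). This makes the pattern more readable. A single re.sub() call only fixes the first character because each match consumes the rdfs:seeAlso prefix, so you can repeat the replacement in a while loop until the text stops changing, and use a lookbehind to ensure that the character isn't already escaped: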

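    import re
    
    input_text = "rdfs:seeAlso prefix:value_(discipline)"
    
    # Group 1: everything from rdfs:seeAlso up to the next special character.
    # (?<!\\) skips characters that are already preceded by a backslash.
    pattern = re.compile(r"(rdfs:seeAlso.*?)(?<!\\)([_~!$&()*+=?#%-])")
    
    # Each pass escapes one character per rdfs:seeAlso match, so repeat
    # until a pass leaves the text unchanged.
    repl_str = ''
    while repl_str != input_text:
        repl_str = input_text
        input_text = pattern.sub(r'\1\\\2', repl_str)
    
    print(input_text)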
    

    Note: using a raw string for the replacement pattern makes it much more readable.

    Output:

    rdfs:seeAlso prefix:value\_\(discipline\)
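
    As an alternative to looping, the same result can probably be reached in one pass by matching the whole rdfs:seeAlso value and escaping inside a replacement function. This is only a sketch: the helper name escape_see_also is made up for illustration, and it assumes each rdfs:seeAlso object is a single prefixed name without whitespace on the same line as the predicate (not an <...> IRI).

```python
import re

# Characters listed in the question; "-" is last so it is literal.
# The lookbehind leaves already-escaped characters alone.
SPECIAL = re.compile(r"(?<!\\)([_~!$&()*+=?#%-])")

# Assumption: the object is one whitespace-free token after rdfs:seeAlso.
SEE_ALSO = re.compile(r"(rdfs:seeAlso\s+)(\S+)")


def escape_see_also(ttl_text):
    """Backslash-escape special characters in rdfs:seeAlso values only."""

    def fix(match):
        predicate, value = match.groups()
        # Escape every special character inside this one value.
        return predicate + SPECIAL.sub(r"\\\1", value)

    return SEE_ALSO.sub(fix, ttl_text)


sample = (
    "rdfs:seeAlso prefix:value_(discipline) ;\n"
    'rdfs:label "example-label"@en'
)
print(escape_see_also(sample))
```

    Because only the text captured after rdfs:seeAlso is passed to the inner substitution, quoted rdfs:label and rdfs:comment strings on other lines are not touched, and running the function twice should not double-escape anything.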