I am trying to make a .ttl file I was handed digestible. One of the issues is that the rdfs:seeAlso values are not sanitized, which breaks downstream programs. By this I mean that there are links of the form:
rdfs:seeAlso prefix:value_(discipline)
To fix this, I need to precede particular characters with a \, per the RDF 1.1 Turtle documentation. Of the characters present, I need to escape the following:
_, ~, -, !, $, &, (, ), *, +, =, ?, #, %
At first I thought this would be easy and I began constructing a re.sub() pattern. I tried a number of potential solutions, but the closest I could get was with:
re.sub(pattern=r"(rdfs\:seeAlso)(.{0,}?)([\_\~\-\!\$\&\(\)\*\+\=\?\#\%]{1})(.{0,})", repl='\\1\\2\\\\\\3\\4', string=str_of_ttl_file)
The (rdfs\:seeAlso) component was added to avoid accidentally changing characters within strings that are instances of rdfs:label and rdfs:comment (i.e. any of the above characters in between '' or "").
However, this has the drawback of only working for the first occurrence and results in:
rdfs:seeAlso prefix:value\_(discipline)
when it should be:
rdfs:seeAlso prefix:value\_\(discipline\)
Any help or guidance with this would be much appreciated!
EDIT 1: Instances of rdfs:label and rdfs:comment are strings between single (') or double (") quotes, such as:
rdfs:label "example-label"@en
Or
rdfs:comment "This_ is+ an $example$ comment where n&thing should be replaced."@en
The special characters there do not need to be replaced for Turtle to function and should therefore be left alone by the regular expression.
First, you don't have to escape these characters inside [...] in your pattern (- should be placed last, however, otherwise it will be interpreted as a range). This makes the pattern more readable. Then you can replace in a while loop, since each pass escapes only one character per rdfs:seeAlso occurrence, and use a lookbehind to ensure that the character isn't already escaped:
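import re

input_text = "rdfs:seeAlso prefix:value_(discipline)"
# Lazily match from rdfs:seeAlso up to the next special character that is not already escaped.
pattern = re.compile(r"(rdfs:seeAlso.*?)(?<!\\)([_~!$&()*+=?#%-])")
repl_str = ''
# Each pass escapes one more character per rdfs:seeAlso line; stop once nothing changes.
while repl_str != input_text:
    repl_str = input_text
    input_text = re.sub(pattern, r'\1\\\2', repl_str)
print(input_text)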
Note: using a raw string for the replacement pattern makes it much more readable.
Output:
rdfs:seeAlso prefix:value\_\(discipline\)
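If you would rather avoid the loop, a single pass with a replacement function might also work. This is only a sketch under some assumptions: the object of each rdfs:seeAlso is one prefixed name without whitespace, the prefix itself contains none of the listed characters, and there is only one object per rdfs:seeAlso. The names SPECIAL, SEE_ALSO and escape_see_also are made up for this example.

```python
import re

# Characters from the question; "-" is last so it is literal, and the
# lookbehind skips characters that already have a backslash in front.
SPECIAL = re.compile(r"(?<!\\)[_~!$&()*+=?#%-]")

# Group 1: the predicate and following whitespace.
# Group 2: the object token, assumed to contain no whitespace.
SEE_ALSO = re.compile(r"(rdfs:seeAlso\s+)(\S+)")


def escape_see_also(ttl_text):
    """Escape the listed characters in every rdfs:seeAlso object in one pass."""
    def fix(match):
        predicate, obj = match.groups()
        # A function replacement is inserted as-is, so "\\" is one backslash.
        return predicate + SPECIAL.sub(lambda m: "\\" + m.group(0), obj)
    return SEE_ALSO.sub(fix, ttl_text)


sample = (
    'rdfs:label "example-label"@en ;\n'
    'rdfs:seeAlso prefix:value_(discipline) .'
)
print(escape_see_also(sample))
```

Because only the token after rdfs:seeAlso is touched, quoted rdfs:label and rdfs:comment strings on other lines are left alone, and running the function twice should not double-escape anything thanks to the lookbehind.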