python character-encoding ttl rdflib

rdflib's parseQuery decodes the query string, which causes an invalid URI


I have the following ttl file:

@prefix : <https://www.example.co/reserved/language#> .

<https://www.example.co/reserved/root> :_id "01G39WKRH76BGY5D3SKDHJP2SX" ;
    :transcript%20data [ :_id "01G39WKRH7JYRX78X7FG4RCNYF" ;
            :_key "transcript%20data" ;
            :value "value" ;
            :value_id "01G39WKRH7PVK1DXQHWT08DZA8" ] .

And I have the following query:

q = """
PREFIX : <https://www.example.co/reserved/language#>

    SELECT  ?o 
    WHERE { ?s :transcript%20data/:value ?o . }
""" 

When I run this query against the graph parsed from the ttl file, I get the following error:

https://www.example.co/reserved/language#transcript data does not look like a valid URI, trying to serialize this will break.

As you can see, parseQuery has decoded the "%20" to a space " ", which causes an invalid URI. The decoded URI then returns False when passed to the _is_valid_uri function.
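To make the failure mode concrete, here is a minimal standard-library sketch that does not call rdflib at all. The names INVALID_URI_CHARS and looks_like_valid_uri are hypothetical stand-ins, and the character set is an assumption modelled on the kind of check _is_valid_uri performs, not a copy of rdflib's code.

```python
from urllib.parse import unquote

# Assumed set of characters that must not appear in a URI.
# This is an illustrative stand-in, not rdflib's own definition.
INVALID_URI_CHARS = '<>" {}|\\^`'


def looks_like_valid_uri(uri: str) -> bool:
    """Return False if the URI contains any character from the assumed set."""
    return not any(char in uri for char in INVALID_URI_CHARS)


BASE = "https://www.example.co/reserved/language#"

# The predicate as written in the ttl file and in the query.
encoded = BASE + "transcript%20data"

# What the predicate becomes once "%20" is percent-decoded.
decoded = BASE + unquote("transcript%20data")

print(looks_like_valid_uri(encoded))
print(looks_like_valid_uri(decoded))
```

The encoded form passes the check, while the decoded form contains a literal space and fails it. The decoded form also no longer matches the predicate stored in the graph, which keeps "%20".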

I've tested the query on different SPARQL engines, and it is valid and works as expected. How can I make the query valid in rdflib and get the required results?

I am using rdflib Version: 6.1.1 on macOS Monterey 12.4


Solution

  • This was a bug in rdflib's SPARQL parser, and it is fixed in this PR:

    It seems that the internal SPARQL parser function _hexExpand inappropriately expands percent-encoded reserved characters. The PR adds an exclusionary regexp to disable this behaviour, and a parameterized test that checks how the SPARQL parser processes the set of percent-encoded reserved characters.
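The idea behind the fix can be sketched as follows. This is not the code from the PR: selective_unquote and KEEP_ENCODED are hypothetical names, the set of protected characters is an assumption, and the sketch uses a lookup set inside a substitution callback rather than the exclusionary regexp the PR describes.

```python
import re

# Characters whose percent-encoded form is left untouched (assumed set):
# the RFC 3986 reserved characters, plus characters that are not allowed
# in a URI at all, plus "%" itself so that no new escapes are created.
KEEP_ENCODED = set(":/?#[]@!$&'()*+,;=") | set('<>" {}|\\^`%')

# Matches one percent-escape such as "%20" or "%2f".
PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def selective_unquote(local_name: str) -> str:
    """Decode percent-escapes except those for protected characters."""

    def expand(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Keep the escape for control characters, space and protected characters.
        if ord(char) <= 0x20 or char in KEEP_ENCODED:
            return match.group(0)
        return char

    return PERCENT_ESCAPE.sub(expand, local_name)


print(selective_unquote("transcript%20data"))  # escape for the space is kept
print(selective_unquote("a%41b"))              # escape for "A" is decoded
```

With a rule like this, the local name from the question keeps its "%20", so the predicate in the query still matches the predicate in the graph.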