sparql wikidata blazegraph federated-queries

comparing labels in a federated query


I have an instance of Wikibase running. I'm able to run federated queries with Wikidata successfully. I have certain queries that compare labels like this:

PREFIX xwdt: <http://www.wikidata.org/prop/direct/>
PREFIX xwd: <http://www.wikidata.org/entity/>
PREFIX xpq: <http://www.wikidata.org/prop/qualifier/>
PREFIX xps: <http://www.wikidata.org/prop/statement/>
PREFIX xp: <http://www.wikidata.org/prop/>

select ?item  ?wditem ?itemLabel ?wid ?wditemlabel
where {
  ?item wdt:P17 wd:Q39.
  ?item wdt:P31 wd:Q5.
  optional {
    ?item wdt:P14 ?wid .
  }
  ?item rdfs:label ?itemLabel.   
  SERVICE <https://query.wikidata.org/sparql> {
    ?wditem xwdt:P27 xwd:Q258.
    ?wditem xwdt:P106 xwd:Q937857.
    ?wditem rdfs:label ?wditemlabel.
    filter(LANGMATCHES(LANG(?wditemlabel), "en")).
  }
  filter(contains(?wditemlabel, ?itemLabel))
}
group by ?item ?itemLabel ?wid ?wditem ?wditemlabel

The above works and matches items by their labels. However:

1) I initially had filter(contains(?wditemlabel, ?itemLabel)) inside the SERVICE clause and it returned no results. But it seemed to work if I used a static string for one of the variables (e.g. filter(contains("test string", ?itemLabel))). Why would it work when comparing a variable and a string but not two variables?

2) I expected the query to work without the "group by" at the end. But it looks like without it, some sort of cross join/Cartesian product occurs and each item that is matched is repeated the total number of times (n * n). What part of the query is causing this?


Solution

  • When executing a federated query, your local Blazegraph sends queries of this kind to Wikidata:

    SELECT ?wditem ?wditemlabel
    WHERE {
        ?wditem wdt:P27 wd:Q258.
        ?wditem wdt:P106 wd:Q937857.
        ?wditem rdfs:label ?wditemlabel.
        filter(LANGMATCHES(LANG(?wditemlabel), "en"))
        filter(contains(?wditemlabel, ?itemLabel))
    }
    VALUES () {
    ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )
    ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )
    ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )
    ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )
    ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )  ( ) ( ) ( ) ( ) ( )
    } # 100 values
    

    As you can see, Blazegraph "forgets" to pass local bindings of ?itemLabel into VALUES — probably because ?itemLabel does not occur in remote triple patterns — but "thinks" that they were passed.

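    This bug causes both of your problems:

    1. Try the above query on Wikidata (0 results): ?itemLabel is unbound there, so contains() raises an error and the filter rejects every row.
    2. Try the above query on Wikidata without contains (82800 results instead of 828): each of the 100 empty VALUES rows joins with every one of the 828 solutions.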
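
    The multiplication in point 2 can be reproduced in isolation. The following is a minimal sketch with made-up values, not the query Blazegraph actually sends. A trailing VALUES block with no variables and three empty rows is compatible with every inner solution, so on a SPARQL 1.1 endpoint it should yield six rows rather than two:

    SELECT ?x
    WHERE {
      VALUES ?x { "a" "b" }      # two solutions (illustrative values)
    }
    VALUES () { ( ) ( ) ( ) }    # three empty rows, each joins with every solution

    By the same arithmetic, 828 solutions joined with 100 empty rows give 82800.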

    Workarounds

    Force query execution order using hints:

    select ?item ?wditem ?itemLabel ?wditemlabel
    where {
      hint:Query hint:optimizer "None" .  # disable join reordering, so the SERVICE call runs first as written
      SERVICE <https://query.wikidata.org/sparql> {
        ?wditem wdt:P27 wd:Q258.
        ?wditem wdt:P106 wd:Q937857.
        ?wditem rdfs:label ?wditemlabel.
        filter(lang(?wditemlabel)= "en").
      } 
      ?item wdt:P17 wd:Q39.
      ?item wdt:P31 wd:Q5.
      ?item rdfs:label ?itemLabel.
      filter(contains(?wditemlabel, ?itemLabel))
    }
    

    or

    select ?item ?wditem ?itemLabel ?wditemlabel
    where {
      ?item wdt:P17 wd:Q39.
      ?item wdt:P31 wd:Q5.
      ?item rdfs:label ?itemLabel.
      SERVICE <https://query.wikidata.org/sparql> {
        ?wditem wdt:P27 wd:Q258.
        ?wditem wdt:P106 wd:Q937857.
        ?wditem rdfs:label ?wditemlabel.
        filter(lang(?wditemlabel)= "en").
      }
      hint:Prior hint:runFirst true .  # run the preceding SERVICE call before the local patterns
      filter(contains(?wditemlabel, ?itemLabel))
    }
    

    By the way, in your original query you could use DISTINCT instead of GROUP BY, or add local filtering, e.g. filter(lang(?itemLabel)='ast').
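
    For illustration, a sketch of the DISTINCT variant. It reuses the prefixes and property IDs from the question and assumes, as the question does, that wdt:, wd: and rdfs: are predefined on the local Wikibase. The optional ?wid part is left out for brevity, and the 'ast' language tag in the commented-out filter is only an example:

    PREFIX xwdt: <http://www.wikidata.org/prop/direct/>
    PREFIX xwd: <http://www.wikidata.org/entity/>

    select distinct ?item ?wditem ?itemLabel ?wditemlabel
    where {
      ?item wdt:P17 wd:Q39.
      ?item wdt:P31 wd:Q5.
      ?item rdfs:label ?itemLabel.
      # filter(lang(?itemLabel) = "ast")   # optional: keep one local label language only
      SERVICE <https://query.wikidata.org/sparql> {
        ?wditem xwdt:P27 xwd:Q258.
        ?wditem xwdt:P106 xwd:Q937857.
        ?wditem rdfs:label ?wditemlabel.
        filter(lang(?wditemlabel) = "en")
      }
      filter(contains(?wditemlabel, ?itemLabel))
    }

    DISTINCT only collapses the duplicated rows. It does not stop the remote endpoint from producing them, so the hint-based workarounds above are likely to be cheaper on large result sets.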

    Comparison

    In GraphDB, the original query works as is, except that contains(?wditemlabel, ?itemLabel) must be replaced with contains(str(?wditemlabel), str(?itemLabel)).
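
    The need for str() presumably comes from the SPARQL 1.1 argument compatibility rules: CONTAINS raises an error when both arguments carry language tags and the tags differ. A small sketch with invented literals (no data required) shows the difference on an engine that applies the rules strictly:

    SELECT ?strict ?lenient
    WHERE {
      # Different language tags: a strict engine raises an error here, leaving ?strict unbound.
      BIND(contains("Anna Meier"@en, "Meier"@de) AS ?strict)
      # str() strips the language tags, so this is a plain string comparison.
      BIND(contains(str("Anna Meier"@en), str("Meier"@de)) AS ?lenient)
    }

    Blazegraph appears to be more lenient here, which would explain why the original filter works there without str().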
