sparql, rdf, dbpedia, virtuoso, rdflib

Weird behavior on LIMIT and OFFSET when querying DBPedia


I'm querying DBpedia's Virtuoso endpoint via RDFLib to get all entities of type dbo:Politician that have no other occupation. I noticed that paging through the results with an increasing OFFSET over the LIMIT (10000) does not return all of them:

from rdflib import Graph
from rdflib.plugins.sparql import prepareQuery
from rdflib.plugins.sparql.processor import SPARQLResult
from rdflib.plugins.sparql.results.jsonresults import JSONResultParser

def get_persons_for_occupation(occupation_URI):
    offset = 0
    limit = 10000 # DBPedia's Virtuoso SPARQL limit
    persons = []

    while True:
        g = Graph()

        try:
            query = """
                PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                PREFIX dbo: <http://dbpedia.org/ontology/>
                PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                PREFIX foaf: <http://xmlns.com/foaf/0.1/>
                PREFIX owl: <http://www.w3.org/2002/07/owl#>

                SELECT *
                WHERE {
                    SERVICE <https://dbpedia.org/sparql> {
                        SELECT ?person_with_occupation ?wikipedia_URL ?wikidata_URI
                        WHERE {
                            ?person_with_occupation rdf:type/rdfs:subClassOf* %s.
                            # TODO for debugging
                            # FILTER(regex(?person_with_occupation, "Trump"))

                            # Discard persons with an occupation (class) different than ours
                            FILTER NOT EXISTS {
                                ?person_with_occupation a ?other_occupation.
                                # That is not the occupation itself
                                FILTER ((?other_occupation != %s)
                                        &&
                                        # That is not a subclass of ours (* allows for indirect subclasses through the type hierarchy)
                                        NOT EXISTS { ?other_occupation rdfs:subClassOf* %s }
                                        &&
                                        # And that is a subclass of dbo:Person
                                        EXISTS { ?other_occupation rdfs:subClassOf dbo:Person })
                            }

                            # They have a Wikipedia article
                            ?person_with_occupation foaf:isPrimaryTopicOf ?wikipedia_URL.

                            # And also an equivalent URI in Wikidata (in order to get its PageRank)
                            ?person_with_occupation owl:sameAs ?wikidata_URI.
                            FILTER (STRSTARTS(STR(?wikidata_URI), "http://www.wikidata.org"))
                        }
                        LIMIT 10000
                        OFFSET %s
                    }
                }
                """ % (occupation_URI, occupation_URI, occupation_URI, offset)

            qres = g.query(prepareQuery(query))

        except SPARQLResult as e:
            # Received correct but partial results (on the final offset), 
            # we don't want it to be an exception
            if e.response.status_code == 206:
                qres = JSONResultParser().parse(e.response.content)
            else:
                raise

        n_results = len(qres)
        if n_results == 0:
            break

        # for row in qres:
        #     do stuff

        offset += limit

where `occupation_URI` is `"dbo:Politician"`.

When collecting all the results this way, I got 27792 entities, but a COUNT over the same pattern reports 74128. In particular, some entities, such as Donald Trump's, are not returned, yet if I FILTER for one of them, it is returned. Is there a hard limit that I don't know of?


Solution

  • This is probably caused by the Anytime Queries feature (or odd behavior, or bug) that lets Virtuoso return incomplete results without telling the client in a standards-compliant way. That even happens inside aggregations, which might explain the varying COUNT results. (Details can be found in the long and fruitless discussion at openlink/virtuoso-opensource#112.) A client could recognize an incomplete result by checking for the HTTP response header X-SQL-State: S1TAT. (But which client already does that?)
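    Such a client-side check might look like the following minimal sketch (the header name and value come from the discussion linked above; the helper name `is_partial_result` is my own, and it assumes `headers` is a plain mapping of response header names to values):

    ```python
    def is_partial_result(headers):
        """Return True if Virtuoso flagged the result as truncated.

        Virtuoso's Anytime Queries feature marks results it cut short with
        the response header `X-SQL-State: S1TAT`. HTTP header names are
        case-insensitive, so normalize them before comparing.
        """
        normalized = {name.lower(): value for name, value in headers.items()}
        return normalized.get("x-sql-state", "").strip() == "S1TAT"
    ```

    With the `requests` library, `resp.headers` is already a case-insensitive mapping, so you could pass it to this helper directly after each page of results and retry or log when it returns True.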

    In your case, I would simply change the last line of your code to increase the offset by the number of rows (bindings) actually received:

            offset += n_results
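
    Put together, the paging loop then advances by whatever the server actually returned, so a truncated page no longer skips rows. A minimal sketch of that pattern (the `run_query(offset, limit)` callable is hypothetical and stands in for your SPARQL call):

    ```python
    def collect_all(run_query, limit=10000):
        """Collect every row from a paged query source.

        `run_query(offset, limit)` may return fewer than `limit` rows even
        before the end of the result set (as Virtuoso's anytime queries do),
        so advance the offset by the rows actually received, not by `limit`.
        An empty page signals the end.
        """
        offset, rows = 0, []
        while True:
            batch = run_query(offset, limit)
            if not batch:
                break
            rows.extend(batch)
            offset += len(batch)
        return rows
    ```

    Note that without an ORDER BY in the query, there is still no guarantee that the server returns rows in a stable order across requests, so adding one makes the paging more robust.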