I'm querying DBpedia's Virtuoso endpoint via RDFLib to get all entities of type dbo:Politician that have no other occupation, and I have noticed that the results I get when running the query with increasing OFFSETs over the LIMIT (10000) don't contain all results:
    def get_persons_for_occupation(occupation_URI):
        offset = 0
        limit = 10000  # DBPedia's Virtuoso SPARQL limit
        persons = []
        while True:
            g = Graph()
            try:
                query = """
                    PREFIX dbo: <http://dbpedia.org/ontology/>
                    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
                    PREFIX owl: <http://www.w3.org/2002/07/owl#>
                    SELECT *
                    WHERE {
                        SERVICE <https://dbpedia.org/sparql> {
                            SELECT ?person_with_occupation ?wikipedia_URL ?wikidata_URI
                            WHERE {
                                ?person_with_occupation rdf:type/rdfs:subClassOf* %s.
                                # TODO for debugging
                                # FILTER(regex(?person_with_occupation, "Trump"))
                                # Discard persons with an occupation (class) different than ours
                                FILTER NOT EXISTS {
                                    ?person_with_occupation a ?other_occupation.
                                    # That is not the occupation itself
                                    FILTER ((?other_occupation != %s)
                                        &&
                                        # That is not a subclass of ours (* allows for indirect subclasses through the type hierarchy)
                                        NOT EXISTS { ?other_occupation rdfs:subClassOf* %s }
                                        &&
                                        # And that is a subclass of dbo:Person
                                        EXISTS { ?other_occupation rdfs:subClassOf dbo:Person })
                                }
                                # They have a Wikipedia article
                                ?person_with_occupation foaf:isPrimaryTopicOf ?wikipedia_URL.
                                # And also an equivalent URI in Wikidata (in order to get its PageRank)
                                ?person_with_occupation owl:sameAs ?wikidata_URI.
                                FILTER (STRSTARTS(STR(?wikidata_URI), "http://www.wikidata.org"))
                            }
                            LIMIT 10000
                            OFFSET %s
                        }
                    }
                """ % (occupation_URI, occupation_URI, occupation_URI, offset)
                qres = g.query(prepareQuery(query))
            except SPARQLResult as e:
                # Received correct but partial results (on the final offset),
                # we don't want it to be an exception
                if e.response.status_code == 206:
                    qres = JSONResultParser().parse(e.response.content)
                else:
                    raise
            n_results = len(qres)
            if n_results == 0:
                break
            # for row in qres:
            #     do stuff
            offset += limit
where occupation_URI = "dbo:Politician"
When collecting all results, I noticed that I got 27792 entities, but there are 74128 of them if I ask for a COUNT (in particular, some entities such as Donald Trump's are not returned, but if I FILTER for it, it is returned). Is there a hard limit that I don't know of?
This is probably caused by the Anytime Queries feature / odd behavior / bug that lets Virtuoso return incomplete results without telling the client in a standards-compliant way. That even happens inside aggregations, which might explain varying COUNT results. (Details can be found in the long and fruitless discussion at openlink/virtuoso-opensource#112.) A client could recognize the incomplete result by checking for the HTTP response header X-SQL-State: S1TAT. (But which client already does that?)
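If you want to do that check yourself, here is a minimal sketch using only the standard library. It assumes the endpoint sets the header exactly as described above; the helper names `is_partial_result` and `run_select` are made up for illustration and are not part of RDFLib or Virtuoso.

```python
import json
import urllib.parse
import urllib.request
from typing import Mapping, Tuple

# Value of X-SQL-State that marks an incomplete (timed-out) result,
# as described in the paragraph above.
PARTIAL_RESULT_STATE = "S1TAT"


def is_partial_result(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers flag an incomplete result."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-sql-state" and value.strip() == PARTIAL_RESULT_STATE:
            return True
    return False


def run_select(endpoint: str, query: str) -> Tuple[dict, bool]:
    """Hypothetical helper: send a SELECT query, return (JSON results, partial flag)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        partial = is_partial_result(response.headers)
        return json.load(response), partial
```

With this, a caller can at least log or retry a page whose flag comes back `True` instead of silently treating it as complete.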
In your case, I would simply change the last line of your code to increase the offset by the actual number of rows (bindings) received:
offset += n_results
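Put into a loop, the idea looks roughly like this. It is only a sketch: `fetch_page` is a hypothetical callable standing in for whatever runs your query with a given OFFSET and LIMIT, and it assumes the endpoint returns rows in a stable order between requests (SPARQL does not guarantee that without an ORDER BY).

```python
from typing import Any, Callable, List


def collect_all(
    fetch_page: Callable[[int, int], List[Any]], limit: int = 10000
) -> List[Any]:
    """Page through results, advancing by the rows actually received."""
    offset = 0
    rows: List[Any] = []
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break  # an empty page means there is nothing left
        rows.extend(page)
        # Advance by len(page), not by limit: a short (partial) page
        # must not cause the rows after it to be skipped.
        offset += len(page)
    return rows


# Usage with a fake endpoint that never returns more than 7 rows per request,
# imitating a server that truncates pages.
_data = list(range(25))


def _fake_fetch(offset: int, limit: int) -> List[Any]:
    return _data[offset:offset + min(limit, 7)]


all_rows = collect_all(_fake_fetch, limit=10)
```

With `offset += limit` the same fake endpoint would lose three rows after every truncated page, which is the kind of gap you are seeing.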