I am trying to query the frequency of certain attributes in Wikidata, using SPARQL.
For example, to find out what the frequency of different values for gender is, I have the following query:
SELECT ?rid (COUNT(?rid) AS ?count)
WHERE { ?qid wdt:P21 ?rid.
BIND(wd:Q5 AS ?human)
?qid wdt:P31 ?human.
} GROUP BY ?rid
I get the following result:
wd:Q6581097 2752163
wd:Q6581072 562339
wd:Q1052281 223
wd:Q1097630 68
wd:Q2449503 67
wd:Q48270 36
wd:Q44148 8
wd:Q43445 4
t152990852 1
t152990762 1
t152990752 1
t152990635 1
t152775383 1
t152775370 1
t152775368 1
...
I have the following questions regarding this:
What do the t152... values refer to?
Is there any way to exclude the t152... values? I tried FILTER ( !strstarts(str(?rid), "wd:") ) but it timed out.
I also tried SELECT (COUNT(DISTINCT ?rid) AS ?count) with the above query, but again it timed out.
Values starting with t are "skolemized" unknown values (see, e.g., Q2423351 for a person of unknown sex or gender).
To improve performance, I suggest dividing your query into three parts:
All "normal" genders:
SELECT ?rid (COUNT(?qid) AS ?count)
WHERE {
?qid wdt:P31 wd:Q5.
?qid wdt:P21 ?rid.
?rid wdt:P31 wd:Q48264
} GROUP BY ?rid ORDER BY DESC(?count)
Please note that, according to Wikidata, wd:Q746411 is a subclass of wd:Q48270, etc.
All "non-normal" genders:
SELECT ?rid (COUNT(?qid) AS ?count)
WHERE {
?qid wdt:P31 wd:Q5.
?qid wdt:P21 ?rid.
FILTER (?rid NOT IN
(
wd:Q6581097,
wd:Q6581072,
wd:Q1052281,
wd:Q2449503,
wd:Q48270,
wd:Q746411,
wd:Q189125,
wd:Q1399232,
wd:Q3277905
)
).
FILTER (isURI(?rid))
} GROUP BY ?rid ORDER BY DESC(?count)
I do not use FILTER NOT EXISTS { ?rid wdt:P31 wd:Q48264 } here for performance reasons.
All (i.e. 1) "unknown" genders:
SELECT (COUNT(?qid) AS ?count)
WHERE {
?qid wdt:P31 wd:Q5.
?qid wdt:P21 ?rid.
FILTER (!isURI(?rid))
}
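To illustrate how the three parts relate to each other, here is a minimal Python sketch that sorts the sample rows from the question into the same three groups. It is only an offline illustration and does not query Wikidata. The names `SAMPLE_ROWS`, `LISTED_GENDERS`, `classify` and `totals_by_part` are made up for this example. The "listed" group uses the values from the NOT IN filter as a stand-in for the first query's `?rid wdt:P31 wd:Q48264` pattern, and the "wd:" prefix test mirrors how the values are displayed in the question, not how the endpoint stores them.

```python
# Sample rows copied from the result listing in the question.
# The listing is truncated there ("..."), so the totals cover only these rows.
SAMPLE_ROWS = [
    ("wd:Q6581097", 2752163),
    ("wd:Q6581072", 562339),
    ("wd:Q1052281", 223),
    ("wd:Q1097630", 68),
    ("wd:Q2449503", 67),
    ("wd:Q48270", 36),
    ("wd:Q44148", 8),
    ("wd:Q43445", 4),
    ("t152990852", 1),
    ("t152990762", 1),
    ("t152990752", 1),
    ("t152990635", 1),
    ("t152775383", 1),
    ("t152775370", 1),
    ("t152775368", 1),
]

# Values excluded by the NOT IN filter of the second query (assumption:
# used here as an approximation of what the first query matches).
LISTED_GENDERS = {
    "wd:Q6581097", "wd:Q6581072", "wd:Q1052281",
    "wd:Q2449503", "wd:Q48270", "wd:Q746411",
    "wd:Q189125", "wd:Q1399232", "wd:Q3277905",
}


def classify(value):
    """Return which of the three parts a displayed ?rid value falls into."""
    if not value.startswith("wd:"):
        return "unknown"  # displayed as t152..., i.e. an unknown value
    return "listed" if value in LISTED_GENDERS else "other"


def totals_by_part(rows):
    """Sum the counts of (value, count) rows per part."""
    totals = {"listed": 0, "other": 0, "unknown": 0}
    for value, count in rows:
        totals[classify(value)] += count
    return totals


if __name__ == "__main__":
    print(totals_by_part(SAMPLE_ROWS))
```

Each row lands in exactly one group, which is why the three queries together cover the same items as the original single query.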
In your case it makes little difference whether you count the wd:Q5 instances with or without DISTINCT, but counting without DISTINCT is preferable for performance reasons.
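A small Python sketch can show why the two counts are likely to agree here. It assumes the data behaves like a set of triples, so each (item, gender) pair can match only once per group. The triples and the item names (`item_a` and so on) are hypothetical placeholders, and `solutions`, `count_plain` and `count_distinct` are names invented for this illustration.

```python
from collections import Counter

# Hypothetical triples; a set, so no triple can occur twice.
TRIPLES = {
    ("item_a", "wdt:P31", "wd:Q5"),
    ("item_a", "wdt:P21", "wd:Q6581097"),
    ("item_b", "wdt:P31", "wd:Q5"),
    ("item_b", "wdt:P21", "wd:Q6581072"),
    ("item_c", "wdt:P31", "wd:Q5"),
    ("item_c", "wdt:P21", "wd:Q6581097"),
    ("item_d", "wdt:P21", "wd:Q6581097"),  # not a human; dropped by the join
}


def solutions(triples):
    """Emulate the join: (?qid, ?rid) pairs for items that are instances of wd:Q5."""
    humans = {s for s, p, o in triples if p == "wdt:P31" and o == "wd:Q5"}
    return [(s, o) for s, p, o in triples if p == "wdt:P21" and s in humans]


def count_plain(rows):
    """Emulate COUNT(?qid) grouped by ?rid."""
    return Counter(rid for _, rid in rows)


def count_distinct(rows):
    """Emulate COUNT(DISTINCT ?qid) grouped by ?rid."""
    return Counter(rid for _, rid in set(rows))


if __name__ == "__main__":
    rows = solutions(TRIPLES)
    print(count_plain(rows) == count_distinct(rows))
```

Under that assumption DISTINCT only adds deduplication work without changing the result, which is consistent with preferring the plain count.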