I'm trying to get a representation of the infobox of articles on Wikipedia in a Python project. I had tried using the Wikipedia API, but the data it outputs is dirty, so I'm trying to move to DBpedia. I need to be able to query by page name, and receive a dictionary of the property names and their values for that page. For example, for the query for London, the returned dictionary would contain:
{dbpedia-owl:PopulatedPlace/areaMetro : 8382.0,
dbpedia-owl:PopulatedPlace/areaTotal : 1572.0
.....
dbpedia-owl:populationDensity : 5285.0
.....
}
etc., and from this I would be able to read all the keys that were in the Infobox. I did try using the SPARQL query of
describe <http://dbpedia.org/resource/London>
but that returned tonnes of unnecessary data (the full set of triples associated with London), which is many orders of magnitude more than I need.
How can I write a query to just get the infobox properties, as above?
You might be able to get what you want by selecting properties and objects where the property IRI begins with something you're interested in (e.g., http://dbpedia.org/ontology/). You could use a query like the following. (It takes advantage of the fact that a prefix by itself, e.g., dbpedia-owl:, is still a legal IRI, so you can call str on it. You could also just use the string "http://dbpedia.org/ontology/" directly.)
select ?p ?o where {
dbpedia:London ?p ?o
filter strstarts(str(?p),str(dbpedia-owl:))
}
SPARQL results (HTML Table)
SPARQL results (JSON)
The JSON results aren't quite in the format you're looking for, but are like this:
{ "head": { "link": [], "vars": ["p", "o"] },
"results": { "distinct": false, "ordered": true, "bindings": [
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "http://mapoflondon.uvic.ca/" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "http://www.british-history.ac.uk/place.aspx?region=1" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "http://www.london.gov.uk/" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "http://www.museumoflondon.org.uk/" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "http://www.tfl.gov.uk/" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "http://www.visitlondon.com/" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "https://london.gov.uk/" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" } , "o": { "type": "uri", "value": "http://www.britishpathe.com/workspace.php?id=2449&delete_record=75105/" }},
{ "p": { "type": "uri", "value": "http://dbpedia.org/ontology/thumbnail" } , "o": { "type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Greater_London_collage_2013.png?width=300" }},
...
That sort of makes sense though, because there's not necessarily a unique value for each property, so a Python dict as in the question probably isn't the best result format (but it'd be easy to create one where multiple values are put into a list).
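As a minimal sketch of that idea, the bindings from the JSON results can be grouped into a dict that maps each property IRI to a list of its values. The function name bindings_to_dict is hypothetical, and the sample input is a shortened copy of the results shown above; it assumes the query selected the variables ?p and ?o.

```python
from collections import defaultdict


def bindings_to_dict(results):
    """Group SPARQL JSON bindings for ?p ?o into {property IRI: [values]}."""
    grouped = defaultdict(list)
    for binding in results["results"]["bindings"]:
        # Each binding maps a variable name to {"type": ..., "value": ...}.
        grouped[binding["p"]["value"]].append(binding["o"]["value"])
    return dict(grouped)


# Shortened copy of the JSON results above, as Python data.
sample = {
    "head": {"link": [], "vars": ["p", "o"]},
    "results": {"distinct": False, "ordered": True, "bindings": [
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink"},
         "o": {"type": "uri", "value": "http://www.london.gov.uk/"}},
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink"},
         "o": {"type": "uri", "value": "http://www.tfl.gov.uk/"}},
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/thumbnail"},
         "o": {"type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Greater_London_collage_2013.png?width=300"}},
    ]},
}

infobox = bindings_to_dict(sample)
for prop, values in infobox.items():
    print(prop, values)
```

Every value ends up in a list, even when a property has only one value, so the calling code never has to check whether it got a scalar or a list.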
Also note that the properties that begin with dbpedia-owl: are actually the DBpedia Ontology properties, which have much cleaner data than the raw infobox values; the raw values use properties beginning with dbpprop: instead. You can read more about the different datasets at 4.3. Infobox Data. A query for the raw properties would be pretty much the same though:
select ?p ?o where {
dbpedia:London ?p ?o
filter strstarts(str(?p),str(dbpprop:))
}
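Since the question asks about querying by page name from Python, here is a rough sketch of how either query could be built and sent using only the standard library. Several things here are assumptions rather than anything confirmed above: the helper names (build_query, build_url, fetch_bindings) are hypothetical, the endpoint is assumed to be http://dbpedia.org/sparql accepting query and format parameters, the page name is assumed to map to a resource IRI by replacing spaces with underscores, and dbpprop: is assumed to expand to http://dbpedia.org/property/. Full IRIs are written out so the query does not depend on prefixes predefined by the endpoint.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://dbpedia.org/sparql"        # assumed public endpoint
ONTOLOGY = "http://dbpedia.org/ontology/"     # dbpedia-owl: (cleaner data)
RAW_INFOBOX = "http://dbpedia.org/property/"  # dbpprop: (raw infobox data), assumed expansion


def build_query(page_name, namespace=ONTOLOGY):
    """Build the select query for one page, filtered by property namespace."""
    # Assumption: simple titles only; names with special characters
    # would need proper IRI escaping.
    resource = "http://dbpedia.org/resource/" + page_name.replace(" ", "_")
    return (
        "select ?p ?o where {\n"
        f"  <{resource}> ?p ?o\n"
        f'  filter strstarts(str(?p), "{namespace}")\n'
        "}"
    )


def build_url(page_name, namespace=ONTOLOGY):
    """Encode the query as a GET request URL asking for JSON results."""
    params = {
        "query": build_query(page_name, namespace),
        "format": "application/sparql-results+json",
    }
    return ENDPOINT + "?" + urlencode(params)


def fetch_bindings(page_name, namespace=ONTOLOGY):
    """Run the query over the network and return the list of bindings."""
    with urlopen(build_url(page_name, namespace)) as response:
        return json.load(response)["results"]["bindings"]


print(build_query("London"))
```

Calling fetch_bindings("London", RAW_INFOBOX) would request the raw infobox properties instead of the ontology ones; it needs network access, so it is not called here.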