sparqlwikidata

Pull Wikidata companies with English websites


I'm trying to query businesses that have LinkedIn identifiers, ISINS, LEIs, or English websites. I only want one row of data per company if possible.

I'm struggling with the last part. I think I successfully identified that I only want English websites to populate, but the actual URLs are not populating in the result.

SELECT DISTINCT ?item ?itemLabel ?linkedin ?isin ?lei ?website WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  {
    SELECT DISTINCT ?item ?linkedin ?isin ?lei ?website WHERE {
      ?item wdt:P31 wd:Q4830453.
      {
      ?item wdt:P4264 ?linkedin.}
      UNION
      {
      ?item wdt:P946 ?isin.}
      UNION
      {
      ?item wdt:P1278 ?lei.}
      UNION
      {
      ?item wdt:P856 ?statement0.
      ?statement0 p:P856 [ps:P856 ?website ; pq:P407 wd:Q1860] }

    }
    LIMIT 100
  }
}

Result from above query

The other data is populating, but not the websites. I'm also getting multiple records per company.


Solution

  • For what concerns websites, note that

    ?item wdt:P856 ?statement0.
    ?statement0 p:P856 [ps:P856 ?website ; pq:P407 wd:Q1860]
    

    is wrong. Just use ?item p:P856 [ps:P856 ?website ; pq:P407 wd:Q1860] .. Check Wikidata:SPARQL tutorial § Qualifiers for how to query for qualifiers.

    Your query can be corrected by first selecting only the ?item having at least one of such statements (using UNION), and then using OPTIONAL for getting all and only the statements that are actually specified for each ?item.

    The resulting query would be something like:

    SELECT DISTINCT ?item ?itemLabel ?linkedin ?isin ?lei ?website WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      OPTIONAL { ?item wdt:P4264 ?linkedin . }
      OPTIONAL { ?item wdt:P946 ?isin . }
      OPTIONAL { ?item wdt:P1278 ?lei . }
      OPTIONAL { ?item p:P856 [ps:P856 ?website ; pq:P407 wd:Q1860] . }
      {
        SELECT DISTINCT ?item WHERE {
          ?item wdt:P31 wd:Q4830453 .
          { ?item wdt:P4264 ?linkedin . }
          UNION
          { ?item wdt:P946 ?isin . }
          UNION
          { ?item wdt:P1278 ?lei . }
          UNION
          { ?item p:P856 [ps:P856 ?website ; pq:P407 wd:Q1860] . }
          hint:SubQuery hint:runOnce true .
        }
        LIMIT 100
      }
      hint:Prior hint:runFirst true .
    }
    

    Note that:

    1. You will still obtain multiple rows per company in case, for the same property, more than one value has been specified. For instance, Apple has multiple websites. In this case, for obtaining just one row for Apple you either choose one website at random (or any criterium that you prefer) or you may aggregate them using GROUP_CONCAT.
    2. Using LIMIT 100 in your subquery you may end up with incomplete results (for example, the same company may have both P4264 and P946, but maybe only the first one occurs in the first 100 results). In order to fix this issue, you should move out the LIMIT command.