sparql rdf turtle-rdf rdf4j

SPARQL query formation


I have RDF data and I want to write a SPARQL query that fetches the records matching a particular organism name.

For context, I used RDF4J to generate the RDF records from the available JSON-LD data. I am having trouble fetching records that match a particular PropertyValue, for example all records whose organism is Equus caballus, or all records whose submission identifier is GSB-7331.

Any help is much appreciated.

The data records look like this:

@prefix schema: <http://schema.org/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix ebi-bsd: <https://www.ebi.ac.uk/biosamples/> .
@prefix biosamples: <http://identifiers.org/biosample/> .

biosamples:SAMEA104496657 a schema:DataRecord ;
schema:dateCreated "0002-10-15T00:00:00Z"^^schema:Date ;
schema:dateModified "2019-07-23T18:33:14.867Z"^^schema:Date ;
schema:identifier "SAMEA104496657" ;
schema:isPartOf ebi-bsd:samples ;
schema:mainEntity _:b0 .

ebi-bsd:samples a schema:Dataset .

_:b0 a schema:Sample , obo:OBI_0000747 ;
schema:additionalProperty _:b1 , _:b2 , _:b3 , _:b4 ;
schema:description "Blood samples N123" ;
schema:identifier "SAMEA104496657" ;
schema:name "N123" ;
schema:sameAs biosamples:SAMEA104496657 .

_:b1 a schema:PropertyValue ;
schema:name "organism" ;
schema:value "Equus caballus" ;
schema:valueReference obo:NCBITaxon_9796 .

obo:NCBITaxon_9796 a schema:DefinedTerm .

_:b2 a schema:PropertyValue ;
schema:name "submission description" ;
schema:value "ELOAD_294_samples" .

_:b3 a schema:PropertyValue ;
schema:name "submission identifier" ;
schema:value "GSB-7331" .

_:b4 a schema:PropertyValue ;
schema:name "submission title" ;
schema:value "ELOAD_294" .
@prefix schema: <http://schema.org/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix ebi-bsd: <https://www.ebi.ac.uk/biosamples/> .
@prefix biosamples: <http://identifiers.org/biosample/> .

biosamples:SAMEA104625758 a schema:DataRecord ;
schema:dateCreated "0014-06-07T00:00:00Z"^^schema:Date ;
schema:dateModified "2019-08-06T17:46:01.812Z"^^schema:Date ;
schema:identifier "SAMEA104625758" ;
schema:isPartOf ebi-bsd:samples ;
schema:mainEntity _:b0 .

ebi-bsd:samples a schema:Dataset .

_:b0 a schema:Sample , obo:OBI_0000747 ;
schema:additionalProperty _:b1 , _:b2 , _:b3 ;
schema:description "Colorectal Cancer Tumor Sequenced Samaple" ;
schema:identifier "SAMEA104625758" ;
schema:name "P-0009062-T01-IM5" ;
schema:sameAs biosamples:SAMEA104625758 ;
schema:subjectOf "http://www.ebi.ac.uk/ena/data/view/SAMEA104625758" .

_:b1 a schema:PropertyValue ;
schema:name "common name" ;
schema:value "Human" ;
schema:valueReference obo:NCBITaxon_9606 .

obo:NCBITaxon_9606 a schema:DefinedTerm .

_:b2 a schema:PropertyValue ;
schema:name "organism" ;
schema:value "Homo sapiens" ;
schema:valueReference obo:NCBITaxon_9606 .

_:b3 a schema:PropertyValue ;
schema:name "scientific name" ;
schema:value "Homo sapiens" ;
schema:valueReference obo:NCBITaxon_9606 .

The code I use to generate the RDF Turtle data is below. I download the sample data as JSON-LD from https://www.ebi.ac.uk/biosamples/samples/SAMN03177689.ldjson

import org.apache.commons.io.FileUtils;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFHandlerException;
import org.eclipse.rdf4j.rio.RDFParser;
import org.eclipse.rdf4j.rio.Rio;
import org.eclipse.rdf4j.rio.helpers.StatementCollector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.Scanner;
import java.util.concurrent.Callable;

public class BioSchemasRdfGenerator implements Callable<Void> {
    private Logger log = LoggerFactory.getLogger(getClass());
    private static File file;
    private static long sampleCount = 0;
    private final URL url;

    public static void setFilePath(String filePath) {
        file = new File(filePath);
    }

    BioSchemasRdfGenerator(final URL url) {
        log.info("HANDLING " + url.toString() + " and the current sample count is: " + ++sampleCount);

        this.url = url;
    }

    @Override
    public Void call() throws Exception {
        requestHTTPAndHandle(this.url);

        return null;
    }

    private static void requestHTTPAndHandle(final URL url) throws Exception {
        final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        int response;

        try {
            conn.setRequestMethod("GET");
            conn.connect();
            response = conn.getResponseCode();

            if (response == 200) {
                handleSuccessResponses(url);
            }
        } catch (final Exception e) {
            throw new RuntimeException(e);
        } finally {
            conn.disconnect();
        }
    }

    private static void handleSuccessResponses(final URL url) {
        try (Scanner sc = new Scanner(url.openStream())) {
            final StringBuilder sb = new StringBuilder();

            while (sc.hasNext()) {
                sb.append(sc.nextLine());
            }

            try (InputStream in = new ByteArrayInputStream(sb.toString().getBytes(StandardCharsets.UTF_8))) {
                String dataAsRdf = readRdfToString(in);

                write(dataAsRdf);
            } catch (final Exception e) {
                throw new RuntimeException(e);
            }
        } catch (final Exception e) {
            throw new RuntimeException(e);
        }
    }

    @SuppressWarnings(value = "deprecation")
    private static void write(final String sampleData) throws Exception {
        FileUtils.writeStringToFile(file, sampleData, true);
    }

    /**
     * @param in a rdf input stream
     * @return a string representation
     */
    private static String readRdfToString(final InputStream in) {
        return graphToString(readRdfToGraph(in));
    }

    /**
     * @param inputStream an Input stream containing rdf data
     * @return a Graph representing the rdf in the input stream
     */
    private static Collection<Statement> readRdfToGraph(final InputStream inputStream) {
        try {
            final RDFParser rdfParser = Rio.createParser(RDFFormat.JSONLD);
            final StatementCollector collector = new StatementCollector();

            rdfParser.setRDFHandler(collector);
            rdfParser.parse(inputStream, "");

            return collector.getStatements();
        } catch (final Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Transforms a graph to a string.
     *
     * @param myGraph a sesame rdf graph
     * @return a rdf string
     */
    private static String graphToString(final Collection<Statement> myGraph) {
        final StringWriter out = new StringWriter();
        final TurtleWriterCustom turtleWriterCustom = new TurtleWriterCustom(out);

        return modifyIdentifier(writeRdfInTurtleFormat(myGraph, out, turtleWriterCustom));
    }

    private static String modifyIdentifier(String rdfString) {
        if (rdfString != null)
            rdfString = rdfString.replaceAll("biosample:", "");

        return rdfString;
    }

    private static String writeRdfInTurtleFormat(Collection<Statement> myGraph, StringWriter out, TurtleWriterCustom writer) {
        try {
            writer.startRDF();
            handleNamespaces(writer);

            for (Statement st : myGraph) {
                writer.handleStatement(st);
                //below line is commented: for short RDF
                //writer.writeValue(st.getObject(),O true);
            }

            writer.endRDF();
        } catch (final RDFHandlerException e) {
            throw new RuntimeException(e);
        }

        return out.getBuffer().toString();
    }

    private static void handleNamespaces(final TurtleWriterCustom writer) {
        writer.handleNamespace("schema", "http://schema.org/");
        writer.handleNamespace("obo", "http://purl.obolibrary.org/obo/");
        writer.handleNamespace("ebi-bsd", "https://www.ebi.ac.uk/biosamples/");
        writer.handleNamespace("biosamples", "http://identifiers.org/biosample/");
    }
}

Solution

  • Your code looks an awful lot more complex than it needs to be. To load the JSON-LD file on the remote URL as an RDF model using RDF4J, you can simply do this:

    String file = "https://www.ebi.ac.uk/biosamples/samples/SAMN03177689.ldjson";
    Model m; // declared outside the try block so it can be used afterwards
    try (InputStream input = new URL(file).openStream()) {
        m = Rio.parse(input, file, RDFFormat.JSONLD);
    }
    

    If you then wish to write this model in Turtle syntax, all you have to do is this:

    // replace System.out with your own outputstream if you want to write to file
    Rio.write(m, System.out, RDFFormat.TURTLE); 
    

    If I run this on your sample file, I get this:

    @prefix biosample: <http://identifiers.org/biosample/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix schema: <http://schema.org/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    
    biosample:SAMN03177689 a schema:DataRecord;
      schema:dateCreated "2014-12-12T06:54:48.957Z"^^schema:Date;
      schema:dateModified "2019-03-13T09:41:33.81Z"^^schema:Date;
      schema:identifier "biosample:SAMN03177689";
      schema:isPartOf <https://www.ebi.ac.uk/biosamples/samples>;
      schema:mainEntity <https://www.ebi.ac.uk/biosamples/samples/SAMN03177689> .
    
    <https://www.ebi.ac.uk/biosamples/samples> a schema:Dataset .
    
    <https://www.ebi.ac.uk/biosamples/samples/SAMN03177689> a <http://purl.obolibrary.org/obo/OBI_0000747>,
        schema:Sample;
      schema:additionalProperty _:genid-2e6f9d5c4cc34db8b5ab6e72e7857e31-b0 .
    
    _:genid-2e6f9d5c4cc34db8b5ab6e72e7857e31-b0 a schema:PropertyValue;
      schema:name "INSDC center name";
      schema:value "FDA" .
    
      [snip]
    

    Note that the instance of schema:Sample here has an actual IRI as its identifier, not a blank node: <https://www.ebi.ac.uk/biosamples/samples/SAMN03177689>.

    There are some odd things going on in your code. First of all, there is the method modifyIdentifier. For some reason it replaces all occurrences of biosample: with the empty string. I'm not sure why you'd want to do that (manipulating the serialized string data in this fashion seems a bad idea). It also does so in a way that makes the output invalid Turtle syntax. If, in the above example, you replaced biosample: with the empty string, you'd get, on line 1:

    @prefix <http://identifiers.org/biosample/> .
    

    which is not a valid prefix definition (the prefix name, including the colon that must follow @prefix, is gone). And further down, you'd have

    SAMN03177689 a schema:DataRecord;
    

    This is not a valid IRI reference.
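
    To make the effect of that blanket replacement concrete, here is a small standalone sketch. The class name PrefixStripDemo is made up for illustration; the three input lines are taken from the Turtle output above, and the replacement is the same replaceAll call that modifyIdentifier uses.

```java
// Illustration only: PrefixStripDemo is a hypothetical class name, and the
// sample lines are copied from the Turtle output shown above.
final class PrefixStripDemo {

    // The same blanket replacement that modifyIdentifier performs.
    static String strip(String turtle) {
        return turtle.replaceAll("biosample:", "");
    }

    public static void main(String[] args) {
        String[] lines = {
            "@prefix biosample: <http://identifiers.org/biosample/> .",
            "biosample:SAMN03177689 a schema:DataRecord;",
            "  schema:identifier \"biosample:SAMN03177689\";"
        };
        // Print each line after the replacement to see what is left of it.
        for (String line : lines) {
            System.out.println(strip(line));
        }
    }
}
```

    Running it shows the prefix declaration losing its name, the subject turning into a bare SAMN03177689, and the identifier literal being shortened. A prefix spelled biosamples: (as declared in handleNamespaces in the question) does not contain the substring biosample:, so in that case only the identifier literal would change; whatever TurtleWriterCustom does on top of that is not shown.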

    Then there is this TurtleWriterCustom class. You don't show the code for that class, but given its name I suspect it's trying to do some further non-standard customizing of the output, and in doing so messes up your sample identifiers, somehow replacing them with (identical) blank nodes.

    To be honest I'm not even sure why you are converting from JSON-LD to Turtle at all, because if your goal is to load this data into an RDF database so you can do SPARQL queries, you can just load the JSON-LD file directly:

    Repository repo = ...; // your RDF4J database
    // file is the URL string from the first snippet; the stream is reopened here
    try (RepositoryConnection conn = repo.getConnection();
         InputStream input = new URL(file).openStream()) {
         conn.add(input, file, RDFFormat.JSONLD);
    
         // data added to database - you can now query. 
         String query = "prefix schema: <http://schema.org/> "
                + "select ?r {?r a schema:DataRecord ; "
                + "schema:mainEntity [schema:additionalProperty [schema:name \"organism\" ; schema:value \"Escherichia coli\"] ] }";
    
         conn.prepareTupleQuery(query).evaluate().forEach(bs -> System.out.println(bs));
    }
    

    result:

    [r=http://identifiers.org/biosample/SAMN03177689]
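
    For the two lookups asked about in the question (organism Equus caballus, submission identifier GSB-7331), the same graph pattern can be reused with a different name/value pair. The helper below is only a sketch: PropertyQueryBuilder, build and escape are hypothetical names, and it assumes the names and values are stored as plain string literals, as in the sample data.

```java
// Sketch only: PropertyQueryBuilder and its methods are hypothetical helpers,
// not part of RDF4J. They only assemble a SPARQL query string.
final class PropertyQueryBuilder {

    // Escape backslashes and double quotes so the text can sit inside a
    // double-quoted SPARQL string literal (other control characters are not handled).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build a query selecting every DataRecord whose main entity has an
    // additionalProperty with the given schema:name and schema:value.
    static String build(String name, String value) {
        return "PREFIX schema: <http://schema.org/>\n"
             + "SELECT ?r WHERE {\n"
             + "  ?r a schema:DataRecord ;\n"
             + "     schema:mainEntity ?sample .\n"
             + "  ?sample schema:additionalProperty ?p .\n"
             + "  ?p schema:name \"" + escape(name) + "\" ;\n"
             + "     schema:value \"" + escape(value) + "\" .\n"
             + "}";
    }

    public static void main(String[] args) {
        System.out.println(build("organism", "Equus caballus"));
        System.out.println(build("submission identifier", "GSB-7331"));
    }
}
```

    The resulting string can be passed to conn.prepareTupleQuery(...) in the same way as the query above. The pattern uses explicit variables (?sample, ?p) instead of the [ ... ] shorthand, which is equivalent here and makes it easy to add ?sample or ?p to the SELECT clause if you also want those in the result.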