architecture · rdf · knowledge-graph · rdf4j · data-fabric

Data fabric architecture design - direct SPARQL access or API abstraction


In a data fabric architecture, is it better to give direct access to internal SPARQL databases or to use an API?

I am reviewing an architecture for a data fabric. The architecture has a collection of central knowledge graphs hosted behind SPARQL endpoints, which are abstracted by gRPC APIs. There is one API per product to keep clear fault lines for data ingestion. I want to suggest consolidating the API endpoints and exposing only two direct SPARQL endpoints, one for read and one for write, so that all products get read-write access as they see fit.

Any ideas or advice would be cool.


Solution

  • This is a great question and one that has multiple approaches. I completely agree with having separate endpoints for read and write; command/query separation is definitely a good thing.

    FOR WRITES

    To take this further, you might want to make the write endpoints of two varieties.

    1. REST/HTTP based with constraints (but not following SPARQL protocols)
    2. Named Graph based (following SPARQL Graph update protocols, with constraints).

    I'd be careful about exposing raw SPARQL Update alongside SPARQL queries.

    The former might ultimately resolve to the latter. But the key thing is to make use of Named Graphs since they will help with bounding and atomicity if you do decide to centralise.
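    As a rough sketch of the Named Graph variety, a constrained write endpoint could accept only updates scoped to a per-product graph. The graph IRIs and vocabulary below are placeholders, not anything from your architecture:

        # Hypothetical per-product write, bounded to a single named graph.
        PREFIX ex: <http://example.org/ns#>

        # Replace product A's slice: drop its graph, then reload it.
        DROP SILENT GRAPH <http://example.org/graphs/product-a> ;
        INSERT DATA {
          GRAPH <http://example.org/graphs/product-a> {
            ex:widget1 a ex:Widget ;
                       ex:label "Widget 1" .
          }
        }

    Because the write never leaves its named graph, the endpoint can validate and authorise it per product, and (depending on the store) the whole request can often be applied as a single transaction.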

    FOR READS

    FYI, SPARQL 1.1 supports federation, so the reads could still be decentralised and federated. Federation does have some limitations and can ultimately result in performance issues if concerns are badly separated.
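    For illustration, a federated read joining a local graph with one of the product endpoints might look like this (the endpoint URL and vocabulary are made up):

        # Hypothetical federated query: join local catalogue data with a
        # remote product endpoint via a SERVICE clause (SPARQL 1.1 Federated Query).
        PREFIX ex: <http://example.org/ns#>

        SELECT ?product ?orderId
        WHERE {
          ?product a ex:Product ;
                   ex:sku ?sku .
          SERVICE <http://orders.example.internal/sparql> {
            ?order ex:sku ?sku ;
                   ex:orderId ?orderId .
          }
        }

    The join across the SERVICE boundary is where the performance issues tend to show up, especially if the remote pattern is not selective.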

    I think it will ultimately come down to the size, shape and affinity of the data, correlated with the latency and frequency of change, along with access control considerations. The granularity of your access control might be simplified by more SPARQL services and named graphs.
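    For example, a thin read gateway could keep the SPARQL surface intact but inject a dataset clause per caller, so each product only sees the named graphs it is entitled to (graph IRIs are again placeholders):

        # Hypothetical caller-scoped read: the FROM NAMED clauses act as the
        # access-control boundary and would be added by the gateway, not the client.
        PREFIX ex: <http://example.org/ns#>

        SELECT ?s ?label
        FROM NAMED <http://example.org/graphs/product-a>
        FROM NAMED <http://example.org/graphs/shared>
        WHERE {
          GRAPH ?g {
            ?s ex:label ?label .
          }
        }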