Selecting triples from a specific graph in MarkLogic 7


I need to provide isolation between similar triples in different graphs (collections) in MarkLogic. For this to work I have to specify which graph I want the triples to be retrieved from, and my approach is this:

cts:triples(
  (),
  sem:iri("http://something/predicate#somepredicate"), "SomeObject", (), (),
  cts:collection-query("someCollection") )  

This works, but it performs poorly because of the collection-query. Is there a better way to limit results to only those in a given graph?


Solution

  • I created a test case for this using MarkLogic 7.0-4 on my laptop, and it seems pretty fast to me. Take a look and see where it differs from what you're doing. My guess is that your query returns many triples, and that's the bottleneck: matching triples is very fast, but returning large numbers of them can be relatively slow.

    First let's use taskbot to generate some triples.

    (: insert test documents with taskbot :)
    import module namespace tb="ns://blakeley.com/taskbot"
      at "src/taskbot.xqm" ;
    import module namespace sem="http://marklogic.com/semantics" 
      at "MarkLogic/semantics.xqy";
    
    tb:list-segment-process(
      (: Total size of the job. :)
      1 to 1000 * 1000,
      (: Size of each segment of work. :)
      500,
      (: Label. :)
      "test/triples",
      (: This anonymous function will be called for each segment. :)
      function($list as item()+, $opts as map:map?) {
        (: Any chainsaw should have a safety. Check it here. :)
        tb:maybe-fatal(),
        let $triples := $list ! sem:triple(
          sem:iri("subject"||xdmp:random()),
          sem:iri("predicate"||xdmp:random(19)),
          "object"||xdmp:random(49),
          sem:iri('graph'||xdmp:random(9)))
        return sem:rdf-insert($triples)
        ,
        (: This is an update, so be sure to commit each segment. :)
        xdmp:commit() },
      (: options - not used in this example. :)
      map:new(map:entry('testing', '123...')),
      (: This is an update, so be sure to say so. :)
      $tb:OPTIONS-UPDATE)
    

    Taskbot does most of the work on the Task Server, so monitor ErrorLog.txt or just wait for CPU usage to drop and the triple count to reach 1M. After that, let's see what we loaded:

    count(cts:triples()),
    count(cts:triples((), sem:iri("predicate0"))),
    count(cts:triples((), (), "object0")),
    count(
      cts:triples((), (), (), (), (), cts:collection-query("graph0")))
    =>
    1000000
    49977
    19809
    100263
    

    You might get different counts for the predicate, object, and collection, because the data was generated randomly. Now let's try a query.

    count(
      cts:triples(
        (), sem:iri("predicate0"), "object0",
        (), (), cts:collection-query("graph0")))
    , xdmp:elapsed-time()
    

    Results:

    100
    PT0.004991S
    

    That seems pretty fast to me: about 5 ms. You might get a different count because the data was generated randomly, but it should be close.
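
    To check whether the collection restriction itself adds measurable cost, one option is to run the same pattern with and without it in a single request. This is only a sketch against the randomly generated test data above. Note that xdmp:elapsed-time() reports time since the request started, so the second reading is cumulative: the cost of the restricted count is the difference between the two readings. Caches warmed by the first count can also flatter the second, so it is worth running more than once and swapping the order.

    ```xquery
    xquery version "1.0-ml";

    (: Same predicate/object pattern, with no graph restriction. :)
    count(cts:triples((), sem:iri("predicate0"), "object0")),
    xdmp:elapsed-time(),
    (: Same pattern, restricted to a single graph (collection). :)
    count(
      cts:triples(
        (), sem:iri("predicate0"), "object0",
        (), (), cts:collection-query("graph0"))),
    (: Cumulative time: subtract the first reading to get this step. :)
    xdmp:elapsed-time()
    ```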

    Now, a larger result set will slow this down. For example:

    count(
      cts:triples(
        (), (), (),
        (), (), cts:collection-query("graph0")))
    , xdmp:elapsed-time()
    =>
    100263
    PT0.371252S
    
    count(cts:triples())
    , xdmp:elapsed-time()
    =>
    1000000
    PT2.906235S
    
    count(cts:triples()[1 to 1000])
    , xdmp:elapsed-time()
    =>
    1000
    PT0.002707S
    

    As you can see, the response time is roughly O(n) in the number of triples returned. Actually it's a little better than O(n), but in that ballpark. In any case, the cts:collection-query doesn't look like the problem.
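
    Since the cost tracks the number of triples returned, one way to keep a graph-restricted query fast is to fetch only the slice you need. Here is a minimal sketch that applies the same positional predicate as the last timing above to a single graph. The variable names and page size are illustrative assumptions, and how this behaves for pages deep into a large result is not measured here.

    ```xquery
    xquery version "1.0-ml";

    (: Illustrative values: graph name, 1-based start position, page size. :)
    let $graph := "graph0"
    let $start := 1
    let $size := 1000
    return (
      (: Count only the requested slice of triples from one graph. :)
      count(
        cts:triples((), (), (), (), (), cts:collection-query($graph))
          [$start to $start + $size - 1]),
      xdmp:elapsed-time())
    ```

    To return the triples themselves rather than a count, drop the count() wrapper.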