I need to provide isolation between similar triples in different graphs (collections) in MarkLogic. For this to work I have to specify which graph I want the triples to be retrieved from, and my approach is this:
cts:triples(
  (),
  sem:iri("http://something/predicate#somepredicate"), "SomeObject", (), (),
  cts:collection-query("someCollection"))
This works, but it performs poorly because of the collection-query. Is there a better way to limit the results to only those in a given graph?
I tried to create a test case for this, using 7.0-4 on my laptop. It seems pretty fast to me: take a look and see where it's different from what you're doing. My guess is that your query returns many triples, and that's the bottleneck. Matching triples is very fast, but returning large numbers of them can be relatively slow.
First let's use taskbot to generate some triples.
(: insert test documents with taskbot :)
import module namespace tb="ns://blakeley.com/taskbot"
  at "src/taskbot.xqm" ;
import module namespace sem="http://marklogic.com/semantics"
  at "MarkLogic/semantics.xqy";
tb:list-segment-process(
  (: Total size of the job. :)
  1 to 1000 * 1000,
  (: Size of each segment of work. :)
  500,
  (: Label. :)
  "test/triples",
  (: This anonymous function will be called for each segment. :)
  function($list as item()+, $opts as map:map?) {
    (: Any chainsaw should have a safety. Check it here. :)
    tb:maybe-fatal(),
    let $triples := $list ! sem:triple(
      sem:iri("subject"||xdmp:random()),
      sem:iri("predicate"||xdmp:random(19)),
      "object"||xdmp:random(49),
      sem:iri('graph'||xdmp:random(9)))
    return sem:rdf-insert($triples)
    ,
    (: This is an update, so be sure to commit each segment. :)
    xdmp:commit() },
  (: options - not used in this example. :)
  map:new(map:entry('testing', '123...')),
  (: This is an update, so be sure to say so. :)
  $tb:OPTIONS-UPDATE)
Now, taskbot does most of the work on the Task Server, so monitor ErrorLog.txt or just wait for the CPU to go down and the triple count to hit 1M. After that, let's see what we loaded:
count(cts:triples()),
count(cts:triples((), sem:iri("predicate0"))),
count(cts:triples((), (), "object0")),
count(
  cts:triples((), (), (), (), (), cts:collection-query("graph0")))
=>
1000000
49977
19809
100263
You might get different counts for the predicate, object, and collection: remember that the data was generated randomly. Now let's try a query that combines all three.
count(
  cts:triples(
    (), sem:iri("predicate0"), "object0",
    (), (), cts:collection-query("graph0")))
, xdmp:elapsed-time()
Results:
100
PT0.004991S
That seems pretty fast to me: about 5 ms. You might get a different count because the data was generated randomly, but it should be close to 100.
Now, a larger result set will slow this down. For example:
count(
  cts:triples(
    (), (), (),
    (), (), cts:collection-query("graph0")))
, xdmp:elapsed-time()
=>
100263
PT0.371252S
count(cts:triples())
, xdmp:elapsed-time()
=>
1000000
PT2.906235S
count(cts:triples()[1 to 1000])
, xdmp:elapsed-time()
=>
1000
PT0.002707S
As you can see, the response time is roughly O(n) in the number of triples returned: in these runs it works out to about 3 microseconds per triple (between 2.7 and 3.7). Actually it's a little better than O(n), but in that ballpark. In any case the cts:collection-query doesn't look like the problem.
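If the cost really does track the number of triples returned, one way to keep response times down is to fetch a bounded page of results rather than the whole graph. Here is a minimal sketch of that idea, reusing the range predicate from the last test above. The $START and $PAGE-SIZE variables are hypothetical names introduced for illustration, and "predicate0" and "graph0" are just the test values generated earlier. Only the first page was timed above, so treat the behavior of later pages as unmeasured.

```xquery
xquery version "1.0-ml";

(: Hypothetical paging parameters, not part of the original test. :)
declare variable $START as xs:integer := 1;
declare variable $PAGE-SIZE as xs:integer := 1000;

(: Match on predicate and graph, but keep only one page of triples.
   The range predicate has the same form as cts:triples()[1 to 1000]. :)
let $page :=
  cts:triples(
    (), sem:iri("predicate0"), (),
    (), (), cts:collection-query("graph0")
  )[$START to $START + $PAGE-SIZE - 1]
return (count($page), xdmp:elapsed-time())
```

To walk the rest of the graph, increase $START by $PAGE-SIZE on each request and stop when a page comes back with fewer than $PAGE-SIZE triples.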