marklogicmarklogic-10

XQuery Optimization in MarkLogic for Large Dataset Queries


I'm currently working on a project that involves querying large datasets in a MarkLogic database, and I've noticed that some of my XQuery scripts are running slower than expected, especially as the dataset grows. I'm looking for advice on how to optimize these queries to improve performance.

Here's an example of a query I'm using that retrieves all documents with a specific element value, but it seems to be particularly slow when the dataset is large. The database contains a mix of documents, and I'm interested in efficiently querying documents that have a <status> element with the value "active".

xquery version "1.0-ml";

declare namespace p = "http://example.com/project";

for $doc in /p:project[p:status = "active"]
return $doc

This query is straightforward, but as the number of documents grows, the performance degrades significantly. I suspect that the way I'm querying the documents could be improved, perhaps by better leveraging MarkLogic's indexing features or by restructuring the query itself.

Specifically, I'm looking for answers to the following questions:

  1. Is there a more efficient way to structure this XQuery to speed up the retrieval of documents based on the value of an element?
  2. Are there any MarkLogic settings or configurations that can be tuned to optimize query performance for large datasets?

Solution

  • As your database grows, you may find that you need to look into using cts:* (Core Text Search) functions or Optic API.

    For instance, your expression could be written as a cts:search() (which you could also use the for loop to iterate over the sequence, but if just returning $doc then there is no point for the for loop):

    xquery version "1.0-ml";
    declare namespace p = "http://example.com/project";
    cts:search(doc(), 
      cts:and-query((
        cts:document-root-query(xs:QName("p:project"),
        cts:element-value-query(xs:QName("p:status"), "active", "exact")
      )) 
    )
    

    The cts:element-value-query() will hit the term lists and subsequent calls can be faster if those items are cached in your List Tree Cache.

    Even faster would be if you had a range-index on the p:status element and then could use a range-query:

    xquery version "1.0-ml";
    declare namespace p = "http://example.com/project";
    cts:search(doc(), 
      cts:and-query((
        cts:document-root-query(xs:QName("p:project"),
        cts:element-range-query(xs:QName("p:status"), "=", "active")
      )) 
    )
    

    But if you are going to be returning a large set of documents, then you may find that more documents winds up taking more time (to retrieve and stream all of them back), and adding more indexes isn't going to help make that any faster.

    If you re-run the same query you may find that it's faster the second time if the documents are cached in the Compressed Tree Cache and Expanded Tree Cache, but if you attempt to return all docs from a growing set - you are more likely to hit Expanded Tree Cache errors.

    Instead, you might consider pagination and only returning a subset of the documents, or only returning information from the documents instead of the entire document (and returning from indexes would be even faster).

    When evaluating the performance of your queries, you can execute in Query Console and use the Profile Tab, use xdmp:plan() to inspect the query plan for a cts:search expression, and use xdmp:query-meters() to understand what sort of resources were involved in evaluating your query.