marklogicmarklogic-8marklogic-9

Complex Xquery in MarkLogic for documents in multiple collections


We have huge data set in MarkLogic with documents spread across multiple collections. We are required to develop reports by searching document spread across these collections.

Sample data set:

    Collection1 - Inventory [Contains 5 million Inventory documents]

    URI: /inventory/inv1
        <xml>
            <INVENTORY>
                <ItemName>10</ItemName>
                <InventoryQuantity>100</InventoryQuantity>
                .
                .
            </INVENTORY>
        </xml>


    Collection2 - Item[Contains 1 million Item documents]

    URI: /item/item1
        <xml>
            <Item>
                <ItemName>10</ItemName>
                <ItemWmos>
                    <UnitPrice>895</UnitPrice>
                    .
                    .
                <ItemWmos>
                .
                .
            </Item>
        </xml>


          For each Inventory in the Inventory collection
            Step 1 : Get "ItemName", its "InventoryQuantity" and for the same "ItemName" find "UnitPrice" from Item document in Item collection.
            Step 2 : CurrentInventoryValue = InventoryQuantity * UnitPrice
            Step 3 : TotalInventoryValue = TotalInventoryValue + CurrentInventoryValue;
         Repeat;

I have achieved the above report requirement using XQuery below. However its taking lot of time to execute and showing timeout exception. I am not able to leverage Path Range Index here as I have to get "ItemName" and its "InventoryQuantity" together and search for the unit price in a different collection document having the same item name.

XQuery :

    sum(for $doc in cts:search(doc(), cts:and-query((cts:collection-query("Inventory"))))
        [(fn:string-length(//INVENTORY/ItemName/text()) > 0) and (fn:string-length(//INVENTORY/InventoryQuantity/text()) > 0)]
        let $itemName := $doc//INVENTORY/ItemName/text()
        let $inventoryQuantity := $doc//INVENTORY/InventoryQuantity/text()
    return (
        for $doc in cts:search(doc(), cts:and-query((cts:collection-query("Item"))))[//Item/ItemName/text()=$itemName 
        and (fn:string-length(//ItemWmos/UnitPrice/text()) > 0)]
        return ($inventoryQuantity * $doc//ItemWmos/UnitPrice/text())
    ))

How do we accomplish such complex query requirement in MarkLogic in efficient way ?


Solution

  • Band Aid Approach:

    You can probably get to much better performance by rewriting your XPath expressions to be absolute as well as transforming them into cts query logic. But even if it solves the immediate problem, it will not scale out much farther (because there is no limit to how many documents you may iterate over):

    (Note: I've not tested this code so syntax errors may exist)

    sum(for $doc in cts:search(doc(), cts:and-query((cts:collection-query("Inventory"))), "unfiltered")
            let $itemName := $doc/xml/INVENTORY/ItemName/string()
            let $inventoryQuantity := $doc/xml/INVENTORY/InventoryQuantity/string()
        where
          fn:string-length($itemName) gt 0 and fn:string-length($inventoryQuantity) gt 0
        return (
            for $doc in cts:search(doc(), cts:and-query((cts:collection-query("Item")), cts:elementValueQuery("ItemName", $itemName)), "unfiltered")
            return (xs:integer($inventoryQuantity) * xs:integer($doc/xml/Item/ItemWmos/UnitPrice/string()))
        ))
    

    Scalable Approach #1:

    First thing I would do is transform and denormalize the /inventory/ data you have in MarkLogic to the following format as it is ingested:

    <envelope>
      <headers xmlns="http://yoursite.com/item/headers"> <!-- Note the unique namespace -->
         <ItemName>10</ItemName>
         <InventoryQuantity>100</InventoryQuantity>
         <UnitPrice>85</UnitPrice>
         <ItemPrice>8500</ItemPrice>
      </headers>
      <attachments>
        <INVENTORY>
          <ItemName>10</ItemName>
          <InventoryQuantity>100</InventoryQuantity>
          .
          .
        </INVENTORY>
        <Item>
          <ItemName>10</ItemName>
          <ItemWmos>
            <UnitPrice>895</UnitPrice>
            .
            .
          </ItemWmos>
          .
          .
        </Item>
     </attachments>
    </envelope>
    

    Now, to run the report, all you need to do is put an ElementRangeIndex on <ItemPrice> and run cts:sum-aggregate .

    If you use the DataHub Framework you can more simply manage the processes used to pull in disjoint data and harmonize it.

    Scalable Approach #2:

    You can keep your data the way it is but you can place TDE Views on top of it which will index the data and allow you to use the Optic API to run performant joins and aggregates from it. There is some background overhead to this which makes Approach #1 more performant in an apples-to-apples comparison but the flexibility to run a wider variety of queries quickly without having to change ingest may make it worthwhile to you.