marklogicmarklogic-dhf

How to speed up Harmonization in MarkLogic Data Hub


I am currently doing development work for the harmonization of our data. I noticed that completion of the harmonization is a slow.
I'm relatively new to the NoSQL / MarkLogic development and not sure of the best practices to follow in order for a smooth, faster harmonization.

Here are some facts:

Data Load:

  1. Data loaded in staging came from ERP system using a relational database. Data are extracted into CSV and loaded in MarkLogic
  2. Each of the relational table data are extracted into a CSV file. Each table are represented in separate entity.

Post harmonization:

  1. 227,826 records took about 66 mins to complete harmonization
  2. 1074151 records took about 4 hours and 19 mins to complete harmonization

Harmonization Code snippets:

  1. There are numerous date calculations logic (sample below)

function getScheduleWindowEnd(businessUnit,targetDateString,schEndDateString)
   {
   	var scheduleWindowEnd = new String();
    var preferredDate = new Date(); 
    var startDayOfWeek = getBUStartDayOfWeek(businessUnit);
    
    if (fn.empty(targetDateString) || targetDateString == null || targetDateString == "" || 
       fn.empty(schEndDateString) || schEndDateString == null || schEndDateString == "")
    {
     tempScheduleWindowEnd = "";
     return "";
    }
    else
    {
      targetDateString = fn.replace(targetDateString, "/", "-") ;
      schEndDateString = fn.replace(schEndDateString,"/","-");

      var targetDate = xs.date(targetDateString);
      var schEndDate = xs.date(schEndDateString);
    
      // Get preferred date
      if (fn.empty(schEndDate)) 
      {
        preferredDate = targetDate;
      }
      else
      {
       preferredDate = schEndDate;
      }
      
      //get target day of week
      var scheduledDayOfWeek = xdmp.weekdayFromDate(preferredDate);
    
      if (scheduledDayOfWeek < startDayOfWeek)
      {
      scheduleWindowEnd = fn.string(addDays(preferredDate,(startDayOfWeek-scheduledDayOfWeek)));  
      }
      else
      {
      scheduleWindowEnd = fn.string(addDays(preferredDate,(startDayOfWeek-scheduledDayOfWeek+7)));
      }
      
      scheduleWindowEnd = fn.replace(fn.substring(scheduleWindowEnd, 1, 10), "-", "/");
      tempScheduleWindowEnd = scheduleWindowEnd;
    }
     
     return scheduleWindowEnd
   }

  1. The main Entity get some of the element data from the other entities (in the example below, Table2 is the other entity)

<StatusDescription>${fn.normalizeSpace(getUDCDescription("00", "SS", fn.normalizeSpace(hl.elementText(source, "WASRST", true))))}</StatusDescription>

function getUDCDescription(drsy,drrt,drky) {
     let udcRecord =  cts.search(cts.andQuery([
       cts.collectionQuery("ERPSystemSource"),
       cts.collectionQuery("Table2"),
       cts.elementWordQuery(xs.QName("DRSY"), drsy),
       cts.elementWordQuery(xs.QName("DRRT"), drrt),
       cts.elementWordQuery(xs.QName("DRKY"), drky)
     ]))
     
     let docXML = new String();
     for (const item of udcRecord) {
       docXML += hl.encodeXml(fn.normalizeSpace(hl.elementText(item, "DRDL01", true)))
     }
     return docXML;
   }

  1. Some of the harmonized data are one to one (direct fetch). See sample below:

        <Element1>${hl.elementText(source, "WADOCO", true)}</Element1>
        <Element2>${fn.normalizeSpace(hl.elementText(source, "WAMCU", true))}</Element2>

  1. There are numerous for loops call (not nested), about 20 calls. Sample in #2 above:

Solution

  • I would strongly recommend if possible that you consider working with a MarkLogic representative on this problem. Improving software performance can be complex and it's best to have a working relationship with someone who can go back and forth with you.

    The first question I always ask is : Well, what is your expected SLA? Until you have a clear expectation set of what you think performance should look like I can't tell you that this is slow or fast or if your expectation is realistic or unrealistic.

    In my experience, performance issues tend to fall into one of two categories : Software or Infrastructure bottlenecks. Since the extrapolation in time from 200k to 1m records seems linear, I would expect that your bottleneck is not a severe software issue.

    The first thing I would do is check the MarkLogic monitoring history and determine whether you are fully utilizing your infrastructure. If not, try increasing the threadcounts and batch sizes of your harmonization workloads so that your infrastructure is fully utilized.

    If you are fully utilizing your infrastructure, you can either upgrade your infrastructure or you can start to look at improving your software.

    Based off your code, here are several suggestions you can look into to improve your software:

    1. Make your cts.search calls unfiltered (when possible)
    2. Limit cts.search calls where possible. I get the sense you may be doing more than just one per job.
    3. Consider using cts.elementValues in place of cts.search if you only need to pull one element from a document
    4. Normalize space and other string functions can be heavy on large strings of text. If you're working with large strings consider whether you can cut down the amount of times you use them.
    5. I suggest implementing a unique XML namespace on the document for each type of table you pull in so you don't need the collection queries.