xml database xquery basex xml-database

How to optimize XQuery fn:count() in FLWOR (Parallelize)?


I'm using the BaseX XML database and have a lot of XML data, approximately 50 000 files of various sizes. However, one of the local functions I have implemented is too computationally heavy. Unfortunately, it is crucial to my work.

Let us assume I have 50 000 files, one per Student, and every Student has an attribute called friend. For each Student, I want to find out how many friends the Student has.

Here is some example code:

declare variable $context := /Students;

(: Counts the students whose @friend attribute equals this student's name. :)
declare function local:CalculateFriends($student)
{
  let $studentName := $student/@Name
  return fn:count($context[@friend = $studentName])
};

for $s in $context
let $numberOfFriends := local:CalculateFriends($s)
return <Student Name = '{$s/@Name}' NumberOfFriends = '{$numberOfFriends}' />

This code works fine for a single student. For 1000 students it takes approximately 5 minutes; for all 50 000 it either crashes or times out, and I cannot debug it. I left it running overnight and came back to no result.

Is there a way to optimize this? The predicate @friend = $studentName should make use of the attribute index (it is enabled). Having taken a course on parallel programming at university, my first thought was to split the count and the FLWOR expression into chunks and run them in parallel, similar to OpenMP. But after some research, BaseX does not seem to support parallelized queries.

Does anyone have an idea how to approach this problem?

Thanks!

EDIT: Example of XML structure

<Student Name="Kevin" friend="Alvin" BirthDate="1985-06-29" etc..>
  <More meta data> ....... />
</Student>

Solution

  • This can be treated as a grouping problem in which the members of each group have to be counted, so you could try whether

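    (: Adjust @Friend to the exact (case-sensitive) attribute name in your data. :)
    let 
      $friendsMap as map(xs:string, xs:integer) := 
        map:merge(
            for $student in $context
            group by $friend := $student/@Friend/string()
            return map { $friend : count($student) }
        )
    for $s in $context
    (: Names that nobody lists as a friend are missing from the map, so default to 0. :)
    return <Student Name = '{$s/@Name}' NumberOfFriends = '{($friendsMap($s/@Name), 0)[1]}' />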
    

    performs better, given that grouping is usually implemented with keys, so the counts are built in one pass over the students instead of one search per student.

    I have no idea whether it helps with BaseX and this particular problem, but I'm posting it as an answer rather than a comment so that the code is readable.

    The only other issue in your posted code snippets seems to be that the sample has an attribute spelled Friend while the XPath searches for @friend. Names in XPath are case-sensitive, so I'm not sure whether that is a typo in the question or perhaps the reason why the index doesn't work.
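
    To try the grouping idea without touching the database, here is a minimal self-contained sketch. It assumes XQuery 3.1 (for maps and group by) and the lowercase friend spelling; the variable $students and its three inline elements are hypothetical stand-ins for the real data, not part of the original question.

    ```xquery
    xquery version "3.1";

    (: Hypothetical inline sample standing in for the real database content. :)
    declare variable $students := (
      <Student Name="Kevin" friend="Alvin"/>,
      <Student Name="Alvin" friend="Kevin"/>,
      <Student Name="Maria" friend="Kevin"/>
    );

    (: One pass: group the students by the name in @friend and count each group. :)
    let $friendsMap := map:merge(
      for $student in $students
      group by $friend := string($student/@friend)
      return map { $friend : count($student) }
    )
    for $s in $students
    (: A name that nobody points to is absent from the map, so fall back to 0. :)
    return <Student Name="{$s/@Name}" NumberOfFriends="{($friendsMap(string($s/@Name)), 0)[1]}"/>
    ```

    With this sample, Kevin is named by two students, Alvin by one, and Maria by none, so the result should list 2, 1 and 0 respectively. Whether this is faster than the indexed lookup on the full 50 000 documents would need to be measured in BaseX itself.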