xquery basex

XQuery tumbling window: group by start item of first window


Using BaseX 9.7.3, I have a sorted list of names that has been produced using a tumbling window clause.

A snippet of the data looks like this:

<data>
  <group>
    <key id="0c7b0bca-0349-489c-b45f-2612f3134a76">ovid</key>
    <key id="f77ab9c2-0be3-4348-809d-ab245e630f81">ovid 43 b c-17 or 18 a d</key>
  </group>
  <group>
    <key id="39b9d6c2-85a5-4c72-a83e-2a52e548fc3b">ovid 43 bc</key>
    <key id="acf5b3c0-8fd4-4e0c-950b-a40683bab431">ovid 43 bc-17 ad</key>
    <key id="cc57be53-9ca8-4b5e-97cf-1aeca798cded">ovid 43 bc-17 ad or 18 a</key>
    <key id="8395e750-1e52-4152-9d37-8c8f4e389fd3">ovid 43 bc-17 ad or 18 ad</key>
  </group>
  <group>
    <key id="0be07fc6-d9bf-4d56-8352-1885b4dd6574">ovid 43 bc-17 or 18</key>
    <key id="e3aafc69-56b0-4632-a96c-26ca448c6c2d">ovid 43 bc-17 or 18 ad</key>
  </group>
  <group>
    <key id="f9615365-4a32-442b-9e20-9c5abb0e6fa0">ovide</key>
    <key id="c7b45a8d-79a3-4e79-b32b-8d918f67a7b0">ovide 0043 av j-c-0017</key>
  </group>
</data>

I would like to further group the data so that, in this example, a group would begin with "ovid" and end with "ovid 43 bc-17 or 18 ad".

Desired output:

<data>
  <group>
    <key id="0c7b0bca-0349-489c-b45f-2612f3134a76">ovid</key>
    <key id="f77ab9c2-0be3-4348-809d-ab245e630f81">ovid 43 b c-17 or 18 a d</key>  
    <key id="39b9d6c2-85a5-4c72-a83e-2a52e548fc3b">ovid 43 bc</key>
    <key id="acf5b3c0-8fd4-4e0c-950b-a40683bab431">ovid 43 bc-17 ad</key>
    <key id="cc57be53-9ca8-4b5e-97cf-1aeca798cded">ovid 43 bc-17 ad or 18 a</key>
    <key id="8395e750-1e52-4152-9d37-8c8f4e389fd3">ovid 43 bc-17 ad or 18 ad</key>  
    <key id="0be07fc6-d9bf-4d56-8352-1885b4dd6574">ovid 43 bc-17 or 18</key>
    <key id="e3aafc69-56b0-4632-a96c-26ca448c6c2d">ovid 43 bc-17 or 18 ad</key>
  </group>
  <group>
    <key id="f9615365-4a32-442b-9e20-9c5abb0e6fa0">ovide</key>
    <key id="c7b45a8d-79a3-4e79-b32b-8d918f67a7b0">ovide 0043 av j-c-0017</key>
  </group>
</data>

I have the following query, but it simply reproduces the input document:

<data>{
  for tumbling window $entry in /*/group/key  
  start $s at $sp previous $sprev next $snext when starts-with($snext, $s)
  end $e at $ep next $enext when not(starts-with($enext, $e)) 
  return  
    <group>{
      for $k in $entry
      return (
        <key id="{$k/@id}">{data($k)}</key>
      )      
    }</group>         
}</data>

Is it possible to compare the start item of the first group ("ovid") to subsequent entries that start with that token? I want to exclude "ovide", even though it starts with "ovid".


Solution

  • With extended (Java-like) regular expressions, as supported in Saxon through the ';j' flag, I think the following query

    for tumbling window $w in /data/group/key
    start $s when true()
    end next $n when not(matches($n, '^' || $s || '\b', ';j'))
    return 
      <group>{$w}</group>
    

    gives the two groups you want.

    I have now checked that the ';j' flag also works in BaseX 9.7.2.
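
    The key difference from the query in the question is that the end condition compares the next item against the window's start item ($s) rather than against the current end item ($e). If the processor-specific ';j' flag is not available, a rough equivalent can be sketched with plain string functions. This is only a sketch: it assumes that a key belongs to the group when it equals the start key or continues it with a single space, which is narrower than the '\b' word boundary used above, and it wraps the result in the <data> element from the desired output.

    ```xquery
    (: Sketch under stated assumptions: no regex flag, space-separated tokens. :)
    <data>{
      for tumbling window $w in /data/group/key
      start $s when true()
      (: Close the window when the next key is neither the start key itself
         nor the start key followed by a space, so "ovide" ends the "ovid" group. :)
      end next $n when not($n = $s or starts-with($n, $s || ' '))
      return
        <group>{$w}</group>
    }</data>
    ```

    Because the start key is never interpolated into a regular expression here, keys that contain regex metacharacters need no escaping.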