Tags: duplicates, unique, xquery, basex

XQuery - Filter deep child nodes for duplicates


I am trying to remove duplicates at a lower level beneath my elements, because the target system cannot process them. Unfortunately, I have had little success so far.

The XML has several <Article> children under <Articles>. Each <Article> element can have <UNIT> elements. These must be unique across the whole document, but only with respect to the <NR>/<COUNT> combination.

Given the following example:

<Articles>
    <Article>
        <A1>123</A1>
        <A2>456</A2>
        <UNIT>
            <NR>59</NR>
            <COUNT>3</COUNT>
            <TEXT>RANDOM Aqfwfqf</TEXT>
        </UNIT>
        <UNIT>
            <NR>59</NR>
            <COUNT>3</COUNT>
            <TEXT>RANDOM hrthe</TEXT>
        </UNIT>
        <UNIT>
            <NR>59</NR>
            <COUNT>59</COUNT>
            <TEXT>RANDOM cutrh</TEXT>
        </UNIT>
    </Article>
    <Article>
        <A1>351</A1>
        <A2>362</A2>
        <UNIT>
            <NR>59</NR>
            <COUNT>4</COUNT>
            <TEXT>RANDOM rtjrtf</TEXT>
        </UNIT>
        <UNIT>
            <NR>59</NR>
            <COUNT>3</COUNT>
            <TEXT>RANDOM jrtj</TEXT>
        </UNIT>
        <UNIT>
            <NR>59</NR>
            <COUNT>59</COUNT>
            <TEXT>RANDOM rtjrt</TEXT>
        </UNIT>
    </Article>
</Articles>

The result should look like:

<Articles>
    <Article>
        <A1>123</A1>
        <A2>456</A2>
        <UNIT>
            <NR>59</NR>
            <COUNT>3</COUNT>
            <TEXT>RANDOM Aqfwfqf</TEXT>
        </UNIT>
        <UNIT>
            <NR>59</NR>
            <COUNT>59</COUNT>
            <TEXT>RANDOM cutrh</TEXT>
        </UNIT>
    </Article>
    <Article>
        <A1>351</A1>
        <A2>362</A2>
        <UNIT>
            <NR>59</NR>
            <COUNT>4</COUNT>
            <TEXT>RANDOM rtjrtf</TEXT>
        </UNIT>
    </Article>
</Articles>

I tried string-joining the two values in each <UNIT> and then deleting the nodes, but ended up deleting all of the <UNIT> elements instead of leaving one.

Getting a distinct list and counting the occurrences worked, but I couldn't delete the excess nodes.

How can I reduce each node combination to a single occurrence?


Solution

  • For me, the following works:

    declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
    
    declare option output:method 'xml';
    declare option output:indent 'yes';
    
    declare context item := document {
    <Articles>
        <Article>
            <A1>123</A1>
            <A2>456</A2>
            <UNIT>
                <NR>59</NR>
                <COUNT>3</COUNT>
                <TEXT>RANDOM Aqfwfqf</TEXT>
            </UNIT>
            <UNIT>
                <NR>59</NR>
                <COUNT>3</COUNT>
                <TEXT>RANDOM hrthe</TEXT>
            </UNIT>
            <UNIT>
                <NR>59</NR>
                <COUNT>59</COUNT>
                <TEXT>RANDOM cutrh</TEXT>
            </UNIT>
        </Article>
        <Article>
            <A1>351</A1>
            <A2>362</A2>
            <UNIT>
                <NR>59</NR>
                <COUNT>4</COUNT>
                <TEXT>RANDOM rtjrtf</TEXT>
            </UNIT>
            <UNIT>
                <NR>59</NR>
                <COUNT>3</COUNT>
                <TEXT>RANDOM jrtj</TEXT>
            </UNIT>
            <UNIT>
                <NR>59</NR>
                <COUNT>59</COUNT>
                <TEXT>RANDOM rtjrt</TEXT>
            </UNIT>
        </Article>
    </Articles>
    };
    
    
    . transform with {
        delete node for $unit in //UNIT 
                    group by $nr := $unit/NR, $cnt := $unit/COUNT
                    return subsequence($unit, 2)
      }
    

    This runs the update on an in-memory context node. If your input is a document stored in a database, I think running just

        delete node for $unit in //UNIT 
                    group by $nr := $unit/NR, $cnt := $unit/COUNT
                    return subsequence($unit, 2)
    

    would work just fine.
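
    Before deleting anything, it can help to run the same grouping as a read-only query and check which NR/COUNT combinations have surplus units. The following is only a sketch: the shortened TEXT values and the <duplicate> report element are made up for illustration, and the context item is declared inline so the query runs on its own.

    ```xquery
    (: Illustrative input with the same NR/COUNT pairs as the question;
       the TEXT values are shortened placeholders. :)
    declare context item := document {
      <Articles>
        <Article>
          <UNIT><NR>59</NR><COUNT>3</COUNT><TEXT>a</TEXT></UNIT>
          <UNIT><NR>59</NR><COUNT>3</COUNT><TEXT>b</TEXT></UNIT>
          <UNIT><NR>59</NR><COUNT>59</COUNT><TEXT>c</TEXT></UNIT>
        </Article>
        <Article>
          <UNIT><NR>59</NR><COUNT>4</COUNT><TEXT>d</TEXT></UNIT>
          <UNIT><NR>59</NR><COUNT>3</COUNT><TEXT>e</TEXT></UNIT>
          <UNIT><NR>59</NR><COUNT>59</COUNT><TEXT>f</TEXT></UNIT>
        </Article>
      </Articles>
    };

    (: Read-only dry run: nothing is modified. :)
    for $unit in //UNIT
    (: After grouping, $unit holds every UNIT sharing this NR/COUNT pair. :)
    group by $nr := string($unit/NR), $cnt := string($unit/COUNT)
    (: Everything from the second item on is what the delete would remove. :)
    let $extra := subsequence($unit, 2)
    where exists($extra)
    return <duplicate nr="{$nr}" count="{$cnt}" removed="{count($extra)}"/>
    ```

    For this input it should report two combinations: 59/3 with two surplus units and 59/59 with one. The 59/4 combination occurs only once, so it is filtered out by the where clause. The order in which the groups are returned may differ between processors.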