xslt saxon icu xslt-3.0 uca

Sort strings, treating hyphen, slash, and space as equal, using UCA collation


Problem

I'm using Saxon-EE 11 and my platform's language is en-us.

I'm attempting to implement custom sorting behavior for an <xsl:sort> instruction by specifying a UCA collation. Ignoring the XML document details and just getting to the core, string-by-string comparison question, I want these strings:

ABSENTEES
ABSENTEE VOTING
MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)
MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT
MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD
MINNEAPOLIS
MINNEAPOLIS PORT AUTHORITY

to be sorted into this order:

ABSENTEE VOTING
ABSENTEES
MINNEAPOLIS
MINNEAPOLIS PORT AUTHORITY
MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD
MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT
MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)

Attempting to render the rules into English:

  1. A string that shares a common prefix with another string, but diverges from it at a space, should sort before that other string (ABSENTEE VOTING before ABSENTEES).
  2. Hyphens and slashes should be considered the same as spaces.

What I've tried

The UCA collation http://www.w3.org/2013/collation/UCA?alternate=shifted handles the MINNEAPOLIS* strings correctly, but it will put ABSENTEES before ABSENTEE VOTING.

The bare UCA collation http://www.w3.org/2013/collation/UCA handles ABSENTEES and ABSENTEE VOTING correctly, but will place the MINNEAPOLIS/SAINT PAUL and MINNEAPOLIS-SAINT PAUL strings after anything with MINNEAPOLIS and a space character.
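
For concreteness, here is a minimal sketch of how those two attempts can be reproduced. The /root and string element names are hypothetical stand-ins for the real document, and the collation parameter is only there so that either URI can be passed in without editing the stylesheet:

    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <!-- Hypothetical parameter: override with the bare
             http://www.w3.org/2013/collation/UCA URI to compare the two. -->
        <xsl:param name="collation"
            select="'http://www.w3.org/2013/collation/UCA?alternate=shifted'"/>
        <xsl:output method="xml" indent="yes"/>

        <xsl:template match="/root">
            <output>
                <xsl:perform-sort select="string">
                    <!-- collation is an attribute value template on xsl:sort -->
                    <xsl:sort select="." collation="{$collation}"/>
                </xsl:perform-sort>
            </output>
        </xsl:template>
    </xsl:stylesheet>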

I've tried a few other combinations of parameters, but none of them has produced anything closer to what I'm looking for. I'm close to giving up and either pre-processing the strings before applying the collation or dropping into a Java implementation.

If what I'm looking for is truly not achievable with UCA collations, that's good to know.


Solution

  • Using an input of:

    XML

    <root>
        <string>ABSENTEES</string>
        <string>ABSENTEE VOTING</string>
        <string>MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)</string>
        <string>MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT</string>
        <string>MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD</string>
        <string>MINNEAPOLIS</string>
        <string>MINNEAPOLIS PORT AUTHORITY</string>
    </root>
    

    and the following stylesheet:

    XSLT 2.0

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="yes"/>

        <xsl:template match="/root">
            <output>
                <!-- Sort on a key in which '-' and '/' are replaced by spaces;
                     the original strings are what gets output. -->
                <xsl:perform-sort select="string">
                    <xsl:sort select="translate(., '-/', '  ')"/>
                </xsl:perform-sort>
            </output>
        </xsl:template>

    </xsl:stylesheet>
    

    I get:

    Result

    <?xml version="1.0" encoding="UTF-8"?>
    <output>
       <string>ABSENTEE VOTING</string>
       <string>ABSENTEES</string>
       <string>MINNEAPOLIS</string>
       <string>MINNEAPOLIS PORT AUTHORITY</string>
       <string>MINNEAPOLIS/SAINT PAUL HOUSING FINANCE BOARD</string>
       <string>MINNEAPOLIS-SAINT PAUL INTERNATIONAL AIRPORT</string>
       <string>MINNEAPOLIS TEACHERS RETIREMENT FUND ASSOCIATION (MTRFA)</string>
    </output>
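
    The xsl:sort above names no collation, so it relies on the processor's default. As a sketch only, the same sort key can be combined with an explicitly named collation. This assumes the bare UCA collation sorts a space before any letter, which is what the question reports for ABSENTEE VOTING and ABSENTEES. The collation attribute is an addition to the stylesheet above:

    XSLT 3.0

    <xsl:stylesheet version="3.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="yes"/>

        <xsl:template match="/root">
            <output>
                <xsl:perform-sort select="string">
                    <!-- Same key as before: '-' and '/' become spaces.
                         Assumption: the bare UCA collation (spaces not
                         ignored) then orders the keys as required. -->
                    <xsl:sort select="translate(., '-/', '  ')"
                        collation="http://www.w3.org/2013/collation/UCA"/>
                </xsl:perform-sort>
            </output>
        </xsl:template>

    </xsl:stylesheet>

    Since only the sort key is translated, the hyphens and slashes survive in the output. If plain codepoint ordering is enough, the standard URI http://www.w3.org/2005/xpath-functions/collation/codepoint can be named in the collation attribute instead.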