javascript, .net, string, string-comparison, case-sensitive

What is the equivalent in JavaScript of comparing strings in .NET with ordinal ignore case?


In .NET we have the ability to compare strings using an ordinal comparison while ignoring case. This is considered a best practice for string comparison, especially when multiple cultures could be involved.

I'm looking for the exact equivalent in JavaScript. There are a ton of answers about JS string comparison but I couldn't find much about ordinal comparison, never mind doing that AND ignoring case. I did find this question about comparing strings in an ordinal manner without ignoring case, but it's not clear how I would do so while ignoring the case, so I think this is a fundamentally different question (hence my new question here).

How do you compare strings in JavaScript in an ordinal manner while ignoring case, just like in .NET?


Solution

  • It's hard to prove a negative (one can always overlook something), but as far as I can tell there's no direct equivalent operation built into the standard runtime. Straight-up ordinal comparison is done by < and >, as you note in the question. Other comparisons are done via localeCompare, but while it has options to ignore case, it works with locale-specific rules, which you've said you don't want. (I did wonder if there's a locale specifier for "ordinal," but if there is, I haven't found it.)
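
    To make that distinction concrete, here is a small sketch of how the relational operators and localeCompare can disagree. The helper name compareBoth is made up for illustration, and the localeCompare results assume a runtime with Intl collation support for the "en" locale:

```javascript
// Hypothetical helper: returns both results side by side for illustration.
function compareBoth(a, b) {
    // < and > compare UTF-16 code units, so "B" (0x42) sorts before "a" (0x61).
    const ordinal = a < b ? -1 : a > b ? 1 : 0;
    // localeCompare applies locale collation rules; sensitivity "base"
    // makes it ignore both case and accents.
    const locale = Math.sign(a.localeCompare(b, "en", { sensitivity: "base" }));
    return { ordinal, locale };
}

console.log(compareBoth("a", "B")); // ordinal: 1, locale: -1
console.log(compareBoth("a", "A")); // ordinal: 1, locale: 0
```

    The case-insensitive answer from localeCompare comes from collation rules for a specific locale, which is exactly what an ordinal comparison is meant to avoid.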

    Given that, the closest we can get is to convert both sides to the same capitalization and compare the results with < and >. The toLowerCase/toUpperCase operations use locale-insensitive case mappings; as the ECMAScript specification puts it:

    "The result must be derived according to the locale-insensitive case mappings in the Unicode Character Database (this explicitly includes not only the file UnicodeData.txt, but also all locale-insensitive mappings in the file SpecialCasing.txt that accompanies it)."
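
    As a quick illustration of what "locale-insensitive" means in practice, the sketch below contrasts toLowerCase with toLocaleLowerCase. The helper name lowerCaseVariants is made up for the example, and the Turkish result assumes the runtime ships locale data for "tr":

```javascript
// Hypothetical helper: lowercases a string with and without a locale.
function lowerCaseVariants(s) {
    return {
        invariant: s.toLowerCase(),         // same mapping regardless of locale
        turkish: s.toLocaleLowerCase("tr"), // Turkish tailoring: "I" maps to dotless "ı"
    };
}

const r = lowerCaseVariants("I");
console.log(r.invariant === "i");    // true
console.log(r.turkish === "\u0131"); // true where Turkish locale data is available
```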

    So something along these lines:

    function ordinalCompareInsensitive(a, b) {
        const lowera = a.toLowerCase();
        const lowerb = b.toLowerCase();
        if (lowera === lowerb) {
            return 0;
        }
        if (lowera < lowerb) {
            return -1;
        }
        return 1;
    }
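
    A function with this shape can be passed straight to Array.prototype.sort. The sketch below repeats the function so that it runs on its own; the sample array is made up, and the order of case-insensitively equal entries relies on sort being stable (guaranteed since ES2019):

```javascript
function ordinalCompareInsensitive(a, b) {
    const lowera = a.toLowerCase();
    const lowerb = b.toLowerCase();
    if (lowera === lowerb) {
        return 0;
    }
    return lowera < lowerb ? -1 : 1;
}

const letters = ["b", "A", "a", "B"];

// Default sort compares code units, so all upper case letters come first.
const plain = [...letters].sort();
// With the comparator, "A"/"a" and "b"/"B" are ties and keep their input order.
const insensitive = [...letters].sort(ordinalCompareInsensitive);

console.log(plain);       // ["A", "B", "a", "b"]
console.log(insensitive); // ["A", "a", "b", "B"]
```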
    

    Without falling prey to premature optimization, I'll note that the above is slightly biased toward the strings matching (in that case it does only the === check and never reaches <). In the rare case where performance is critical, you might reorder the comparisons to suit the data you're comparing.

    Regardless of the order of comparisons, the above does (of course) have to completely convert both strings before it starts. Unless the strings being compared are extraordinarily long, I wouldn't expect you to get any benefit out of avoiding that (in favor of doing the loop yourself, converting each character as you go, and short-circuiting as soon as you found the answer). But that would be an option if you found a use case for it. You'd have to decide whether to compare code units or to do the slightly-more-complicated thing of comparing code points. I'd probably go with code points as they're more meaningful, but it depends on your use case. The < and > operators work at the code unit level, but they don't have to worry about correct case mappings.
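
    To show where code units and code points give different orderings, here is a small sketch. The two sample characters are chosen purely for illustration: U+FF5E (a single code unit) and U+1F600 (a surrogate pair whose first unit is 0xD83D). compareCodeUnits and compareCodePoints are illustrative names, not built-ins:

```javascript
// Code-unit order: this is what < and > do on strings.
function compareCodeUnits(a, b) {
    return a < b ? -1 : a > b ? 1 : 0;
}

// Code-point order: iterate by code point and compare the numeric values.
function compareCodePoints(a, b) {
    const itA = a[Symbol.iterator]();
    const itB = b[Symbol.iterator]();
    while (true) {
        const rA = itA.next();
        const rB = itB.next();
        if (rA.done) return rB.done ? 0 : -1;
        if (rB.done) return 1;
        const diff = rA.value.codePointAt(0) - rB.value.codePointAt(0);
        if (diff !== 0) return diff < 0 ? -1 : 1;
    }
}

const tilde = "\uFF5E";    // fullwidth tilde, one code unit
const emoji = "\u{1F600}"; // two code units: 0xD83D 0xDE00

console.log(compareCodeUnits(tilde, emoji));  // 1  (0xFF5E > 0xD83D)
console.log(compareCodePoints(tilde, emoji)); // -1 (0xFF5E < 0x1F600)
```

    Note that comparing two single-character strings with < is still a code-unit comparison; it is comparing the numeric values from codePointAt that changes the ordering.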

    Just for what it's worth, there are at least two ways to do it by code points: Using iterators, and using codePointAt. Here's an example of doing it with iterators (TypeScript, but with the type annotations commented out):

    // Again, I doubt you'd need to do your own loop for this, but just in case:
    function ordinalCompareInsensitive2(a/*: string*/, b/*: string*/)/*: number */ {
        const itA = a[Symbol.iterator]();
        const itB = b[Symbol.iterator]();
        let rA/*: IteratorResult<string, any>*/;
        let rB/*: IteratorResult<string, any>*/;
        while (true) {
            rA = itA.next();
            rB = itB.next();
            if (rA.done) {
                return rB.done ? 0 : -1;
            } else if (rB.done) {
                return 1;
            }
            const chA = rA.value.toLowerCase();
            const chB = rB.value.toLowerCase();
            if (chA < chB) {
                return -1;
            }
            if (chA > chB) {
                return 1;
            }
        }
    }
    

    Or with codePointAt (note that the index you pass is the index in code units, which is why the code moves the indexes on by the length of the character that was found [which may be multiple code units]):

    function ordinalCompareInsensitive3(a/*: string*/, b/*: string*/)/*: number */ {
        let indexA = 0;
        let indexB = 0;
    
        while (true) {
            if (indexA >= a.length) {
                return indexB >= b.length ? 0 : -1;
            } else if (indexB >= b.length) {
                return 1;
            }
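            // Read one full code point from each string (the index is in code units).
            const origA = String.fromCodePoint(a.codePointAt(indexA)/*!*/);
            const origB = String.fromCodePoint(b.codePointAt(indexB)/*!*/);
            const chA = origA.toLowerCase();
            const chB = origB.toLowerCase();
            if (chA < chB) {
                return -1;
            }
            if (chA > chB) {
                return 1;
            }
            // Advance by the length of the original character, not the lowercased
            // one: lowercasing can change the length (e.g. U+0130 becomes two code units).
            indexA += origA.length;
            indexB += origB.length;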
        }
    }
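
    The index arithmetic is easier to see in isolation. This sketch (listCodePoints is a made-up helper) walks a string with codePointAt and advances by one or two code units depending on whether the code point lies above U+FFFF:

```javascript
// Hypothetical helper: lists the code points of a string as hex strings.
function listCodePoints(s) {
    const result = [];
    let index = 0; // index in UTF-16 code units, which is what codePointAt expects
    while (index < s.length) {
        const cp = s.codePointAt(index);
        result.push(cp.toString(16));
        // Code points above U+FFFF occupy two code units (a surrogate pair).
        index += cp > 0xFFFF ? 2 : 1;
    }
    return result;
}

const sample = "a\u{1F600}b";
console.log(sample.length);          // 4 code units
console.log(listCodePoints(sample)); // ["61", "1f600", "62"]
console.log([...sample].length);     // 3 code points (the iterator steps by code point)
```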
    

    These are somewhat off-the-cuff and you'd want to audit them before using them, though I've tested them with some basic inputs:

    function ordinalCompareInsensitive(a/*: string*/, b/*: string*/)/*: number */ {
        const lowera = a.toLowerCase();
        const lowerb = b.toLowerCase();
        if (lowera === lowerb) {
            return 0;
        }
        if (lowera < lowerb) {
            return -1;
        }
        return 1;
    }
    
    function ordinalCompareInsensitive2(a/*: string*/, b/*: string*/)/*: number */ {
        const itA = a[Symbol.iterator]();
        const itB = b[Symbol.iterator]();
        let rA/*: IteratorResult<string, any>*/;
        let rB/*: IteratorResult<string, any>*/;
        while (true) {
            rA = itA.next();
            rB = itB.next();
            if (rA.done) {
                return rB.done ? 0 : -1;
            } else if (rB.done) {
                return 1;
            }
            const chA = rA.value.toLowerCase();
            const chB = rB.value.toLowerCase();
            if (chA < chB) {
                return -1;
            }
            if (chA > chB) {
                return 1;
            }
        }
    }
    
    function ordinalCompareInsensitive3(a/*: string*/, b/*: string*/)/*: number */ {
        let indexA = 0;
        let indexB = 0;
    
        while (true) {
            if (indexA >= a.length) {
                return indexB >= b.length ? 0 : -1;
            } else if (indexB >= b.length) {
                return 1;
            }
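            // Read one full code point from each string (the index is in code units).
            const origA = String.fromCodePoint(a.codePointAt(indexA)/*!*/);
            const origB = String.fromCodePoint(b.codePointAt(indexB)/*!*/);
            const chA = origA.toLowerCase();
            const chB = origB.toLowerCase();
            if (chA < chB) {
                return -1;
            }
            if (chA > chB) {
                return 1;
            }
            // Advance by the length of the original character, not the lowercased
            // one: lowercasing can change the length (e.g. U+0130 becomes two code units).
            indexA += origA.length;
            indexB += origB.length;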
        }
    }
    
    function usLocaleCompareInsensitive(a/*: string*/, b/*: string*/)/*: number */ {
        return a.localeCompare(b, "en-US", { sensitivity: "accent" });
    }
    
    function clampReturn(x/*: number */)/*: number */ {
        if (x === 0) {
            return x;
        }
        if (x < 0) {
            return -1;
        }
        return 1;
    }
    
    function test(a/*: string*/, b/*: string*/, expect/*: number*/) {
        const o1 = clampReturn(ordinalCompareInsensitive(a, b));
        const o2 = clampReturn(ordinalCompareInsensitive2(a, b));
        const o3 = clampReturn(ordinalCompareInsensitive3(a, b));
        const rUS = clampReturn(usLocaleCompareInsensitive(a, b));
        const result = o1 === o2 && o2 === o3 && o3 === rUS && rUS === expect ? "OK" : "<== ERROR !";
        console.log(`${a} vs. ${b}: ${o1} ${o2} ${o3} ${rUS} expect ${expect} ${result}`);
    }
    
    test("abc", "abc", 0);
    test("abc", "ABC", 0);
    test("abc", "aBC", 0);
    test("abc", "abcd", -1);
    test("abcd", "abc", 1);
    test("abc", "acd", -1);
    test("acd", "abc", 1);


    I've used toLowerCase in the above (as opposed to toUpperCase) because I've found characters that, after being converted from lower case to upper case and then back again, are not the same as they started out; but I haven't found the opposite to be true (converting from upper case to lower case and back again). That said, this obsolete warning in Visual Studio 2015 (but not later) says the opposite, that you should convert to upper case. It's an edge case, but you have to pick one or the other, so I picked toLowerCase based on my findings. Pick your poison. :-)

    FWIW, here's how I checked:

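    // Tell the in-snippet console not to throw away old log entries.
    // (console.config only exists in Stack Snippets, so the call is guarded.)
    if (typeof console.config === "function") {
        console.config({
            maxEntries: Infinity
        });
    }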
    
    const MAX_UNICODE = 0x10ffff;
    
    function ucode(ch/*: string*/) {
        return `\\u${ch
            .codePointAt(0)/*!*/
            .toString(16)
            .toUpperCase()
            .padStart(4, "0")}`;
    }
    
    function check(
        type/*: string*/,
        convertToMethod/*: "toLowerCase" | "toUpperCase"*/,
        convertBackMethod/*: "toLowerCase" | "toUpperCase"*/
    ) {
        console.log(
            `Checking '${type}' characters converted with ${convertToMethod} then back with ${convertBackMethod}:`
        );
        let found = 0;
        let mismatches = 0;
        for (let n = 1; n <= MAX_UNICODE; ++n) {
            const char = String.fromCodePoint(n);
            const converted = char[convertToMethod]();
            if (char !== converted) {
                ++found;
                if (
                    char.localeCompare(converted, "en-US", {
                        sensitivity: "accent",
                    }) !== 0
                ) {
                    ++mismatches;
                    const reconverted = converted[convertBackMethod]();
                    console.log(
                        `${char} vs ${converted} vs ${reconverted} (${ucode(
                            char
                        )} vs. ${ucode(converted)} vs. ${ucode(reconverted)})`
                    );
                }
            }
        }
        console.log(
            `Done, ${mismatches} mis-matches found (total "${type}" chars found: ${found})`
        );
    }
    
    check("upper case", "toLowerCase", "toUpperCase");
    check("lower case", "toUpperCase", "toLowerCase");

    That only checks each code point in isolation, though. In some languages, combinations of code points can be meaningful, but I haven't tried to allow for that in the simple test above. (But I did have it work at the code point level rather than just the code unit level.)
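
    For what it's worth, here are a few concrete spot checks of the edge cases discussed above, as a sketch rather than an exhaustive survey. The sample characters are chosen for illustration and roundTrip is a hypothetical helper. The last check shows a context-dependent mapping (Greek final sigma) where lowercasing a whole string differs from lowercasing each character on its own, so the whole-string and per-character functions can disagree on such input:

```javascript
// Hypothetical helper: applies one case conversion, then the other.
function roundTrip(s, first, second) {
    return s[first]()[second]();
}

// lower -> upper -> lower does not always return the original character.
console.log(roundTrip("\u00DF", "toUpperCase", "toLowerCase")); // "ss" (started as ß)
console.log(roundTrip("\u0131", "toUpperCase", "toLowerCase")); // "i"  (started as dotless ı)

// upper -> lower -> upper: the Kelvin sign (U+212A) comes back as a plain "K".
console.log(roundTrip("\u212A", "toLowerCase", "toUpperCase") === "K"); // true

// Final sigma: a capital sigma at the end of a word lowercases to U+03C2,
// but the same character converted on its own lowercases to U+03C3.
const word = "\u0391\u03A3"; // ΑΣ
const whole = word.toLowerCase();
const perChar = [...word].map((c) => c.toLowerCase()).join("");
console.log(whole === perChar); // false
```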