Iterate over indices of codepoints in JavaScript string


How can one iterate over the indices of codepoints and their values in a JavaScript string?

For example, [...codepointIndices("H𝟘ello")] should output:

[[0, "H"], [1, "𝟘"], [3, "e"], [4, "l"], [5, "l"], [6, "o"]]

One can iterate over codepoints in JavaScript with String.prototype[Symbol.iterator], but there does not appear to be any built-in way to include the indices of each codepoint.


Solution

  • We can accumulate the codepoint lengths:

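    function* codepointIndices(s) {
        let i = 0; // index of the current codepoint, in UTF-16 code units
        // for...of uses String.prototype[Symbol.iterator], which yields whole codepoints
        for (const c of s) {
            yield [i, c];
            i += c.length; // 1 for a BMP character, 2 for a surrogate pair
        }
    }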
    

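    This works because String.prototype[Symbol.iterator] iterates over the codepoints of the string rather than over its UTF-16 code units. To get the indices, we accumulate the length of each yielded codepoint as we go. JavaScript strings are sequences of UTF-16 code units, and length counts those units, so each codepoint has a length of 1 (a character in the Basic Multilingual Plane) or 2 (a surrogate pair). The indices produced are therefore ordinary string indices, usable with methods such as codePointAt and slice.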

    If you want an array from this generator, spread it:

    let array = [...codepointIndices(s)];
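
As a quick check of the approach above, here is a minimal sketch that runs the generator on a sample string and confirms that each yielded index really is the position of its codepoint in the original string. The sample strings and the variable names (s, pairs, aligned, lone) are chosen for illustration only; U+1D7D8 is just an arbitrary character outside the Basic Multilingual Plane.

```javascript
// Same generator as in the answer, repeated so the snippet runs on its own.
function* codepointIndices(s) {
    let i = 0;
    for (const c of s) {
        yield [i, c];
        i += c.length;
    }
}

// "\u{1D7D8}" is stored as a surrogate pair, so it occupies indices 1 and 2.
const s = "H\u{1D7D8}ello";
const pairs = [...codepointIndices(s)];
console.log(pairs.map(([i]) => i)); // [ 0, 1, 3, 4, 5, 6 ]

// Each index points at the start of its codepoint: slicing the original
// string at that index gives back the yielded codepoint.
const aligned = pairs.every(([i, c]) => s.slice(i, i + c.length) === c);
console.log(aligned); // true

// A lone (unpaired) surrogate is yielded by the string iterator as a single
// code unit, so it advances the index by 1.
const lone = [...codepointIndices("a\uD83Db")];
console.log(lone.map(([i]) => i)); // [ 0, 1, 2 ]
```

Because the indices are in UTF-16 code units, they can be passed directly to s.codePointAt(i) or s.slice(i), which would not be true of a plain codepoint counter (0, 1, 2, ...).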