javascript unicode array-splice multibyte-characters

Text replacements with splice do not work with smileys (or multibyte chars)


I have a problem with a complex replacement algorithm. In the end I was able to reduce the problem to this minimal code:

const input="test 🙄 hello test world"
let start = 0
let output = [...input]
const replacements = []
for (let end = 0; end <= input.length; end++) {
    const c = input[end]
    if (c == ' ') {
        if (start !== end) {
            const word = input.substring(start, end).toLowerCase()
            if (word == 'test') {
                replacements.push({start, length:(end - start), text:'REPLACEMENT'})
            }
        }
        start = end + 1
    }
}
for(let i=replacements.length-1;i>=0;i--) {
    output.splice(replacements[i].start, replacements[i].length, replacements[i].text)
}
console.log(output.join(''))

My input is "test 🙄 hello test world" and the expected output would be "REPLACEMENT 🙄 hello REPLACEMENT world", but it is actually "REPLACEMENT 🙄 hello tREPLACEMENTworld". I remember from working with the Twitter API that JavaScript handles byte positions and character indices in a strange way. So the issue is obviously caused by the smiley.

How can I fix my code so that the replacement works as expected? Bonus question: why is that happening?


Solution

  • Well that was quick:

    const input="test 🙄 hello test world"
    let start = 0
    let output = [...input]
    const replacements = []
    for (let end = 0; end <= output.length; end++) {
        const c = output[end]
        if (c == ' ') {
            if (start !== end) {
                const word = output.slice(start, end).join('').toLowerCase()
                if (word == 'test') {
                    replacements.push({start, length:(end - start), text:'REPLACEMENT'})
                }
            }
            start = end + 1
        }
    }
    for(let i=replacements.length-1;i>=0;i--) {
        output.splice(replacements[i].start, replacements[i].length, replacements[i].text)
    }
    console.log(output.join(''))
    

    When I use the output array as the input, the indices work as expected and my replacement works again. However, I will accept the answer of anyone who can explain why that change is required.
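
As a rough illustration of why the change matters: a JavaScript string is indexed by UTF-16 code units, while spreading it with `[...input]` iterates by code points. The smiley 🙄 (U+1F644) lies outside the Basic Multilingual Plane, so it takes two code units in the string but only one element in the array. Every string index after the smiley is therefore one higher than the matching array index. The variable names below are only for this sketch and are not part of the original code:

```javascript
const input = "test 🙄 hello test world"
const codePoints = [...input] // string iteration yields whole code points

// .length and [] on a string count UTF-16 code units; 🙄 (U+1F644) takes two.
const unitLength = input.length        // 24
const pointLength = codePoints.length  // 23

// Position of the second "test" in each view.
const unitIndex = input.indexOf("test", 1)                // 14
const pointIndex = [...input.slice(0, unitIndex)].length  // 13

console.log(unitLength, pointLength) // 24 23
console.log(unitIndex, pointIndex)   // 14 13

// input[5] is only the first half of the smiley (a lone surrogate).
console.log(input[5] === "🙄", codePoints[5] === "🙄") // false true
```

In the question's code the start offsets were computed on the string (14 for the second "test") but applied to the array, where that word starts at 13. The splice therefore removed "est " instead of "test", which matches the "tREPLACEMENTworld" output. Computing and applying the offsets on the same representation, as in the solution above, avoids the mismatch.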