Tags: javascript, utf-8, special-characters, diacritics, unicode-normalization

Special character returns wrong codepoint


We have a problem reading special characters from our database. Character encoding is configured as UTF-8 everywhere, and the database appears to store all characters correctly. For example, we have a word containing an é that looks fine on screen. The character also looks fine (with the accent) in the JSON response objects, and it even seems to survive copying to the clipboard and pasting. But when we do a strict comparison, é === é, the result is false. It turns out the stored character returns char code 101, which is e without the accent, rather than the expected char code 233.

The problem can be reproduced right here in this code snippet: both characters look the same, but one of them outputs 233 and the other 101.

// é = 233
// é = 101

console.log("é".charCodeAt());
console.log("é".charCodeAt());

Why is this happening? The accent is evidently stored in the database somehow (the character displays correctly), but the client does not receive the single character we expect. How can we make sure the character is read as an e with an accent, é (code 233)?


Solution

  • I found the answer here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

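    The second é is actually two characters: a plain e (code 101) followed by a combining character for the accent (typically U+0301, COMBINING ACUTE ACCENT). The solution is to call .normalize() on the string, which by default composes it into the NFC form (a single character with code 233); after that, strict comparison works.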

    // é = 233 (one single char)
    // é = 101 (actually 2 characters, first is "e")
    
    console.log("é".length);
    console.log("é".length);
    
    // Normalizing will return char 233 and then strict comparison will work.
    console.log("é".normalize() === "é");
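Since the two forms look identical on screen, it may help to write them with escape sequences and to normalize both sides before comparing. The following is a minimal sketch, not a definitive implementation: the helper names codePoints and equalNormalized are hypothetical (they do not come from the post above), and it assumes the stored value is e followed by U+0301.

```javascript
// Hypothetical helpers; the names are illustrative, not from the original post.

// List the code points of a string as 4-digit hex, which makes a
// combining mark visible even when the rendered text looks identical.
function codePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0).toString(16).padStart(4, "0"));
}

// Compare two strings after bringing both to the same normalization form.
// NFC is the default form used by String.prototype.normalize().
function equalNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

// Escape sequences keep the two forms distinct regardless of how an
// editor or clipboard treats the literal characters.
const composed = "\u00e9";    // é as a single code point (233)
const decomposed = "e\u0301"; // e (101) followed by U+0301 (assumed stored form)

console.log(codePoints(composed));   // one entry: "00e9"
console.log(codePoints(decomposed)); // two entries: "0065", "0301"

console.log(composed === decomposed);               // false
console.log(equalNormalized(composed, decomposed)); // true
```

Normalizing both operands, rather than only the value read from the database, avoids depending on which form user input or other sources happen to use. If the data is under your control, normalizing once when writing to the database is another option.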