javascript json regex unicode

How to convert unicode characters into corresponding emojis?


I'm doing something similar to this website with my data. My strings contain escape sequences in the format shown below, and the following code, which treats each character as one UTF-8 byte and decodes the result into a readable string, works.

function decodeFBEmoji (fbString) {
  // Convert String to Array of hex codes
  const codeArray = (
    fbString  // starts as '\u00f0\u009f\u0098\u00a2'
    .split('')
    .map(char => (
      char.charCodeAt(0)  // convert '\u00f0' to 0xf0
    ))
  );  // result is [0xf0, 0x9f, 0x98, 0xa2]

  // Convert plain JavaScript array to Uint8Array
  const byteArray = Uint8Array.from(codeArray);

  // Decode byte array as a UTF-8 string
  return new TextDecoder('utf-8').decode(byteArray);  // '😢'
}

I am trying to extract the escape sequences from a longer text string and replace them with their decoded form so that they display as proper emoji. I tried to extract the sequences with a regex, but the string prints as random garbage characters and the regex match returns null. How can I replace the encoded sequences with their emoji without changing the rest of the text?

function replaceEmoji(text){
      let str = "lorem ipsum lorem ipsum \u00e2\u009d\u00a4\u00ef\u00b8\u008f lorem ipsum"; 
      let res = str.match(/[\\]\w+/g); 
      console.log(str);
      console.log(res); //Result is null
}

Console output of the above code

Edit: Regex Pattern I tested


Solution

  • You're trying to decode some UTF-8, but you're mixing up JS string escapes and bytes.

    When you type \uXXXX in a string literal, you are writing an escape for a single Unicode character (just like \n is an escape for a newline), so this is true, for instance: "\u0041" == "A"

    This is why your regex cannot match anything: there is no actual backslash \ in the string. It's not clear in what form your UTF-8 data arrives, but if it is exactly as you wrote it, the string holds a series of UTF-8 bytes, one per character, which need to be decoded like so:

    const utf8 = new Uint8Array(
        Array.prototype.map.call(
            "lorem ipsum lorem ipsum \u00e2\u009d\u00a4\u00ef\u00b8\u008f lorem ipsum", 
            c => c.charCodeAt(0)
        )
    );
    console.log(new TextDecoder('utf8').decode(utf8));
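Decoding the whole string this way only works while every character fits in one byte. If the surrounding text may contain other characters, one option is to decode only the runs that look like raw UTF-8 bytes. The following is a minimal sketch of that idea, not a definitive implementation: `replaceMojibake` is a hypothetical helper name, and it assumes that any run of characters in the range U+0080 to U+00FF that forms valid UTF-8 is a mis-decoded sequence.

```javascript
// Hypothetical helper: decode runs of characters that are really UTF-8
// bytes, and leave the rest of the text untouched.
function replaceMojibake(text) {
  // fatal: true makes decode() throw on invalid UTF-8 instead of
  // silently inserting U+FFFD replacement characters.
  const decoder = new TextDecoder('utf-8', { fatal: true });

  // Assumption: candidate byte runs consist only of U+0080..U+00FF.
  return text.replace(/[\u0080-\u00ff]+/g, run => {
    // One character per byte: '\u00e2' becomes 0xe2, and so on.
    const bytes = Uint8Array.from(run, ch => ch.charCodeAt(0));
    try {
      return decoder.decode(bytes);
    } catch (e) {
      // Not valid UTF-8 (for example a genuine 'é'): keep the run as is.
      return run;
    }
  });
}

const input =
  'lorem ipsum lorem ipsum \u00e2\u009d\u00a4\u00ef\u00b8\u008f lorem ipsum';
console.log(replaceMojibake(input)); // 'lorem ipsum lorem ipsum ❤️ lorem ipsum'
```

Note that the regex matches the characters themselves rather than a backslash, since the escapes have already been resolved by the time the string exists. A run that happens to be valid UTF-8 but was meant literally would still be converted, so this heuristic suits data known to come from a source with this encoding problem.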