javascript unicode emoji surrogate-pairs

Which Unicode character (emoji) was this originally?


I have this string in a text file: ├░┬č┬Ź┬ć

All I know is that it was originally an emoji, or at least some character represented by surrogates, i.e. a JavaScript string of length 2 or 4.

For some reason it ended up in this form. (It was read from a MySQL database that uses utf8_general_ci, through a node.js mysql2 connection with charset latin1_swedish_ci.)

How can I find out which emoji it was? Is that even possible?

Other examples:

├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á

An algorithm written in JS would be ideal.


Solution

  • It's double mojibake, as the following Python snippet shows (sorry, I cannot give a JavaScript equivalent):

    print('🍆 💦 😈 🥵'.
          encode('utf-8').decode('latin1').  # 1st mojibake stage
          encode('utf-8').decode('cp852')    # 2nd mojibake stage
        )                                    # ├░┬č┬Ź┬ć ├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á
    

    Possible repair (although prevention is better than cure):

    print('├░┬č┬Ź┬ć ├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á'.
          encode('cp852').decode('utf-8').       # fix 2nd mojibake stage
          encode('latin1').decode('utf-8')       # fix 1st mojibake stage
        )                                        # 🍆 💦 😈 🥵
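
Since the question asked for JavaScript, here is a rough sketch of the same two repair steps. It is not part of the original answer. `TextDecoder` has no CP852 support, so the sketch carries its own hand-transcribed partial CP852 table (an assumption worth checking against an authoritative chart). Only bytes 0x80..0xBF, 0xC2 and 0xC3 are mapped, because those are the only non-ASCII bytes the UTF-8 encoding of Latin-1 text can contain. The names `repairDoubleMojibake` and `toBytes` are hypothetical.

```javascript
// Partial CP852 table: the characters CP852 assigns to bytes 0x80..0xBF.
// Transcribed by hand (assumption); verify before relying on it.
const CP852_80_TO_BF =
  'ÇüéâäůćçłëŐőîŹÄĆ' + // 0x80..0x8F
  'ÉĹĺôöĽľŚśÖÜŤťŁ×č' + // 0x90..0x9F
  'áíóúĄąŽžĘę¬źČş«»' + // 0xA0..0xAF
  '░▒▓│┤ÁÂĚŞ╣║╗╝Żż┐';  // 0xB0..0xBF

// Reverse lookup: CP852 character -> byte value.
const cp852ToByte = new Map();
[...CP852_80_TO_BF].forEach((ch, i) => cp852ToByte.set(ch, 0x80 + i));
// UTF-8 lead bytes for U+0080..U+00FF, the only lead bytes stage 1 can produce.
cp852ToByte.set('┬', 0xc2);
cp852ToByte.set('├', 0xc3);

// Hypothetical helper: turn a string back into bytes, one byte per character.
function toBytes(text, lookup) {
  return Uint8Array.from([...text], (ch) => {
    const code = ch.codePointAt(0);
    if (code < 0x80) return code; // ASCII is the same in every encoding used here
    const byte = lookup(ch, code);
    if (byte === undefined) {
      throw new RangeError(`Cannot map ${JSON.stringify(ch)} back to a byte`);
    }
    return byte;
  });
}

// fatal: true makes invalid UTF-8 throw instead of silently yielding U+FFFD.
const utf8Decoder = new TextDecoder('utf-8', { fatal: true });

function repairDoubleMojibake(garbled) {
  // Undo the 2nd stage: CP852 characters -> bytes -> decode as UTF-8.
  const stage1 = utf8Decoder.decode(toBytes(garbled, (ch) => cp852ToByte.get(ch)));
  // Undo the 1st stage: Latin-1 characters -> bytes -> decode as UTF-8.
  return utf8Decoder.decode(
    toBytes(stage1, (ch, code) => (code <= 0xff ? code : undefined))
  );
}

console.log(repairDoubleMojibake('├░┬č┬Ź┬ć ├░┬č┬ĺ┬Ž ├░┬č┬ś┬ł ├░┬č┬ą┬Á')); // 🍆 💦 😈 🥵
```

Input that is not double mojibake of this exact kind makes the sketch throw rather than return a guess, which seems the safer behaviour when cleaning up stored data.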
    

    FYI, these are the emojis involved. The CodePoint column gives the Unicode code point (U+hhhh) followed by the UTF-8 bytes; the Description column gives the character name followed by the UTF-16 surrogate pair in parentheses:

    Char CodePoint                      Description
    ---- ---------                      -----------
    🍆   {U+1F346, 0xF0,0x9F,0x8D,0x86} AUBERGINE               (0xd83c,0xdf46)
    💦   {U+1F4A6, 0xF0,0x9F,0x92,0xA6} SPLASHING SWEAT SYMBOL  (0xd83d,0xdca6)
    😈   {U+1F608, 0xF0,0x9F,0x98,0x88} SMILING FACE WITH HORNS (0xd83d,0xde08)
    🥵   {U+1F975, 0xF0,0x9F,0xA5,0xB5} OVERHEATED FACE         (0xd83e,0xdd75)
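
The numeric columns of the table above can be reproduced in JavaScript with standard string and `TextEncoder` calls. This is a minimal sketch, not part of the original answer. `describeChar` is a hypothetical helper name, and character names are left out because JavaScript has no built-in way to look them up.

```javascript
// Hypothetical helper: report code point, UTF-8 bytes and UTF-16 code units
// for a single character.
function describeChar(ch) {
  const hex = (n, width) => n.toString(16).padStart(width, '0');
  return {
    char: ch,
    // codePointAt(0) combines a surrogate pair into one code point.
    codePoint: 'U+' + hex(ch.codePointAt(0), 4).toUpperCase(),
    // TextEncoder always encodes to UTF-8.
    utf8Bytes: [...new TextEncoder().encode(ch)].map(
      (b) => '0x' + hex(b, 2).toUpperCase()
    ),
    // charCodeAt(i) exposes the individual UTF-16 code units (the surrogates).
    utf16Units: Array.from(
      { length: ch.length },
      (_, i) => '0x' + hex(ch.charCodeAt(i), 4)
    ),
  };
}

// for...of iterates by code point, so each emoji arrives as one 2-unit string.
for (const ch of '🍆💦😈🥵') {
  console.log(describeChar(ch));
}
```

This also explains the "length 2" in the question: each of these emojis lies outside the Basic Multilingual Plane, so JavaScript stores it as two UTF-16 code units.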