powershell unicode emoji surrogate-pairs

Splitting an emoji sequence in PowerShell


I have a text box that will be filled with emoji only, with no spaces or other characters of any kind. I need to split these emoji in order to identify them. This is what I have tried:

function emoji_to_unicode(){
    foreach ($emoji in $textbox.Text) {
        $unicode = [System.Text.Encoding]::Unicode.GetBytes($emoji)
        Write-Host $unicode
    }
}

Instead of printing the bytes of each emoji separately, the loop runs just once and prints the codes of all the emoji joined together, as if all the emoji were a single item. I tested with 6 emoji, and instead of getting this:

61 216 7 222

61 216 67 222

61 216 10 222

61 216 28 222

61 216 86 220

60 216 174 223

I'm getting this:

61 216 7 222 61 216 67 222 61 216 10 222 61 216 28 222 61 216 86 220 60 216 174 223

What am I missing?


Solution

  • foreach treats a string as a single element, so the loop runs only once. To iterate over its characters, cast it to a character array.

    foreach ($i in 'hithere') { $i }
    hithere
    
    foreach ($i in [char[]]'hithere') { $i }
    h
    i
    t
    h
    e
    r
    e
    

    Hmm, this doesn't work well for emoji. Their code points are above U+FFFF (U+1F600 etc.), so each one does not fit in a single 16-bit [char] and is stored as two of them.

    foreach ($i in [char[]]'😀😁😂😃😄😅😆') { $i }       
    �  # 16 bit surrogate pairs?
    �
    �
    �
    �
    �
    �
    �
    �
    �
    �
    �
    �
    �
    

    OK, so every pair of [char]s has to be combined. Here's another way to do it, using the surrogate formula from https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates (or just use [char]::ConvertToUtf32($emojis, $i)):

    $emojis = '😀😁😂😃😄😅😆'
    for ($i = 0; $i -lt $emojis.length; $i += 2) {
      [System.Char]::IsHighSurrogate($emojis[$i])
      0x10000 + ($emojis[$i] - 0xD800) * 0x400 + $emojis[$i+1] - 0xDC00 | % tostring x
      # [system.char]::ConvertToUtf32($emojis,$i) | % tostring x  # or
      $emojis[$i] + $emojis[$i+1]
    }
    
    
    True
    1f600
    😀
    True
    1f601
    😁
    True
    1f602
    😂
    True
    1f603
    😃
    True
    1f604
    😄
    True
    1f605
    😅
    True
    1f606
    😆
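
    The loop above steps by two because the input holds only emoji. As a minimal sketch for mixed input, the helper below consumes two [char]s only where [char]::IsSurrogatePair() reports a pair. Split-CodePoint is a hypothetical name used for illustration; it is not part of the original answer or of PowerShell.

    ```powershell
    # Hypothetical helper: split a string into code points, taking two
    # [char]s only when they form a high/low surrogate pair.
    function Split-CodePoint {
        param([string]$Text)

        $i = 0
        while ($i -lt $Text.Length) {
            if ([char]::IsSurrogatePair($Text, $i)) {
                $Text.Substring($i, 2)   # surrogate pair: one code point
                $i += 2
            }
            else {
                $Text.Substring($i, 1)   # ordinary 16-bit character
                $i += 1
            }
        }
    }

    Split-CodePoint 'a😀b😁' | ForEach-Object {
        # Show each piece together with its code point in hex
        '{0} U+{1:X}' -f $_, [char]::ConvertToUtf32($_, 0)
    }
    # a U+61
    # 😀 U+1F600
    # b U+62
    # 😁 U+1F601
    ```

    A lone (unpaired) surrogate would be returned as a single [char] here, and ConvertToUtf32() would then throw on it, so the sketch assumes well-formed input.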
    

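    Note that Unicode in the [System.Text.Encoding]::Unicode.GetBytes() call refers to the UTF-16LE encoding, which is why each emoji shows up as 4 bytes (two 16-bit values, low byte first).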
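
    To get the per-emoji byte output the question asks for, one option is to take two [char]s at a time and encode each piece on its own. This is only a sketch: it assumes every emoji is exactly one surrogate pair, and $emojis stands in for $textbox.Text from the question.

    ```powershell
    # Sketch: print the UTF-16LE bytes of each emoji on its own line.
    # Assumes every emoji occupies exactly two [char]s (one surrogate pair).
    $emojis = '😀😁😂'   # stand-in for $textbox.Text

    $lines = for ($i = 0; $i -lt $emojis.Length; $i += 2) {
        $emoji = $emojis.Substring($i, 2)
        $bytes = [System.Text.Encoding]::Unicode.GetBytes($emoji)
        $bytes -join ' '
    }
    $lines
    # 61 216 0 222
    # 61 216 1 222
    # 61 216 2 222
    ```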

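    Chinese works with the [char[]] cast, because these characters are below U+10000 and each fits in a single 16-bit [char].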

    [char[]]'嗨,您好'
    嗨
    ,
    您
    好
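
    The difference is visible in the string lengths. A small sketch comparing two of the Chinese characters above with one emoji:

    ```powershell
    # Characters below U+10000 take one UTF-16 code unit ([char]);
    # emoji above U+FFFF take two.
    $samples = '嗨', '好', '😀'
    $lengths = foreach ($s in $samples) {
        $s.Length
    }
    $lengths -join ' '
    # 1 1 2
    ```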
    

    Here it is using UTF-32 encoding, where every character is 4 bytes long. The loop converts every 4 bytes into an int32 and prints it as hex.

    $emoji = '😀😁😂😃😄😅😆'
    $utf32 = [System.Text.Encoding]::utf32.GetBytes($emoji)
    
    for($i = 0; $i -lt $utf32.count; $i += 4) {
        $int32 = [bitconverter]::ToInt32($utf32[$i..($i+3)],0)
        $int32 | % tostring x
    }
    
    1f600
    1f601
    1f602
    1f603
    1f604
    1f605
    1f606
    

    Or going the other way, from int32 to string. Simply casting the int32 to [char] does not work, because a [char] holds only 16 bits; the code point has to become a pair of [char]s, which ConvertFromUtf32() takes care of. Script reference: https://www.powershellgallery.com/packages/Emojis/0.1/Content/Emojis.psm1

    for ($i = 0x1f600; $i -le 0x1f606; $i++ ) { [System.Char]::ConvertFromUtf32($i) }
    
    😀
    😁
    😂
    😃
    😄
    😅
    😆
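
    For illustration, here is what that conversion amounts to when done by hand: the inverse of the surrogate formula used earlier. The variable names are made up for this sketch.

    ```powershell
    # Sketch: split a code point above U+FFFF into its surrogate pair.
    $codePoint = 0x1F600
    $offset = $codePoint - 0x10000
    $high = 0xD800 + ($offset -shr 10)      # top 10 bits of the offset
    $low  = 0xDC00 + ($offset -band 0x3FF)  # bottom 10 bits of the offset

    '{0:x} {1:x}' -f $high, $low            # d83d de00

    # Joining the two [char]s gives the same string as ConvertFromUtf32()
    $manual  = -join [char[]]($high, $low)
    $builtin = [char]::ConvertFromUtf32($codePoint)
    $manual.Equals($builtin)                # True
    ```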
    

    See also How to encode 32-bit Unicode characters in a PowerShell string literal?

    EDIT:

    PowerShell 7 has a nice EnumerateRunes() method:

    $emojis = '😀😁😂😃😄😅😆'
    $emojis.enumeraterunes() | % value | % tostring x
    
    1f600
    1f601
    1f602
    1f603
    1f604
    1f605
    1f606
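
    Where EnumerateRunes() is not available (for example Windows PowerShell 5.1), a possible alternative is System.Globalization.StringInfo. This is a sketch, not something from the original answer. It assumes simple one-code-point emoji; what counts as a "text element" differs between .NET versions, so multi-code-point emoji (skin-tone modifiers, joined sequences) may be grouped differently and should be checked on the target version.

    ```powershell
    # Sketch: split a string into text elements with StringInfo.
    $emojis = '😀😁😂'
    $enumerator = [System.Globalization.StringInfo]::GetTextElementEnumerator($emojis)

    $elements = while ($enumerator.MoveNext()) {
        $enumerator.GetTextElement()   # current text element as a string
    }

    foreach ($element in $elements) {
        # Code point of the first (here: only) character, in hex
        [char]::ConvertToUtf32($element, 0).ToString('x')
    }
    # 1f600
    # 1f601
    # 1f602
    ```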