powershell character-encoding

How to pass UTF-8 characters to clip.exe with PowerShell without conversion to another charset?


I'm a Windows and PowerShell newbie coming from Linux land. I used to have a little Bash function in my .bashrc that would copy a "shruggie" (¯\_(ツ)_/¯) to the clipboard for me so that I could paste it into conversations on Slack and such.

My Bash alias looked like this: alias shruggie='printf "¯\_(ツ)_/¯" | xclip -selection c && echo "¯\_(ツ)_/¯"'

I realize that this question is juvenile, but the answer does have value to me, as I'm sure that I will need to pipe odd UTF-8 characters to output in a PowerShell script at some point in the future.

I wrote this function in my PowerShell profile:

function shruggie() {
  '¯\_(ツ)_/¯' | clip
  Write-Host '¯\_(ツ)_/¯ copied to clipboard.' -foregroundcolor yellow
}

However, this gives me: ??\_(???)_/?? (Unknown UTF-8 chars are converted to ?) when I call it on the command line.

I've looked at [System.Text.Encoding]::UTF8 and some other questions but I don't know how to cast my string as UTF-8 and pass that through clip.exe and receive UTF-8 out on the other side (on the clipboard).


Solution

  • There are two distinct, independent aspects:

    Prerequisite: PowerShell must properly recognize your source code's encoding in order for the solutions below to work: if your source code is UTF-8-encoded, be sure to save the enclosing script file(s) as UTF-8 with a BOM, so that Windows PowerShell recognizes the encoding.
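
    As a one-time fix, the profile can be re-saved with a BOM from PowerShell itself; this is a sketch, assuming Windows PowerShell, where `-Encoding UTF8` writes a UTF-8 BOM:

    ```powershell
    # Re-save the profile as UTF-8 with a BOM.
    # In Windows PowerShell, Set-Content -Encoding UTF8 writes a BOM;
    # in PowerShell Core, use -Encoding utf8BOM instead.
    $text = Get-Content -Raw -Path $PROFILE
    Set-Content -Path $PROFILE -Value $text -Encoding UTF8
    ```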


    Copying ¯\_(ツ)_/¯ to the clipboard, using clip.exe:

    function shruggie() {
      # Make PowerShell encode what it pipes to external programs as
      # UTF-16LE ("Unicode") without a BOM, which is what clip.exe expects.
      # .psobject.BaseObject works around a bug in Windows PowerShell v5.1,
      # where the assigned encoding object is otherwise wrapped.
      $OutputEncoding = (New-Object System.Text.UnicodeEncoding $False, $False).psobject.BaseObject
      '¯\_(ツ)_/¯' | clip
      Write-Verbose -Verbose "Shruggie copied to clipboard." # see section about console output
    }
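
    To sanity-check the result from the same session, the clipboard can be read back; a sketch, assuming PowerShell v5+, where Get-Clipboard is available:

    ```powershell
    shruggie   # assumes the function above has been defined

    # Get-Clipboard (PSv5+) reads the clipboard back as text;
    # the comparison should yield True if the round trip worked.
    (Get-Clipboard) -eq '¯\_(ツ)_/¯'
    ```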
    

    Writing ¯\_(ツ)_/¯ to the console:

    Note: PowerShell Core on Unix platforms generally uses consoles (terminals) with a default encoding of (BOM-less) UTF-8, so no additional work is needed there.

    To merely echo (print) Unicode characters (beyond the 8-bit range), it is sufficient to switch to a font that can display Unicode characters (beyond the extended ASCII range), because, as PetSerAl points out, PowerShell uses the Unicode version of the WriteConsole Windows API function to print to the console.

    To support (most) Unicode characters, you must switch to one of the "TT" (TrueType) fonts.

    PetSerAl points out in a comment that console windows on Windows are currently limited to a single 16-bit code unit per output character (cell); given that only (most of) the characters in the BMP (Basic Multilingual Plane) are self-contained 16-bit code units, the (rare) characters beyond the BMP cannot be represented.
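
    The 16-bit-code-unit limitation can be illustrated from PowerShell itself, using the real [char]::ConvertFromUtf32 method:

    ```powershell
    # ツ (U+30C4) is in the BMP: a single UTF-16 code unit.
    'ツ'.Length                               # -> 1

    # A character beyond the BMP, e.g. U+1F600, requires a surrogate pair:
    # two UTF-16 code units, which a single console cell cannot hold.
    [char]::ConvertFromUtf32(0x1F600).Length  # -> 2
    ```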

    Sadly, even that may not be enough for some (BMP) Unicode characters, given that the Unicode standard is versioned and font representations / implementations may lag.

    Indeed, as of Windows 10 release ID 1703, only a select few fonts can render ツ (KATAKANA LETTER TU, U+30C4; UTF-8: E3 83 84).


    Note that if you want to (also) change how other applications interpret such output, you must again set $OutputEncoding:

    For instance, to make PowerShell expect UTF-8 input from external utilities as well as output UTF-8-encoded data to external utilities, use the following:

    $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
    

    The above implicitly changes the code page to 65001 (UTF-8), as reflected in chcp (chcp.com).
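
    The current state can be inspected as follows; [console]::InputEncoding / [console]::OutputEncoding are real .NET console properties, and chcp is the standard Windows utility:

    ```powershell
    # Show the console's input/output encodings and PowerShell's pipeline encoding.
    [console]::InputEncoding.WebName    # e.g. 'utf-8' after the assignment above
    [console]::OutputEncoding.WebName
    $OutputEncoding.WebName             # encoding used when piping to external programs
    chcp                                # Windows: reports the active code page, e.g. 65001
    ```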

    Note that, for backward compatibility, Windows console windows still default to the single-byte, extended-ASCII legacy OEM code page, such as 437 on US-English systems.

    Unfortunately, as of v6.0.0-rc.2, this also applies to PowerShell Core, even though it has otherwise switched to BOM-less UTF-8 as the default encoding, as also reflected in $OutputEncoding.