powershell, character-encoding, windows-10, youtube-dl

PowerShell string variable with UTF-8 encoding


I checked many related questions about this, but I couldn't find anything that solves my problem. Basically, I want to store a UTF-8 encoded string in a variable and then use that string as a file name.

For example, I'm trying to download a YouTube video. If we print the video title, the non-English characters show up (ytd here is youtube-dl):

./ytd https://www.youtube.com/watch?v=GWYndKw_zbw -e

Output: [LEEPLAY] 시티팝 입문 City Pop MIX (Playlist)

But if I store this in a variable and print it, the Korean characters are missing:

$vtitle= ./ytd https://www.youtube.com/watch?v=GWYndKw_zbw -e

$vtitle

Output: [LEEPLAY] City Pop MIX (Playlist)


Solution

  • For a comprehensive overview of how PowerShell interacts with external programs, which includes sending data to them, see this answer.

    When PowerShell interprets output from external programs (such as ytd in your case), it assumes that the output uses the character encoding reflected in [Console]::OutputEncoding.

    Note:

    If the encoding reported by [Console]::OutputEncoding is not the same encoding used by the external program at hand, PowerShell misinterprets the output.
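    The mismatch can be reproduced without any external program. The following is a minimal sketch, not part of the original answer: it encodes three Hangul syllables as UTF-8 and then decodes the same bytes twice. ISO-8859-1 is used here only as a stand-in for a non-UTF-8 console encoding (an assumption; the actual default on a given machine is typically an OEM code page), and the characters are built from code points so the result does not depend on how the script file itself is saved.

    ```powershell
    # Build the sample text from code points (U+C2DC, U+D2F0, U+D31D)
    # so the script file's own encoding does not matter.
    $text = -join [char[]](0xC2DC, 0xD2F0, 0xD31D)

    # What a UTF-8-emitting program would write to its stdout.
    $bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

    # Decoding with the wrong encoding (stand-in: ISO-8859-1) yields one
    # unrelated character per byte instead of the original text.
    $wrong = [System.Text.Encoding]::GetEncoding('iso-8859-1').GetString($bytes)

    # Decoding with the matching encoding round-trips the text.
    $right = [System.Text.Encoding]::UTF8.GetString($bytes)

    "Bytes:         $($bytes.Length)"
    "Wrong decode:  $($wrong.Length) characters"
    "Right decode:  $($right.Length) characters, equal to original: $($right -eq $text)"
    ```

    The exact symptom (garbled characters, question marks, or characters missing from the display) depends on the encoding and console font involved, but the cause is the same decoding mismatch.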

    To fix that, you must (temporarily) set [Console]::OutputEncoding to match the encoding used by the external program.

    For instance, let's assume an executable foo.exe that outputs UTF-8-encoded text:

    # Save the current encoding and switch to UTF-8.
    $prev = [Console]::OutputEncoding
    [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
    
    # PowerShell now interprets foo's output correctly as UTF-8-encoded,
    # so $output will correctly contain CJK characters.
    $output = foo https://example.org -e
    
    # Restore the previous encoding.
    [Console]::OutputEncoding = $prev
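    The save, switch, and restore steps above can also be wrapped so that the previous encoding is restored even if the external program fails. This is only a sketch: Invoke-WithUtf8Output is a hypothetical helper name, not something from the original answer or from PowerShell itself, and it assumes the wrapped program emits UTF-8.

    ```powershell
    # Hypothetical helper: runs a script block with [Console]::OutputEncoding
    # temporarily set to UTF-8 and restores the previous value afterwards.
    function Invoke-WithUtf8Output {
        param(
            [Parameter(Mandatory)]
            [scriptblock] $ScriptBlock
        )

        $prev = [Console]::OutputEncoding
        [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
        try {
            # Output of the script block is passed through to the caller.
            & $ScriptBlock
        }
        finally {
            # Runs even if the script block throws.
            [Console]::OutputEncoding = $prev
        }
    }

    # Usage with the command from the question (assumes ytd emits UTF-8):
    # $vtitle = Invoke-WithUtf8Output { ./ytd https://www.youtube.com/watch?v=GWYndKw_zbw -e }
    ```

    The try/finally is the only behavioral difference from the inline version: the restore step is no longer skipped when the call in the middle raises a terminating error.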
    

    Important:


    With the specific program at hand, youtube-dl, js2010 has discovered that capturing in a variable works without extra effort if you pass --encoding utf-16.

    The reason this works is that the resulting UTF-16LE-encoded output is preceded by a BOM (Byte-Order Mark).

    (Note that --encoding utf-8 does not work, because youtube-dl then does not emit a BOM.)

    Windows PowerShell is capable of detecting and properly decoding UTF-16LE-encoded and UTF-8-encoded text irrespective of the effective [Console]::OutputEncoding IF AND ONLY IF the output is preceded by a BOM.
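    For reference, the BOMs in question are short fixed byte sequences at the very start of the output. The sketch below is not from the original answer; it only uses .NET's GetPreamble() to print them and shows nothing about how youtube-dl writes its output.

    ```powershell
    # UTF-16LE BOM, as reported by .NET's built-in "Unicode" encoding.
    $utf16Bom = ([System.Text.Encoding]::Unicode.GetPreamble() |
        ForEach-Object { $_.ToString('X2') }) -join ' '

    # UTF-8 BOM, as reported by the static UTF8 encoding instance.
    $utf8Bom = ([System.Text.Encoding]::UTF8.GetPreamble() |
        ForEach-Object { $_.ToString('X2') }) -join ' '

    # A UTF8Encoding created with the default constructor has no BOM.
    $noBomLength = [System.Text.UTF8Encoding]::new().GetPreamble().Length

    "UTF-16LE BOM: $utf16Bom"                          # FF FE
    "UTF-8 BOM:    $utf8Bom"                           # EF BB BF
    "UTF-8 without BOM, preamble length: $noBomLength" # 0
    ```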

    Caveats: