Tags: string, powershell, utf-8, oledbconnection

Convert a string in PowerShell (in Europe) to UTF-8


For a REST call I need the German "Stück" in UTF-8, as read from an Access database with

$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$filename;Persist Security Info=False;")

and try to convert it. I have found out that the PowerShell ISE seems to encode string constants in ANSI. So I tried the following minimal test, without the database, and got the same result:

$Text1 = "Stück" # entered via ISE, this is also what I get from the database
# ($StringFromDatabase -eq $Text1) shows $true

$enc = [System.Text.Encoding]::GetEncoding(1252).GetBytes($Text1)
# also tried [System.Text.Encoding]::GetEncoding("ISO-8859-1") # = 28591

$Text1 = [System.Text.Encoding]::UTF8.GetString($enc)

$Text1
$Text1 = "Stück" # = UTF-8, entered here with Notepad++, encoding set to UTF-8
"must see: $Text1"

So I get two outputs: the converted one shows "St?ck", but I need to see "Stück".


Solution

  • "PowerShell ISE seems to encode string constants in ANSI."

    That only applies when communicating with external programs, whereas you're using in-process .NET APIs.

    As an aside: this discrepancy with regular console windows, which use the active OEM code page, is one of the reasons that make the obsolescent ISE problematic - see the bottom section of this answer for more information.

    String literals in memory are always .NET strings, which are UTF-16-encoded (composed of 16-bit Unicode code units), capable of representing all Unicode characters.[1]
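    To illustrate (a minimal sketch): because the in-memory string is already a proper Unicode string, no conversion is needed; bytes and encodings only come into play when the string leaves the process. The original attempt fails because it decodes Windows-1252 bytes *as if* they were UTF-8:

    ```powershell
    # In-memory .NET strings are UTF-16; "ü" is a single [char] here.
    $Text1 = 'Stück'
    $Text1.Length                                       # 5 characters

    # Bytes only matter at process boundaries; in UTF-8, "ü" becomes 2 bytes:
    [System.Text.Encoding]::UTF8.GetBytes($Text1).Count # 6 bytes

    # The original attempt: encode as Windows-1252, then decode those bytes as UTF-8.
    # Byte 0xFC ("ü") is invalid as UTF-8, so it becomes the Unicode replacement
    # character (U+FFFD), which the console renders as "?".
    $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($Text1)
    [System.Text.Encoding]::UTF8.GetString($bytes)      # -> St?ck
    ```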


    Character encoding in web-service calls (Invoke-RestMethod, Invoke-WebRequest):

    To send UTF-8 strings, specify charset=utf-8 as part of the -ContentType argument; e.g.:

    Invoke-RestMethod -ContentType 'text/plain; charset=utf-8' ...
    
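    For example (the URI and body below are hypothetical placeholders):

    ```powershell
    $body = 'Stück'
    # charset=utf-8 in -ContentType makes PowerShell UTF-8-encode the string body.
    Invoke-RestMethod -Uri 'https://example.org/api/items' -Method Post `
      -ContentType 'text/plain; charset=utf-8' -Body $body
    ```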

    On receiving strings, PowerShell automatically decodes them based either on a charset field (character encoding) explicitly specified in the response's Content-Type header or, in its absence, using ISO-8859-1 (which is closely related to, but in effect a subset of, Windows-1252).
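    If a server sends UTF-8 text but omits the charset field, the response is therefore mis-decoded as ISO-8859-1. Assuming that scenario, you can repair the string after the fact by round-tripping through the original bytes:

    ```powershell
    # Hypothetical scenario: UTF-8 text mis-decoded as ISO-8859-1,
    # so "Stück" arrived as "StÃ¼ck".
    $misDecoded = 'StÃ¼ck'

    # Recover the raw bytes using the encoding that was (wrongly) applied...
    $bytes = [System.Text.Encoding]::GetEncoding(28591).GetBytes($misDecoded)

    # ...then re-decode them as what they really are: UTF-8.
    [System.Text.Encoding]::UTF8.GetString($bytes)   # -> Stück
    ```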


    Character encoding when communicating with external programs:

    If you need to send a string with a particular encoding to an external program (via the pipeline, which the target program receives via stdin), set the $OutputEncoding preference variable to that encoding, and PowerShell will automatically convert your .NET strings to the specified encoding.

    To send UTF-8-encoded strings to external programs via the pipeline:

    $OutputEncoding = [System.Text.UTF8Encoding]::new()
    

    Note, however, that this alone isn't sufficient to correctly receive UTF-8 output from external programs; for that, you must also set [Console]::OutputEncoding to the same encoding.

    To make your PowerShell session fully UTF-8-aware (irrespective of whether in the ISE or a regular console window):

    # Needed in the ISE only:
    chcp >$null # Dummy console-program call that ensures that a console is allocated.
    
    # Set all encodings relevant to communicating with external programs to UTF-8.
    $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding =
      [System.Text.UTF8Encoding]::new()
    

    See this answer for more information.


    [1] Note, however, that Unicode characters with a code point greater than 0xFFFF, i.e. those outside the so-called BMP (Basic Multilingual Plane), must be represented with two 16-bit code units ([char]), namely so-called surrogate pairs.
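
    For example, an emoji such as U+1F600 occupies two [char] instances (note that the console/font may not render it):

    ```powershell
    # U+1F600 lies outside the BMP, so it needs two UTF-16 code units:
    $s = [char]::ConvertFromUtf32(0x1F600)
    $s.Length                          # 2 ([char] instances - a surrogate pair)
    [char]::IsHighSurrogate($s[0])     # True
    [char]::IsLowSurrogate($s[1])      # True
    ```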