python powershell encoding uipath windows-1252

Why does my string containing the "é" character get output as "Ú"?


Here's the situation:

When I run my "script.py" file in any PowerShell session on my computer, the output is "Cédric", but when I run the script through UiPath, the output is "CÚdric". I understand that the issue is somehow related to encoding.

After some research, I found that running the PowerShell command [System.Text.Encoding]::Default.EncodingName gives different results in the two environments:

I found out that "é" is encoded as the byte E9 (hex) in Windows-1252, whereas in CP850 the byte E9 is "Ú". So I guess this is the encoding relationship I'm looking for. However, I tried many things in UiPath (C#) and with PowerShell commands, and nothing resolved my problem. (I tried both changing the encoding settings and converting the string to bytes to change the output encoding.)
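The byte-level relationship described above can be checked with a minimal Python sketch. It only uses the standard `cp1252` and `cp850` codecs; the variable names are illustrative and not taken from the original script.

```python
# The single byte 0xE9, as Python would emit it for "é" under Windows-1252.
raw = bytes([0xE9])

# Decoding the same byte with two different code pages gives two characters.
as_cp1252 = raw.decode("cp1252")  # 'é'
as_cp850 = raw.decode("cp850")    # 'Ú'

# For comparison: in CP850, "é" lives at a different byte value.
e_acute_in_cp850 = "é".encode("cp850")  # b'\x82'

print(as_cp1252, as_cp850, e_acute_in_cp850)
```

So the byte itself never changes; only the code page used to interpret it does.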

And to anticipate some questions:

TL;DR: Basically, the issue occurs when UiPath interprets the output of the PowerShell console that runs the Python script.

I've been stuck on this for 3 days now, only to get 2% more precise on the project I work on (which is otherwise completely fine); so it's not worth the time I'm spending on it, but I need to know.


Solution

  • As for [System.Text.Encoding]::Default: That you're seeing UTF-8 as the value in UiPath implies that it is using PowerShell (Core) 7+ (pwsh.exe), the modern, install-on-demand, cross-platform edition built on .NET 5+, whereas Windows PowerShell (powershell.exe), the legacy, ships-with-Windows, Windows-only edition, is built on .NET Framework. In .NET 5+, [System.Text.Encoding]::Default always reports UTF-8, whereas in .NET Framework it reflects the system's active ANSI code page.

    Note:


    You can avoid the need for this configuration by configuring Windows to use UTF-8 system-wide, as described in this answer; doing so sets both the active OEM and the active ANSI code page to 65001, i.e. UTF-8.
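To see which code pages are currently active, and so whether the system-wide UTF-8 setting has taken effect, something like the following sketch could be used. It assumes the Win32 functions `GetACP` and `GetOEMCP` reached through `ctypes`; the helper name `active_code_pages` is hypothetical, and on non-Windows platforms it simply returns `None`.

```python
import sys


def active_code_pages():
    """Return (ansi_code_page, oem_code_page) on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; ctypes.windll exists only on Windows

    kernel32 = ctypes.windll.kernel32
    # GetACP() reports the active ANSI code page, GetOEMCP() the active OEM one.
    return kernel32.GetACP(), kernel32.GetOEMCP()


pages = active_code_pages()
print(pages)
```

With system-wide UTF-8 enabled, both numbers should be 65001; on a typical Western European setup without it, they would be expected to be 1252 and 850.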


    [1] PowerShell-native commands and scripts, which run in-process, consistently communicate text via in-memory Unicode strings, due to using .NET strings, so no encoding problems can arise.
    When it comes to reading files, Windows PowerShell defaults to the ANSI code page when reading source code and text files with Get-Content, whereas PowerShell (Core) 7+ now - commendably - consistently defaults to UTF-8, also with respect to what encoding is used to write files - see this answer for more information.

    [2] Specifically, Python outputs byte 0xE9, meaning it to be the character é, due to using the Windows-1252 encoding. PowerShell misinterprets this byte as referring to the character Ú, because it decodes the byte as CP850, as reflected in [Console]::OutputEncoding. Compare [Text.Encoding]::GetEncoding(1252).GetString([byte[]] 0xE9) (-> é, whose Unicode code point is 0xE9 too, because Unicode is mostly a superset of Windows-1252) to [Text.Encoding]::GetEncoding(850).GetString([byte[]] 0xE9) (-> Ú, whose Unicode code point is 0xDA).
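The mismatch in [2] can be reproduced for the whole string with a minimal Python sketch. It encodes with one code page and decodes with the other, then reverses the mistake; the variable names are illustrative only.

```python
original = "Cédric"

# What Python writes when its output encoding is Windows-1252.
emitted_bytes = original.encode("cp1252")

# What a reader sees if it decodes those bytes as CP850 instead.
misread = emitted_bytes.decode("cp850")

# Reversing the mistake: re-encode with the wrong code page,
# then decode with the one the writer actually used.
repaired = misread.encode("cp850").decode("cp1252")

print(misread, repaired)
print(hex(ord("é")), hex(ord("Ú")))
```

Repairing after the fact like this only works when every misread character can be mapped back to its original byte, so making both sides agree on one encoding is the more robust fix.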

    [3] This applies when its stdout / stderr streams are connected to something other than a console, such as when their output is captured by PowerShell.
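To illustrate [3], the sketch below shows which bytes a text stream hands to whatever captures it, depending on the stream's encoding. It stands in for a redirected stdout by wrapping an in-memory buffer; the helper name `bytes_written` is hypothetical.

```python
import io


def bytes_written(text, encoding):
    """Return the bytes a text stream with the given encoding emits for text."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    stream.write(text)
    stream.flush()
    data = buffer.getvalue()
    stream.detach()  # keep the wrapper from closing the buffer on cleanup
    return data


print(bytes_written("Cédric", "cp1252"))  # b'C\xe9dric'
print(bytes_written("Cédric", "utf-8"))   # b'C\xc3\xa9dric'
```

In a real script, the encoding of the redirected stream can be inspected via `sys.stdout.encoding` and changed with `sys.stdout.reconfigure(encoding="utf-8")` (Python 3.7+) or the `PYTHONIOENCODING` environment variable. Whichever is chosen, the capturing side has to decode with the same encoding.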