python python-3.x powershell io-redirection zero-width-space

Python Script Called in PowerShell Fails to Write to Stdout When Piped to File


So I'm attempting to chain a couple of scripts together, some in PowerShell (5.1), some in Python (3.7).

The script that I am having trouble with is written in python, and writes to stdout via sys.stdout.write(). This script reads in a file, completes some processing, and then outputs the result.

When this script is called by itself, with its output not sent to any pipe, it executes properly and writes to the standard PowerShell console. However, as soon as I attempt to pipe the output in any fashion, I start to get errors.

In particular, two input files contain the character \u200b, a zero-width space. Printing these characters to the console works fine, but attempts to redirect the output to a file via a variety of methods:

py ./script.py input.txt > output.txt
py ./script.py input.txt | Set-Content -Encoding utf8 output.txt
Start-Process powershell -RedirectStandardOutput "output.txt" -Argumentlist "py", "./script.py", "input.txt"
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'

all fail with:

File "\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 61: character maps to <undefined>

On the Python side, modifying the script to remove all non-UTF-8 characters also causes the script to fail, so I am a bit stuck. My current guess is that piping the output makes Python pick up a different environment, though I am not sure how to adjust that from within the Python code.

For completeness' sake, here is the function writing the output. (Note: file_lines is a list of strings):

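import sys

def write_lines(file_lines):
    for line in file_lines:
        # Wrap every entry of the line in double quotes.
        quoted = ['"' + entry + '"' for entry in line]
        # Join the entries, leaving a trailing comma after each one.
        line = "".join(entry + "," for entry in quoted)
        # This write is where the UnicodeEncodeError surfaces when redirected.
        sys.stdout.write(line + "\n")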

Solution

  • The root cause is the way Python handles stdout. At startup Python does some low-level detection to pick an encoding, then wraps the stream in an io.TextIOWrapper configured with that encoding; that wrapper is what you get as sys.stdout (sys.stderr and sys.stdin are set up the same way).

    Now, this detection yields UTF-8 when stdout is attached to the PowerShell console. When the output is piped or redirected, stdout is no longer a console, and Python falls back to the system's encoding, which on many Windows installs is cp1252 (a.k.a. Windows-1252). That codec has no mapping for \u200b, hence the UnicodeEncodeError.

    system <(cp1252)> posh <(utf-8)> python # here stdout returns to the shell
    system <(cp1252)> posh <(utf-8)> python <(cp1252)> pipe| or redirect> # here stdout moves directly to the next program
    

    As for your issue, without seeing the rest of your program and the input file, my best guess is an encoding mismatch, most likely when reading the input file. Unless you pass encoding= to open(), Python 3 reads text files using the locale's preferred encoding, which on Windows is typically the same system code page rather than UTF-8. If the file is in some other encoding, the best case is garbage text and the worst case is an encoding exception.

    To solve it you need to know which encoding your input file was created with. That can be tricky, and automatic detection is usually slow. Another option is to work with the files as bytes, though that may not be possible depending on the processing involved.
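
The encoding mismatch described above can be reproduced without PowerShell. The following is a minimal sketch, not the asker's script: encode_via_stream and force_utf8_stdout are hypothetical helper names, and it assumes Python 3.7 or later, where sys.stdout.reconfigure is available. The first helper writes text through an io.TextIOWrapper (the same kind of object as sys.stdout) with a chosen encoding; the second switches stdout to UTF-8 regardless of what Python detected.

```python
import io
import sys

ZWSP = "\u200b"  # zero-width space, the character named in the traceback


def encode_via_stream(text, encoding):
    """Write text through a TextIOWrapper and return the bytes it produced."""
    buffer = io.BytesIO()
    # newline="\n" disables newline translation so the bytes are predictable.
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="\n")
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


def force_utf8_stdout():
    """Re-encode sys.stdout as UTF-8 (assumes Python 3.7+ for reconfigure)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


if __name__ == "__main__":
    sample = "a" + ZWSP + "b\n"

    # cp1252 has no mapping for U+200B, so this raises UnicodeEncodeError.
    try:
        encode_via_stream(sample, "cp1252")
    except UnicodeEncodeError as exc:
        sys.stderr.write("cp1252 failed: " + str(exc) + "\n")

    # With UTF-8 forced, the same text can be written to a redirected stdout.
    force_utf8_stdout()
    sys.stdout.write(sample)
```

Calling force_utf8_stdout() once at the top of the script, before the first write, would be the natural place for it. An alternative that needs no code change is setting the PYTHONIOENCODING environment variable to utf-8 before running the script. Note that Windows PowerShell may itself decode a native program's piped output using [Console]::OutputEncoding, so that setting might also need to be UTF-8 for the characters to survive; treat this as something to verify on your system.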