Tags: powershell, encoding, cmd, pipe

Different behaviour and output when piping in CMD and PowerShell


I am trying to pipe the content of a file to a simple ASCII symmetrical encryption program I made. It reads input from STDIN and adds a certain value (224) to, or subtracts it from, each byte of the input. For example: if the first byte is 4 and we want to encrypt, then it becomes 228. If the result falls outside the range 0-255, the program wraps it around (modulo 256).

This is the output I get with cmd (test.txt contains "this is a test"):

    type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt
    this is a test

It also works the other way around, which is why it is a symmetrical encryption algorithm:

    type .\test.txt | .\Crypt.exe --decrypt | .\Crypt.exe --encrypt
    this is a test

But, the behaviour on PowerShell is different. When encrypting first, I get:

    type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt
    this is a test_*

And that is what I get when decrypting first:

[Screenshot: PowerShell output when decrypting first]

Maybe it is an encoding problem. Thanks in advance.


Solution

  • tl;dr:


    For raw byte handling in Windows PowerShell and PowerShell v7.3-, shell out to cmd with /c (on Windows; on Unix-like platforms / Unix-like Windows subsystems, use sh or bash with -c):

    cmd /c 'type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt'
    

    Use a similar technique to save raw byte output in a file - do not use PowerShell's > operator:

    cmd /c 'someexe > file.bin'
    

    Note that if you want to capture an external program's text output in a PowerShell variable or process it further in a PowerShell pipeline, you need to make sure that [Console]::OutputEncoding matches your program's output character encoding (the active OEM code page, typically), which should be true by default in this case; see the next section for details.

    Generally, however, byte manipulation of text data is best avoided.


    There are two separate problems, only one of which has a simple solution:


    Problem 1: There is indeed a character encoding problem, as you suspected:

    PowerShell invisibly inserts itself as an intermediary in pipelines, even when sending data to and receiving data from external programs: It converts data from and to .NET strings (System.String), which are sequences of UTF-16 code units.

    In order to send data to and receive data from external programs (such as Crypt.exe in your case), you need to match their character encoding; in your case, with a Windows console application that uses raw byte handling, the implied encoding is the system's active OEM code page.

    To fix your primary problem, you therefore need to set $OutputEncoding to the active OEM code page:

    # Make sure that PowerShell uses the OEM code page when sending
    # data to `.\Crypt.exe`
    $OutputEncoding = [Console]::OutputEncoding
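
    To see which encodings are in effect in a given session, and why a mismatch is lossy, the following minimal sketch may help. It only reads the two settings discussed above and then shows what .NET's ASCII encoder (the default for $OutputEncoding in Windows PowerShell) does with a non-ASCII character; the variable names $phi and $asciiBytes are illustrative assumptions.

```powershell
# Encoding PowerShell uses to encode text piped TO an external program's stdin.
$OutputEncoding.WebName

# Encoding PowerShell uses to decode text received FROM an external program's stdout.
[Console]::OutputEncoding.WebName

# Why a mismatch matters: ASCII cannot represent characters above U+007F,
# so .NET's ASCII encoder substitutes '?' (byte value 63) for them.
$phi = [string] [char] 0x03C6   # Greek small letter phi
$asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($phi)
$asciiBytes   # a single byte with value 63
```

    If the two names printed first differ, text that PowerShell decoded from one external program is re-encoded differently before it reaches the next one.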
    

    Problem 2: PowerShell invariably appends a trailing newline to data that doesn't already have one when piping data to external programs:

    That is, "foo" | .\Crypt.exe doesn't send (the $OutputEncoding-encoded bytes representing) "foo" to .\Crypt.exe's stdin, it sends "foo`r`n" on Windows; i.e., a (platform-appropriate) newline sequence (CRLF on Windows) is automatically and invariably appended (unless the string already happens to have a trailing newline).

    This problematic behavior is discussed in GitHub issue #5974 and also in this answer.

    In your specific case (decrypting first), the implicitly appended "`r`n" is also subject to the byte-value shifting, which means that the 1st Crypt.exe call transforms it to -*, causing another "`r`n" to be appended when the data is sent to the 2nd Crypt.exe call.

    The net result is an extra newline that is round-tripped (the intermediate -*), plus an inadvertently encrypted newline (which results in φΩ).
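
    The byte arithmetic behind these stray characters can be checked directly. This is a minimal sketch that assumes Crypt.exe simply adds 224 (encrypt) or subtracts 224 (decrypt) modulo 256, as described in the question; Get-ShiftedByte is a hypothetical helper, not part of Crypt.exe or PowerShell.

```powershell
# Hypothetical helper: shifts each byte by $Offset, wrapping around modulo 256.
function Get-ShiftedByte {
    param(
        [byte[]] $Bytes,
        [int] $Offset
    )
    foreach ($b in $Bytes) {
        # PowerShell's % keeps the sign of the dividend, hence the extra "+ 256".
        [byte] ((($b + $Offset) % 256 + 256) % 256)
    }
}

$crlf = [byte[]] (13, 10)   # CR LF, the newline PowerShell appends on Windows

# "Encrypting" CRLF yields bytes 237 and 234; code page 437 shows these as φ and Ω.
$encrypted = Get-ShiftedByte -Bytes $crlf -Offset 224

# "Decrypting" CRLF yields bytes 45 and 42, the ASCII characters - and *.
$decrypted = Get-ShiftedByte -Bytes $crlf -Offset (-224)
$decryptedText = -join ($decrypted | ForEach-Object { [char] $_ })
```

    The φ and Ω rendering applies to code page 437 specifically; other OEM code pages may display bytes 237 and 234 differently.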


    In short: If your input data had no trailing newline, you'll have to cut off the last 4 characters from the result (representing the round-tripped and the inadvertently encrypted newline sequences):

    # Make PowerShell encode data sent to .\Crypt.exe with the same (OEM)
    # code page that it uses to decode the program's output.
    $OutputEncoding = [Console]::OutputEncoding
    
    # Invoke the command and capture its output in variable $result.
    # Note the use of the `Get-Content` cmdlet; in PowerShell, `type`
    # is simply a built-in *alias* for it.
    # PowerShell captures external-program output line by line, so the
    # lines are rejoined into a single string here.
    $result = (Get-Content .\test.txt | .\Crypt.exe --decrypt | .\Crypt.exe --encrypt) -join "`r`n"
    
    # Remove the last 4 chars. and print the result.
    $result.Substring(0, $result.Length - 4)
    

    Given that calling cmd /c as shown at the top of the answer works too, that hardly seems worth it.


    How PowerShell handles pipeline data with external programs:

    Note: The following mostly applies to v7.4+ as well, except where noted. (PowerShell) v7.3- is shorthand for both older PowerShell (Core) versions (7.3.x and below) and Windows PowerShell.

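    Unlike cmd (or POSIX-like shells such as bash), PowerShell does not pass raw bytes through its pipeline when external programs are involved; it only knows text in that case. [2]

    Specifically, this works as follows:

      • When you send data to an external program via the pipeline (to its stdin stream), it is encoded using the character encoding stored in the $OutputEncoding preference variable, which defaults to ASCII in Windows PowerShell and to (BOM-less) UTF-8 in PowerShell (Core).

      • When PowerShell receives data from an external program (via its stdout stream), it is invariably decoded as text using the encoding stored in [Console]::OutputEncoding, which defaults to the system's active OEM code page. [1] The result is .NET strings, i.e. sequences of UTF-16 code units. [3]

    In Windows PowerShell and PowerShell (Core) up to v7.3.x only, the above also applies:

      • when piping data between external programs;

      • when redirecting an external program's output to a file with >.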

    In v7.4+, PowerShell now streams raw bytes in the two scenarios above, which not only improves performance noticeably, but prevents potential data corruption due to the previous as-text interpretation.
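
    A script that has to run on several PowerShell versions could check for this behavior up front. The following is a minimal sketch; Test-RawBytePipeline is a hypothetical helper name, and the only assumption it encodes is the v7.4 threshold stated above.

```powershell
# Hypothetical helper: returns $true if the given PowerShell version is 7.4 or
# later, i.e. a version that streams raw bytes between external programs.
function Test-RawBytePipeline {
    param($Version = $PSVersionTable.PSVersion)
    ($Version.Major -gt 7) -or ($Version.Major -eq 7 -and $Version.Minor -ge 4)
}

if (Test-RawBytePipeline) {
    'This session pipes raw bytes between external programs.'
} else {
    'Use cmd /c (or sh -c) for raw byte pipelines in this session.'
}
```

    Comparing the Major and Minor properties works in both editions, because both the System.Version type used by Windows PowerShell and the semantic-version type used by PowerShell (Core) expose them.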


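    Note that capturing raw byte data from external programs in memory isn't directly possible: on assignment to a variable or on processing via a PowerShell command, the as-text interpretation still invariably applies; the simplest workaround is to save the output to a file first, using the cmd /c technique shown at the top, and then to read that file as a byte array (see [2]).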


    [1] In PowerShell (Core), given that $OutputEncoding commendably already defaults to UTF-8, it would make sense to have [Console]::OutputEncoding be the same - i.e., for the active code page to be effectively 65001 on Windows, as suggested in GitHub issue #7233.

    [2] With input from a file, the closest you can get to raw byte handling is to read the file as a .NET System.Byte array with Get-Content -AsByteStream (PowerShell (Core)) / Get-Content -Encoding Byte (Windows PowerShell), but the only way you can further process such an array is to pipe it to a PowerShell command that is designed to handle a byte array, or to pass it to a .NET type's method that expects a byte array. If you tried to send such an array to an external program via the pipeline, each byte would be sent as its decimal string representation on its own line.
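
    To illustrate the byte-array route mentioned in [2], here is a minimal sketch that performs the question's byte shift entirely with .NET methods, so no pipeline and no text conversion is involved. It is not a replacement for Crypt.exe: the temporary file and the variable names are assumptions made for the example, and the +224 shift modulo 256 is taken from the question's description.

```powershell
# Create a small sample file (a temporary file; the path is illustrative only).
$path = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($path, [System.Text.Encoding]::ASCII.GetBytes('this is a test'))

# Read the raw bytes; no character decoding takes place.
[byte[]] $bytes = [System.IO.File]::ReadAllBytes($path)

# Shift every byte by +224 modulo 256 (the "encrypt" direction from the question).
[byte[]] $shifted = foreach ($b in $bytes) { ($b + 224) % 256 }
[System.IO.File]::WriteAllBytes($path, $shifted)

# Reverse the shift (+32 is the same as -224 modulo 256) and decode as text.
[byte[]] $restored = foreach ($b in [System.IO.File]::ReadAllBytes($path)) { ($b + 32) % 256 }
$text = [System.Text.Encoding]::ASCII.GetString($restored)

Remove-Item $path
```

    The .NET file methods are given a full path here because they resolve relative paths against the process's working directory, which can differ from PowerShell's current location.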

    [3] Unicode is the name of the abstract standard describing a "global alphabet". In concrete use, it has various standard encodings, UTF-8 and UTF-16 being the most widely used.