
How to cat a UTF-8 (no BOM) file properly/globally in PowerShell? (to another file)


Create a file utf8.txt. Ensure the encoding is UTF-8 (no BOM). Set its content to €

In cmd.exe:

type utf8.txt > out.txt

Content of out.txt is €

In PowerShell (v4):

cat .\utf8.txt > out.txt

or

type .\utf8.txt > out.txt

Content of out.txt is â‚¬

How do I globally make PowerShell work correctly?


Solution

  • Note: This answer is about Windows PowerShell (up to v5.1); PowerShell (Core) 7+, the cross-platform edition of PowerShell, now fortunately consistently defaults to BOM-less UTF-8 on both in- and output.


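    Windows PowerShell, unlike the underlying .NET Framework[1], uses the following defaults:

    On input, files without a BOM (byte-order mark) are assumed to use the system's legacy "ANSI" code page (e.g. Windows-1252)[2].
    On output, the > and >> redirection operators create UTF-16LE files (which Windows PowerShell calls "Unicode").

    Both behaviors can be changed: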

    File-consuming and -producing cmdlets do usually support an -Encoding parameter that lets you specify the encoding explicitly.
    Prior to Windows PowerShell v5.1, using the underlying Out-File cmdlet explicitly was the only way to change the encoding.
    In Windows PowerShell v5.1+, > and >> became effective aliases of Out-File, allowing you to change the encoding behavior of > and >> via the $PSDefaultParameterValues preference variable; e.g.:
    $PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'.
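
    As a rough illustration of the preference-variable approach, the sketch below sets the Out-File default, writes a test character with >, and prints the leading bytes of the result so you can check which encoding was actually used. The file name out.txt and the variables $euro and $bytes are illustrative only, and the redirection picks up the default only in v5.1+, as described above.

```powershell
# Make Out-File (and therefore > and >> in v5.1+) default to UTF-8.
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'

# Build the euro sign from its code point so that the script itself
# does not depend on how its source file is encoded.
$euro = [char]0x20AC

# Write via redirection; out.txt is an illustrative file name.
$euro > out.txt

# Print the leading bytes in hex to inspect the result.
# In Windows PowerShell, expect a UTF-8 BOM (EF BB BF) before E2 82 AC.
$bytes = [System.IO.File]::ReadAllBytes((Convert-Path out.txt))
($bytes | Select-Object -First 6 | ForEach-Object { $_.ToString('X2') }) -join ' '
```

    Setting the value in your $PROFILE makes it apply to every session; setting it inside a script limits it to that script's scope.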

    For Windows PowerShell to handle UTF-8 properly, you must specify it as both the input and output encoding[3], but note that on output, PowerShell invariably adds a BOM to UTF-8 files.

    Applied to your example:

    Get-Content -Encoding utf8 .\utf8.txt | Out-File -Encoding utf8 out.txt
    

    To create a UTF-8 file without a BOM in PowerShell, see this answer.
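
    One common way to avoid the BOM, sketched here under the assumption that calling .NET directly is acceptable, is to bypass Out-File and use System.IO.File with a System.Text.UTF8Encoding instance constructed without a BOM. The file names utf8-sample.txt and out-nobom.txt are made up for this example; the sample input is created first so that the snippet is self-contained.

```powershell
# UTF-8 encoding object; $false means "do not emit a BOM".
$noBom = New-Object System.Text.UTF8Encoding $false

# .NET methods do not follow PowerShell's current location,
# so build full paths explicitly (file names are illustrative).
$dir     = (Get-Location).ProviderPath
$inPath  = Join-Path $dir 'utf8-sample.txt'
$outPath = Join-Path $dir 'out-nobom.txt'

# Create a BOM-less sample input containing the euro sign.
[System.IO.File]::WriteAllText($inPath, [string][char]0x20AC, $noBom)

# Read the input as UTF-8, then write it back out as UTF-8 without a BOM.
$lines = [System.IO.File]::ReadAllLines($inPath, $noBom)
[System.IO.File]::WriteAllLines($outPath, $lines, $noBom)
```

    Note that WriteAllLines ends every line, including the last, with the platform's newline sequence.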


    [1] .NET Framework uses (BOM-less) UTF-8 by default, both for in- and output.
    This - intentional - difference in behavior between Windows PowerShell and the framework it is built on is unusual. The difference went away in PowerShell [Core] v6+: both .NET [Core] and PowerShell [Core] default to BOM-less UTF-8.

    [2] This applies to Get-Content and notably also to source code read by the PowerShell engine. Unfortunately, the default behavior varies across cmdlets; for instance, Import-Csv assumes UTF-8. For an overview of the default character encoding used by all built-in cmdlets in Windows PowerShell, see the bottom section of this answer.

    [3] Cmdlets such as Get-Content do, however, automatically recognize UTF-8 files with a BOM, and so does the PowerShell engine when reading source code.
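
    To check whether a given file carries the UTF-8 BOM that triggers this automatic recognition, a small helper along the following lines may be useful. Test-Utf8Bom is a hypothetical name, not a built-in cmdlet; it simply compares the first three bytes of the file with EF BB BF.

```powershell
# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
    param(
        [Parameter(Mandatory = $true)]
        [string] $Path
    )

    # Resolve to a full path, because .NET ignores PowerShell's current location.
    $fullPath = Convert-Path -LiteralPath $Path

    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        # Read at most the first three bytes of the file.
        $buffer = New-Object byte[] 3
        $read = $stream.Read($buffer, 0, 3)
    }
    finally {
        $stream.Dispose()
    }

    return ($read -eq 3 -and
            $buffer[0] -eq 0xEF -and
            $buffer[1] -eq 0xBB -and
            $buffer[2] -eq 0xBF)
}
```

    For example, Test-Utf8Bom .\out.txt reports whether out.txt was written with a BOM.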