powershell, variable-expansion, file-encodings

How to expand file content with PowerShell


I want to do this:

$content = get-content "test.html"
$template = get-content "template.html"
$template | out-file "out.html"

where template.html contains

<html>
  <head>
  </head>
  <body>
    $content
  </body>
</html>

and test.html contains:

<h1>Test Expand</h1>
<div>Hello</div>

I get weird characters as the first two characters of out.html:

    ��

and content is not expanded.

How do I fix this?


Solution

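  • To complement Mathias R. Jessen's helpful answer, here is a solution that explicitly reads the input files as UTF-8, each as a single string, and writes the output file as UTF-8 without a BOM: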

    # Explicitly read the input files as UTF-8, as a whole.
    $content  = Get-Content -Raw -Encoding utf8 test.html
    $template = Get-Content -Raw -Encoding utf8 template.html
    
    # Write to output file using UTF-8 encoding *without a BOM*.
    [IO.File]::WriteAllText(
      "$PWD/out.html",
      $ExecutionContext.InvokeCommand.ExpandString($template)
    )
    

    Finally, the obligatory security warning: use this expansion technique only on input that you trust, given that arbitrary embedded commands may get executed.
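
To make the warning concrete, here is a minimal sketch showing that ExpandString evaluates embedded subexpressions such as $(...), contrasted with plain literal replacement, which evaluates nothing. The {{content}} placeholder and the variable names are hypothetical and not part of the original template; treat this as an illustration rather than a drop-in replacement.

```powershell
# A template string as an untrusted party might supply it. Single quotes ensure
# that nothing is expanded at assignment time.
$untrusted = 'Sum: $(1 + 2)'

# ExpandString evaluates the embedded subexpression, i.e. it runs code.
$expanded = $ExecutionContext.InvokeCommand.ExpandString($untrusted)

# Hypothetical alternative: a literal placeholder replaced via String.Replace(),
# which performs no evaluation at all.
$content  = '<h1>Test Expand</h1>'
$template = '<body>{{content}}</body>'
$safe     = $template.Replace('{{content}}', $content)

$expanded   # Sum: 3
$safe       # <body><h1>Test Expand</h1></body>
```

If the template only ever needs a fixed set of values inserted, literal replacement of this kind avoids the risk entirely, at the cost of losing general variable expansion.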


    Optional background information

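    In Windows PowerShell, Out-File, > and >> use UTF-16 LE character encoding with a BOM (byte-order mark) by default; the BOM is what shows up as the "weird characters" at the start of out.html. (PowerShell 6 and later default to BOM-less UTF-8 instead.)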
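
To check which BOM, if any, a file starts with, one can inspect its first bytes. The helper below is a rough sketch: Get-BomName is a hypothetical name, and it only distinguishes UTF-8 and UTF-16 marks (a UTF-32 LE BOM would be misreported as UTF-16 LE). The sample bytes are built in memory so that the result does not depend on the PowerShell version in use.

```powershell
# Hypothetical helper: name the BOM, if any, at the start of a byte array.
function Get-BomName {
  param([byte[]] $Bytes)
  # Check the 3-byte UTF-8 mark first, then the 2-byte UTF-16 marks.
  if ($Bytes.Length -ge 3 -and $Bytes[0] -eq 0xEF -and $Bytes[1] -eq 0xBB -and $Bytes[2] -eq 0xBF) { return 'UTF-8' }
  if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFF -and $Bytes[1] -eq 0xFE) { return 'UTF-16 LE' }
  if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFE -and $Bytes[1] -eq 0xFF) { return 'UTF-16 BE' }
  'none'
}

# Sample data: the UTF-16 LE preamble (BOM) followed by the encoded text.
$utf16  = [Text.Encoding]::Unicode
$sample = $utf16.GetPreamble() + $utf16.GetBytes('<html>')

Get-BomName $sample                                      # UTF-16 LE
Get-BomName ([Text.Encoding]::UTF8.GetBytes('<html>'))   # none
```

For a real file, something like Get-BomName ([IO.File]::ReadAllBytes("$PWD/out.html")) should work, assuming out.html exists in the current directory.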

    While Out-File -Encoding utf8 allows creating UTF-8 output files instead, Windows PowerShell invariably prepends a 3-byte pseudo-BOM (bytes EF BB BF) to the output file, which some utilities, notably those with Unix heritage, have problems with, so you would still get "weird characters" (albeit different ones).
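
The pseudo-BOM can be controlled explicitly when writing through .NET. The following sketch writes the same text twice, once with and once without the BOM, and compares the resulting bytes; the temporary files and variable names are only for illustration.

```powershell
$text = '<h1>Test</h1>'   # 13 ASCII characters, so 13 bytes in UTF-8

# Two throwaway files in the temp directory.
$withBom = [IO.Path]::GetTempFileName()
$noBom   = [IO.Path]::GetTempFileName()

# The UTF8Encoding constructor argument decides whether a BOM is emitted.
[IO.File]::WriteAllText($withBom, $text, (New-Object System.Text.UTF8Encoding $true))
[IO.File]::WriteAllText($noBom,   $text, (New-Object System.Text.UTF8Encoding $false))

$bytesWithBom = [IO.File]::ReadAllBytes($withBom)
$bytesNoBom   = [IO.File]::ReadAllBytes($noBom)

'{0:X2} {1:X2} {2:X2}' -f $bytesWithBom[0..2]   # EF BB BF
$bytesWithBom.Length                            # 16
$bytesNoBom.Length                              # 13

Remove-Item $withBom, $noBom
```

Calling [IO.File]::WriteAllText() without an encoding argument, as in the solution above, behaves like the BOM-less variant.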

    If you want a more PowerShell-like way of creating BOM-less UTF-8 files, see this answer of mine, which defines an Out-FileUtf8NoBom function that otherwise emulates the core functionality of Out-File.

    Conversely, on reading files in Windows PowerShell, you must use Get-Content -Encoding utf8 to ensure that BOM-less UTF-8 files are recognized as such.
    In the absence of the UTF-8 pseudo-BOM, Get-Content in Windows PowerShell assumes that the file uses the single-byte, extended-ASCII encoding specified by the system's legacy codepage (e.g., Windows-1252 on English-language systems, an encoding that PowerShell calls Default).
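
The effect of decoding UTF-8 bytes with a single-byte encoding can be reproduced in memory. This sketch uses ISO-8859-1 as a stand-in for a legacy codepage (an assumption made for portability; for the character shown, Windows-1252 gives the same result) and builds the test character from its code point so that the script file's own encoding does not matter.

```powershell
# U+00E9 (e with acute accent): one character, but two bytes (C3 A9) in UTF-8.
$original  = [string][char]0xE9
$utf8Bytes = [Text.Encoding]::UTF8.GetBytes($original)

# Misreading: a single-byte encoding turns each byte into its own character.
$latin1  = [Text.Encoding]::GetEncoding('iso-8859-1')
$misread = $latin1.GetString($utf8Bytes)

# Correct reading: decode the same bytes as UTF-8.
$correct = [Text.Encoding]::UTF8.GetString($utf8Bytes)

$utf8Bytes.Length               # 2
$misread.Length                 # 2 (U+00C3 followed by U+00A9)
$correct -eq $original          # True
```

This is the same mismatch that occurs when a BOM-less UTF-8 file is read without -Encoding utf8 in Windows PowerShell: ASCII-only content survives, while every non-ASCII character turns into two or more wrong ones.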

    Note that while Windows-only editors such as Notepad create UTF-8 files with the pseudo-BOM (if you explicitly choose to save as UTF-8; default is the legacy codepage encoding, "ANSI"), increasingly popular cross-platform editors such as Visual Studio Code, Atom, and Sublime Text by default do not use the pseudo-BOM when they create files.