I have an XML file that I'm trying to read with PowerShell. However when I read it, the output of some of the XML objects have the following characters in them: ​
I simply downloaded an XML file I needed from a third-party, which opens in Excel. Then I grab the columns I need and paste them into a new Excel Workbook. Then I map the fields with an XML Schema and then export it as an XML file, which I then use for scripting.
In the Excel spreadsheet my data looks clean, but then when I export it and run the PS script, these strange characters appear in the output. The characters even appear in the actual XML file after exporting. What am I doing wrong?
I tried using -Encoding UTF8
, but I'm relatively new to PowerShell and am not sure how to appropriately apply it to my script. Appreciate any help!
PowerShell
$xmlpath = 'Path\To\The\File.xml'
[xml]$xmldata = (Get-Content $xmlpath)
$xmldata.applications.application.name
Example of Output
​ABC_DEF_GHI​.com​​
​JKL_MNO_PQRS​.com​
TUV_WXY_Z.com
AB_CD_EF_GH​.com
This is a prime example of why you shouldn't use the idiom - as convenient as it is.[1] The problem is indeed one of character encoding: your file is UTF-8-encoded, but Windows PowerShell's [xml]$xmldata = (Get-Content $xmlpath)
Get-Content
cmdlet interprets it as ANSI-encoded in the absence of a BOM - this answer explains the encoding part in detail.Thanks, choroba.
Instead, to ensure that the XML file's character encoding is interpreted correctly, use the following:
# Note: If you know that $xmlPath contains a *full*, native path,
# you don't need the Convert-Path call.
($xmlData = [xml]::new()).Load((Convert-Path -LiteralPath $xmlPath))
This delegates interpretation of the character encoding to the System.Xml.XmlDocument.Load
.NET API method, which not only assumes the proper default for XML (UTF-8), but also respects any explicit encoding specification as part of the XML declaration, if present (e.g., <?xml version="1.0" encoding="iso-8859-1"?>
)
See also:
the bottom section of this answer for background information.
GitHub proposal #14505, which proposes introducing a New-Xml
cmdlet that robustly parses XML files.
[1] If you happen to know the encoding of the input file ahead of time, you can get away with using Get-Content
's -Encoding
parameter in your original approach ([xml]$xmldata = Get-Content -Encoding utf8 $xmlpath
, best to add -Raw
to Get-Content
for better performance), but only the .Load()
-based approach is robust by default.