excelxmlpowershell

Strange characters found in XML file and PowerShell output after exporting from Excel: ​


I have an XML file that I'm trying to read with PowerShell. However when I read it, the output of some of the XML objects have the following characters in them: ​

I simply downloaded an XML file I needed from a third-party, which opens in Excel. Then I grab the columns I need and paste them into a new Excel Workbook. Then I map the fields with an XML Schema and then export it as an XML file, which I then use for scripting.

In the Excel spreadsheet my data looks clean, but then when I export it and run the PS script, these strange characters appear in the output. The characters even appear in the actual XML file after exporting. What am I doing wrong?

I tried using -Encoding UTF8, but I'm relatively new to PowerShell and am not sure how to appropriately apply it to my script. Appreciate any help!

PowerShell

$xmlpath = 'Path\To\The\File.xml'

[xml]$xmldata = (Get-Content $xmlpath)

$xmldata.applications.application.name

Example of Output

​ABC_DEF_GHI​.com​​
​JKL_MNO_PQRS​.com​
TUV_WXY_Z.com
AB_CD_EF_GH​.com

Solution

  • This is a prime example of why you shouldn't use the idiom
    [xml]$xmldata = (Get-Content $xmlpath) - as convenient as it is.[1] The problem is indeed one of character encoding: your file is UTF-8-encoded, but Windows PowerShell's Get-Content cmdlet interprets it as ANSI-encoded in the absence of a BOM - this answer explains the encoding part in detail.Thanks, choroba.

    Instead, to ensure that the XML file's character encoding is interpreted correctly, use the following:

    # Note: If you know that $xmlPath contains a *full*, native path,
    #       you don't need the Convert-Path call.
    ($xmlData = [xml]::new()).Load((Convert-Path -LiteralPath $xmlPath))
    

    This delegates interpretation of the character encoding to the System.Xml.XmlDocument.Load .NET API method, which not only assumes the proper default for XML (UTF-8), but also respects any explicit encoding specification as part of the XML declaration, if present (e.g.,
    <?xml version="1.0" encoding="iso-8859-1"?>)

    See also:


    [1] If you happen to know the encoding of the input file ahead of time, you can get away with using Get-Content's -Encoding parameter in your original approach ([xml]$xmldata = Get-Content -Encoding utf8 $xmlpath, best to add -Raw to Get-Content for better performance), but only the .Load()-based approach is robust by default.