xml powershell hash

XmlReader from memory rather than file


I am looking for a way to load complex XML files with comments and report errors with correct line numbers, while also being able to hash the important XML after validation so that I don't have to validate on every single load. This is user-generated XML, after all.

For the first part, I can import the XML with whitespace and comments included so that an XML error reports correct line numbers, but I then need a second import with whitespace and comments ignored for the hashing. At validation, which is slow, I add a hash attribute to the root element. On later loads I can import, read the hash attribute, delete it, hash the remaining XML, and compare the two hashes. The hashing is fast and means I only validate files that have materially changed since the last validation. I import without whitespace and comments because those are not material changes.

The problem is that this takes two imports, which is inefficient and just plain ugly. I have tried re-"importing" with whitespace and comments ignored, directly from the full XML document already loaded, using the function below. But it seems that importing from memory doesn't preserve attribute order the way importing from a file does, so the hash fails when in fact nothing changed. Am I missing something, or is this just a tree that can't be climbed?

function reimportXmlForUse {
    param (
        [System.Xml.XmlDocument]$fullXmlDoc
    )

    $settings = [System.Xml.XmlReaderSettings]::new()
    $settings.IgnoreWhitespace = $true
    $settings.IgnoreComments = $true

    $xmlString = $fullXmlDoc.OuterXml
    $stringReader = [System.IO.StringReader]::new($xmlString)
    $reader = [System.Xml.XmlReader]::Create($stringReader, $settings)

    $cleanXmlDoc = [System.Xml.XmlDocument]::new()
    $cleanXmlDoc.PreserveWhitespace = $false
    $cleanXmlDoc.Load($reader)
    $reader.Close()

    return $cleanXmlDoc
}

EDIT: Full script added below. I start from an example XML file, which I import and hash; I then add the hash as an attribute and save. The first test is to immediately load that hashed file, remove the attribute, and rehash. These two hashes SHOULD be the same. When I imported the first file with whitespace and comments included, imported it again without them for hashing, and added the hash attribute to the first document, everything worked: it passed the first test, copied the file twice, and waited until I changed a comment in one copy and a pertinent element in the other. But of course it flagged changes in both elements and comments, which is what I am trying to avoid. These files can be a few hundred lines long, and XML validation in PowerShell is slow as molasses. When a process is being kicked off on a few hundred machines at once, I really don't want a multi-minute delay just to validate XML. And I don't want to trigger validation because someone changed or added a note; I only want to revalidate after a change that will actually impact outcomes.

function importXmlForEditing {
    param ($filePath)
    
    if (-not (Test-Path $filePath)) {
        throw "File not found: $filePath"
    }
    
    $settings = [System.Xml.XmlReaderSettings]::new()
    $settings.IgnoreWhitespace = $false
    
    $reader = [System.Xml.XmlReader]::Create($filePath, $settings)
    
    $xmlDoc = [System.Xml.XmlDocument]::new()
    $xmlDoc.PreserveWhitespace = $true
    $xmlDoc.Load($reader)
    $reader.Close()
    
    return $xmlDoc
}

function cloneXmlForUse {
    param ([System.Xml.XmlDocument]$xmlDoc)

    $cloned = $xmlDoc.CloneNode($true)

    # Remove comments
    $comments = $cloned.SelectNodes("//comment()")
    foreach ($comment in $comments) {
        $null = $comment.ParentNode.RemoveChild($comment)
    }

    $cloned.DocumentElement.Normalize()
    return $cloned
}

function reimportXmlForUse {
    param (
        [System.Xml.XmlDocument]$fullXmlDoc
    )

    $settings = [System.Xml.XmlReaderSettings]::new()
    $settings.IgnoreWhitespace = $true
    $settings.IgnoreComments = $true

    $xmlString = $fullXmlDoc.OuterXml
    $stringReader = [System.IO.StringReader]::new($xmlString)
    $reader = [System.Xml.XmlReader]::Create($stringReader, $settings)

    $cleanXmlDoc = [System.Xml.XmlDocument]::new()
    $cleanXmlDoc.PreserveWhitespace = $false
    $cleanXmlDoc.Load($reader)
    $reader.Close()

    return $cleanXmlDoc
}

function exportXml {
    param ($xmlDoc, $filePath)

    $settings = [System.Xml.XmlWriterSettings]::new()
    $settings.Indent = $false
    $settings.OmitXmlDeclaration = $false
    $settings.Encoding = [System.Text.UTF8Encoding]::new($false)

    $stream = [System.IO.FileStream]::new($filePath, [System.IO.FileMode]::Create)
    $writer = [System.Xml.XmlWriter]::Create($stream, $settings)
    $xmlDoc.WriteTo($writer)
    $writer.Flush()
    $writer.Close()
    $stream.Close()
}

function compareHash {
    param ([System.Xml.XmlDocument]$xmlDoc)

    $savedHash = $xmlDoc.DocumentElement.GetAttribute("hash")
    if (-not $savedHash) {
        return $false
    }

    # Clone to avoid modifying the original
    $clone = $xmlDoc.CloneNode($true)

    # Remove the hash before computing the new one
    $clone.DocumentElement.RemoveAttribute("hash")
    $computedHash = computeHash $clone
    Write-Host "s: $savedHash"
    Write-Host "c: $computedHash"

    return ($computedHash -eq $savedHash)
}

function computeHash {
    param ($xmlDoc)

    $utf8Encoding = [System.Text.UTF8Encoding]::new($false)
    $ms = [System.IO.MemoryStream]::new()
    $writer = [System.Xml.XmlWriter]::Create($ms, [System.Xml.XmlWriterSettings]@{
        Indent = $false
        OmitXmlDeclaration = $false
        Encoding = $utf8Encoding
    })

    $xmlDoc.WriteTo($writer)
    $writer.Flush()
    $ms.Position = 0

    $reader = [System.IO.StreamReader]::new($ms, $utf8Encoding)
    $xmlString = $reader.ReadToEnd()
    $bytes = [System.Text.Encoding]::UTF8.GetBytes($xmlString)
    $sha256 = [System.Security.Cryptography.SHA256]::Create()
    $hashBytes = $sha256.ComputeHash($bytes)

    return [BitConverter]::ToString($hashBytes) -replace '-', ''
}

### validate
$unhashedPath = "$PSScriptRoot\Definitions_2022_RVT.xml"
$hashedPath = "$PSScriptRoot\Definitions_2022_RVT_HASHED.xml"
$changedElementPath = "$PSScriptRoot\Definitions_2022_RVT_CHANGED Element.xml"
$changedCommentPath = "$PSScriptRoot\Definitions_2022_RVT_CHANGED Comment.xml"

if ([IO.File]::Exists($hashedPath)) {
    [IO.File]::Delete($hashedPath)
}
if ([IO.File]::Exists($changedElementPath)) {
    [IO.File]::Delete($changedElementPath)
}
if ([IO.File]::Exists($changedCommentPath)) {
    [IO.File]::Delete($changedCommentPath)
}
$rawXmlDoc = importXmlForEditing $unhashedPath
$cleanXmlDoc = reimportXmlForUse $rawXmlDoc
$currentHash = computeHash $cleanXmlDoc
$rawXmlDoc.DocumentElement.SetAttribute("hash", $currentHash)
exportXml $rawXmlDoc $hashedPath


### verify that the hashed xml passes
$testXmlDoc = importXmlForEditing $hashedPath
$testXmlDoc = reimportXmlForUse $testXmlDoc
if (compareHash $testXmlDoc) {
    Write-Host "$hashedPath is valid"
    [IO.File]::Copy($hashedPath, $changedElementPath, $true)
    [IO.File]::Copy($hashedPath, $changedCommentPath, $true)
    Read-Host "Press Enter to continue"

    $changedElementXmlDoc = importXmlForEditing $changedElementPath
    $changedElementXmlDoc = reimportXmlForUse $changedElementXmlDoc
    if (compareHash $changedElementXmlDoc) {
        Write-Host "$changedElementPath is unchanged"
    } else {
        Write-Host "$changedElementPath changed!"
    }

    $changedCommentXmlDoc = importXmlForEditing $changedCommentPath
    $changedCommentXmlDoc = reimportXmlForUse $changedCommentXmlDoc
    if (compareHash $changedCommentXmlDoc) {
        Write-Host "$changedCommentPath is unchanged"
    } else {
        Write-Host "$changedCommentPath changed!"
    }

} else {
    
}

Edit 2: The XML I am using, for reference. It does give a feel for what a mess Autodesk uninstalls are. :)

<?xml version="1.0" encoding="utf-8"?>
<!-- Release History
     30.12.2021
-->
<Definitions>
    <Sets>
        <Set id="RVT_2022">
            <Rollout>ADSK</Rollout>
            <Rollout>RVT_2022</Rollout>
        </Set>
        <Set id="RVT_2022-X">
            <Remove>RVT_2022</Remove>
        </Set>
    </Sets>

    <Packages>
        <Package id="RVT_2022" product="RVT2022">
            <Rollout>
                <Machine>
                    <!-- Some sort of note about the install -->
                    <Install id="RVT_2022.0">
                        <InstallPathAndTarget>\\px\Rollouts\ADSK\2022\Revit_2022\Deployment\Revit_2022_N\image\Installer.exe</InstallPathAndTarget>
                        <FilePath>$(InstallPathAndTarget)</FilePath>
                        <ArgumentList>-i deploy -q --offline_mode -o "\\px\Rollouts\ADSK\2022\Revit_2022\Deployment\Revit_2022_N\image\Collection.xml" --installer_version "1.18.0.25" /norestart</ArgumentList>
                    </Install>
                    
                    <!-- Some sort of note about the Copy 
                     With multiple lines -->
                    <Copy>
                        <Source>\\px\Rollouts\ADSK\2022\Revit_2022\Deployment\Revit_2022_N\Seeds\reviticon_2022.ico</Source>
                        <Destination>[Product~Icon]</Destination>
                    </Copy>
                    <Copy>
                        <Source>\\px\Rollouts\ADSK\2022\Revit_2022\Deployment\Revit_2022_N\Seeds\reviticon_2022viewer.ico</Source>
                        <Destination>[Product~Icon_Viewer]</Destination>
                    </Copy>
                    <Copy id="UserDataCache Revit.ini">
                        <Source>\\px\Rollouts\ADSK\2022\Revit_2022\Seeds\Revit.ini</Source>
                        <Destination>[Product~UserDataCache]</Destination>
                    </Copy>

                    <Delete>[Windows~PublicDesktop]\[Product~Shortcut]</Delete>
                </Machine>
                <User>
                    <Copy id="User Revit.ini">
                        <Source>[Product~UserDataCache]\Revit.ini</Source>
                        <Destination>[Product~UserAppDataRoaming]</Destination>
                    </Copy>
                    <Copy id="Profile">
                        <Source>\\px\Rollouts\ADSK\2022\Revit_2022\Seeds\Profile.xml</Source>
                        <Destination>[Product~UserAppDataProfile]</Destination>
                    </Copy>
                    <Copy id="Desktop shortcut">
                        <Source>[Product~ShortcutPath]\[Product~Shortcut]</Source>
                        <Destination>[Windows~UserDesktop]</Destination>
                    </Copy>
                </User>
            </Rollout>
            <Remove>
                <Machine>
                    <Uninstall id="RVT_2022">
                        <File>C:\Program Files\Autodesk\AdODIS\V1\Installer.exe</File>
                        <Arguments>-i uninstall -q --trigger_point system -m C:\ProgramData\Autodesk\ODIS\metadata\{03BD6A4A-C858-3AD2-9353-DF2974C9918B}\bundleManifest.xml -x C:\ProgramData\Autodesk\ODIS\metadata\{03BD6A4A-C858-3AD2-9353-DF2974C9918B}\SetupRes\manifest.xsd</Arguments>
                    </Uninstall>
                    <!-- <Uninstall id="Autodesk Advanced Material Library Base Resolution Image Library 2022">{7E78B513-B354-4833-8897-3ED5C515D30F}</Uninstall> --> <!-- shared with Navisworks -->
                    <Uninstall id="Autodesk Advanced Material Library Low Resolution Image Library 2022">{EEAD8CC3-B6B7-4D4B-AF0D-4BBD3D93D67C}</Uninstall>
                    <Uninstall id="Autodesk Advanced Material Library Medium Resolution Image Library 2022">{493ACC3C-3ABF-4CBB-8F6E-E4433090A589}</Uninstall>
                    <!-- <Uninstall id="Autodesk Material Library 2022">{A9221A68-5AD0-4215-B54F-CB5DBA4FB27C}</Uninstall> --> <!-- shared with AutoCAD & Navisworks -->
                    <!-- <Uninstall id="Autodesk Material Library Base Resolution Image Library 2022">{6256584F-B04B-41D4-8A59-44E70940C473}</Uninstall> --> <!-- shared with AutoCAD & Navisworks -->
                    <Uninstall id="Autodesk Material Library Low Resolution Image Library 2022">{490259AE-1021-4BED-B74B-162151EC45C7}</Uninstall>
                    <Uninstall id="Autodesk Material Library Medium Resolution Image Library 2022">{8300AA3F-6ADF-4233-A1FB-73B1894102F0}</Uninstall>
                    
                    
                    <Uninstall id="OpenStudio CLI For Revit 2022">{7F84EE71-7DAF-4CEE-B063-FA3C931E1206}</Uninstall>
                    <Uninstall id="Autodesk Revit Unit Schemas 2022">{CDCC6F31-2022-4901-8E9B-D562B70697B6}</Uninstall> <!-- RVT 2022.0 -->
                    <Uninstall id="Autodesk Revit Unit Schemas 2022">{CDCC6F31-2022-4902-8E9B-D562B70697B6}</Uninstall> <!-- RVT 2022.0.1 -->
                    <Uninstall id="Autodesk Revit Unit Schemas 2022">{CDCC6F31-2022-4903-8E9B-D562B70697B6}</Uninstall> <!-- RVT 2022.1 -->
                    <Uninstall id="Autodesk Revit Unit Schemas 2022">{CDCC6F31-2022-4904-8E9B-D562B70697B6}</Uninstall> <!-- RVT 2022.1.1 -->
                    <!-- <Uninstall id="Autodesk Cloud Models for Revit 2022">{AA384BE4-2201-0010-0000-97E7D7D021A0}</Uninstall> -->
                    <!-- <Uninstall id="Autodesk Revit 2022">{03BD6A4A-C858-3AD2-9353-DF2974C9918B}</Uninstall> -->
                    <!-- <Uninstall id="Autodesk Revit Content Core 2022">{AA384BE4-2022-0410-0000-9241AD002DA5}</Uninstall> -->
                    <!-- <Uninstall id="Autodesk Revit Content Core-RVT 2022">{CC7D1ED0-2022-0410-0000-1CC925969102}</Uninstall> -->
                    <!-- <Uninstall id="Autodesk Revit Product Feedback 2022">{D0AA00F5-2022-4900-BB7C-21929DC2B241}</Uninstall> -->
                    <!-- <Uninstall id="FormIt Converter for Revit 2022">{211B8FB3-8E4C-4DBE-80EE-0BC47C3F5953}</Uninstall> --> <!-- 2022.0 & 2022.0.1 -->
                    <!-- <Uninstall id="FormIt Converter for Revit 2022">{211B8FB3-8E4C-4DBE-2210-0BC47C3F5953}</Uninstall> --> <!-- 2022.1 & 2022.1.1 -->
                    <!-- <Uninstall id="Generative Design For Revit">{1DE5450F-7EE9-47EB-A9E9-F89DC3168C4E}</Uninstall> -->
                    <!-- <Uninstall id="Personal Accelerator for Revit">{6E1DC831-145C-4FB6-97CC-714AB058D840}</Uninstall> -->
                    <!-- <Uninstall id="Results Explorer Manager">{AE1C056A-728A-44CC-863A-E52124941AA2}</Uninstall> -->
                    <!-- <Uninstall id="Revit 2022">{7346B4A0-2200-0510-0000-705C0D862004}</Uninstall> -->
                    <!-- <Uninstall id="REX Framework">{A24E5DBF-7C6F-4589-AE67-2D1049C4308E}</Uninstall> -->
                    <!-- <Uninstall id="REX Revit">{58373F9B-D120-4017-B361-94BDDEBB93DD}</Uninstall> -->
                    <!-- <Uninstall id="RSA COM">{41B1F45C-E6F6-4A50-916F-2C5EF4942BA1}</Uninstall> -->
                    <!-- <Uninstall id="RSA CommonData">{877A759F-2E2D-47D2-83E1-71690F4A2FC1}</Uninstall> -->
                    <!-- <Uninstall id="RSA Interop">{238555B3-F53C-4006-9E9B-0B0C44B2FDCE}</Uninstall> -->
                    <!-- <Uninstall id="RSA RoReinf">{C2B239AC-FB15-47D7-9EAD-3CC84A2A9AF3}</Uninstall> -->
                    <Delete>[Product~ProgramDataFolder]</Delete>
                </Machine>
                <User>
                    <Delete>[Product~UserAppDataLocal]</Delete>
                    <Delete>[Product~UserAppDataRoaming]</Delete>
                    <Delete>[Product~UserAppDataAddins]</Delete>
                    <Delete>C:\Users\[Px~UserName]\AppData\Roaming\Autodesk\ADPSDK\RVT\2022</Delete>
                    <Delete>HKCU\Software\Autodesk\GenerativeDesign 2022</Delete>
                    <Delete>HKCU\Software\Autodesk\Revit\2022</Delete>
                    <Delete>HKCU\Software\Autodesk\Revit\Autodesk Revit 2022</Delete>
                    <Delete>HKCU\Software\Autodesk\Revit Precast Tools 2022</Delete>
                    <Delete>[Windows~UserDesktop]\[Product~Shortcut]</Delete>
                </User>
            </Remove>
            <Relocate>
                <Machine>
                </Machine>
                <User>
                </User>
            </Relocate>
        </Package>
    </Packages>
</Definitions>

Solution

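  • If you're concerned with unnecessarily reading and parsing the same document multiple times, you could load it once, comments and all, and instead leave the comments out when you serialize the document for hashing.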

    Since [XmlWriterSettings] doesn't have an OmitComments option, you'll have to hide the comments in some other fashion.

    One option is to extend the XmlTextWriter type by overriding XmlWriter.WriteComment with an empty method - here using PowerShell classes:

    class NoCommentXmlWriter : System.Xml.XmlTextWriter 
    {
      NoCommentXmlWriter([IO.Stream]$output) 
        : base($output, [System.Text.UTF8Encoding]::new($false))
      {
      }
    
      WriteComment([string]$comment)
      {
        # don't do anything with comments
      }
    }
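
As a quick check of what such a writer emits, here is a minimal sketch that serializes a tiny document through it and inspects the result. The class is repeated so the snippet runs on its own; the sample document and the variable names are made up for illustration.

```powershell
# Same idea as the class above, repeated so this snippet is self-contained
class NoCommentXmlWriter : System.Xml.XmlTextWriter
{
  NoCommentXmlWriter([IO.Stream]$output)
    : base($output, [System.Text.UTF8Encoding]::new($false))
  {
  }

  [void] WriteComment([string]$text)
  {
    # swallow comments instead of writing them
  }
}

# Illustrative document with one comment
[xml]$doc = '<root><!-- note --><item id="1">value</item></root>'

$stream = [System.IO.MemoryStream]::new()
$writer = [NoCommentXmlWriter]::new($stream)

# XmlComment.WriteTo ends up calling WriteComment, which is now a no-op
$doc.DocumentElement.WriteTo($writer)
$writer.Flush()

# Decode the written bytes to see what would be hashed
$serialized = [System.Text.Encoding]::UTF8.GetString($stream.ToArray())
$serialized
```

The string in `$serialized` is what a hash would be computed over, so it is a handy thing to print when two hashes unexpectedly differ.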
    

    For calculating the hash checksum, you can skip the writer-to-reader-to-string step entirely. $ms.ToArray() would suffice for reading the stream contents into a byte array, which is exactly what you need in the very next step, but we don't even have to do that! HashAlgorithm.ComputeHash happily takes a stream object as an input argument, reading from the stream's current position onward:

    function Get-XmlElementHash {
      param(
        [Parameter(ValueFromPipeline)]
        [System.Xml.XmlElement[]]$Element
      )
    
      begin {
        # create stream + writer
        $xmlOutputStream = [System.IO.MemoryStream]::new()
        $writer = [NoCommentXmlWriter]::new($xmlOutputStream)
      }
    
      process { 
        # write any input elements to writer, flush to stream
        # (iterating $Element works for both pipeline and -Element input)
        foreach ($node in $Element) { $node.WriteTo($writer) }
        $writer.Flush()
      }
    
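      end {
        # rewind first: ComputeHash reads from the stream's current position,
        # which sits at the end of the stream after writing
        $xmlOutputStream.Position = 0

        # calculate hash over entire output stream
        $hashAlg = [System.Security.Cryptography.SHA256]::Create()
        $hash = $hashAlg.ComputeHash($xmlOutputStream)
        
        return Write-Output $hash -NoEnumerate
      }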
    } 
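
One detail worth checking when hashing a MemoryStream directly: ComputeHash(Stream) reads from the stream's current position, and that position is at the end right after writing. A minimal sketch of the difference, using a made-up payload:

```powershell
$stream = [System.IO.MemoryStream]::new()
$bytes = [System.Text.Encoding]::UTF8.GetBytes('<root />')
$stream.Write($bytes, 0, $bytes.Length)

$sha256 = [System.Security.Cryptography.SHA256]::Create()

# Position is at the end after writing, so this hashes zero bytes
$hashAtEnd = [System.BitConverter]::ToString($sha256.ComputeHash($stream))

# Rewind, then hash the content that was actually written
$stream.Position = 0
$hashFromStart = [System.BitConverter]::ToString($sha256.ComputeHash($stream))

# Reference value: hash of the same bytes, computed from the array
$hashOfBytes = [System.BitConverter]::ToString($sha256.ComputeHash($bytes))

"at end:     $hashAtEnd"
"from start: $hashFromStart"
"from bytes: $hashOfBytes"
```

If every document produces the same checksum, a stream that was not rewound is the first thing to rule out.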
    

    Based on your description it sounds like you want to only calculate a checksum over a subset of the document.
    Since you haven't shared a sample document, I'll use this dummy document as an example - it has comments and stuff we don't care about:

    [xml]$xmlDoc = @'
    <root>
      <element>
        <childElement>
          <!-- nothing to see here 👀 -->
          <grandChildElement leaf="value" />
        </childElement>
      </element>
      <unimportantStuff>
        <nonsense />
      </unimportantStuff>
    </root>
    '@
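
The functions further down take an XPath selector that is evaluated relative to the document element. As a small sketch of which nodes a few such selectors pick, using a trimmed copy of the dummy document (the variable names are illustrative only):

```powershell
# Trimmed copy of the dummy document, for illustration only
[xml]$doc = @'
<root>
  <element>
    <childElement>
      <grandChildElement leaf="value" />
    </childElement>
  </element>
  <unimportantStuff>
    <nonsense />
  </unimportantStuff>
</root>
'@

# './*[1]' selects the first child element of <root>, whatever its name
$firstChild = $doc.DocumentElement.SelectNodes('./*[1]')

# 'element[1]' selects the first child that is named <element>
$byName = $doc.DocumentElement.SelectNodes('element[1]')

# an XPath union selects several siblings in one go
$both = $doc.DocumentElement.SelectNodes('element | unimportantStuff')

$firstChild | ForEach-Object Name
$byName     | ForEach-Object Name
$both       | ForEach-Object Name
```

A union selector like the last one might be useful when more than one top-level section counts as material.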
    

    Below follows an entirely file-less example calculating, appending, reading, and verifying an embedded checksum:

    # we need 2 functions:
    #  - one to calculate/append the checksum after initial verification
    #  - one to verify the embedded signature
    
    function Append-XmlElementHash {
      param(
        [Parameter(ValueFromPipeline, Mandatory)]
        [xml]$Document,
        [string]$RootSelector = $('./*[1]')
      )
    
      # calculate hash over selected elements, using `Get-XmlElementHash` from earlier
      $hash = $Document.DocumentElement.SelectNodes($RootSelector) |Get-XmlElementHash
    
      # create <hash /> element to append to the document
      $hashChecksumElement = $Document.CreateElement('hash')
      $hashChecksumElement.InnerText = [System.BitConverter]::ToString($hash)
    
      # append hash element to document root element
      [void]$Document.DocumentElement.AppendChild($hashChecksumElement)
    }
    
    function Assert-XmlElementHash {
      param(
        [Parameter(ValueFromPipeline, Mandatory)]
        [xml]$Document,
        [string]$RootSelector = $('./*[1]')
      )
    
      $ErrorActionPreference = 'Stop'
    
      # start by locating and validating the format of the embedded hash checksum
      $hashElement = $Document.DocumentElement.SelectSingleNode('hash[last()]')
      if (-not $hashElement) {
        Write-Error 'hash element not found in document'
      }
    
      $embeddedChecksum = $hashElement.InnerText.Trim()
      if ($embeddedChecksum -cnotmatch '^(?:[0-9A-F]{2}\-){31}[0-9A-F]{2}$') {
        Write-Error 'unexpected hash element format'
      }
    
      # then calculate the _actual_ checksum based on the document contents
      $hash = $Document.DocumentElement.SelectNodes($RootSelector) |Get-XmlElementHash
    
      $actualChecksum = [System.BitConverter]::ToString($hash)
      if ($actualChecksum -cne $embeddedChecksum) {
        Write-Error "tampering detected! embedded hash checksum '$embeddedChecksum' does not match actual hash checksum '$actualChecksum'"
      }
    }
    
    
    # assume we've manually edited and validated the 
    # contents of the document by hand at this point
    
    # now we embed the hash
    $xmlDoc |Append-XmlElementHash -RootSelector element[1]
    
    # and now we can verify the integrity of the <element /> hierarchy sans comments
    try {
      $xmlDoc |Assert-XmlElementHash -RootSelector element[1]
      Write-Host "Hash looks good" -Foreground Green
    }
    catch {
      Write-Host "Verification failed! [${_}]" -Foreground Red
    }
    
    # we can safely add more comments without breaking the checksum
    [void]$xmlDoc.SelectSingleNode('//childElement').AppendChild(
      $xmlDoc.CreateComment('nothing to see here either!')
    )
    
    # ... and edit the contents of elements outside the hashed node set
    $xmlDoc.SelectSingleNode('//nonsense').SetAttribute('someAttribute', 'newValue')
    
    try {
      $xmlDoc |Assert-XmlElementHash -RootSelector element[1]
      Write-Host "Hash looks good after new comments" -Foreground Green
    }
    catch {
      Write-Host "Verification failed! [${_}]" -Foreground Red
    }
    
    # but editing the hashed node set will result in verification failure
    $xmlDoc.SelectSingleNode('//grandChildElement').SetAttribute('leaf', 'otherValue')
    
    try {
      $xmlDoc |Assert-XmlElementHash -RootSelector element[1]
      Write-Host "Hash looks good after editing the hashed element" -Foreground Green
    }
    catch {
      Write-Host "Verification failed! [${_}]" -Foreground Red
    }
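
If the document is loaded once with PreserveWhitespace = $true, as in the question, the whitespace-only nodes between elements are written out too and would change the checksum. A possible extension, not part of the class above, is to also override WriteWhitespace. This is a sketch under stated assumptions: the names MaterialXmlWriter and Get-MaterialXmlHash are hypothetical, and it relies on whitespace-only nodes being written through WriteWhitespace, so whitespace inside text content or under xml:space="preserve" still counts.

```powershell
# Hypothetical writer: drops comments AND insignificant whitespace nodes
class MaterialXmlWriter : System.Xml.XmlTextWriter
{
  MaterialXmlWriter([IO.Stream]$output)
    : base($output, [System.Text.UTF8Encoding]::new($false))
  {
  }

  [void] WriteComment([string]$text) { }
  [void] WriteWhitespace([string]$ws) { }
}

# Hypothetical helper: hex SHA-256 over a node, ignoring comments/whitespace
function Get-MaterialXmlHash {
  param([Parameter(Mandatory)][System.Xml.XmlNode]$Node)

  $stream = [System.IO.MemoryStream]::new()
  $writer = [MaterialXmlWriter]::new($stream)
  $sha256 = [System.Security.Cryptography.SHA256]::Create()
  try {
    $Node.WriteTo($writer)
    $writer.Flush()
    $stream.Position = 0   # ComputeHash reads from the current position
    return [System.BitConverter]::ToString($sha256.ComputeHash($stream)) -replace '-', ''
  }
  finally {
    $sha256.Dispose()
    $writer.Dispose()      # also closes the underlying stream
  }
}

# Same content, once with a comment and indentation, once compact
$annotated = [System.Xml.XmlDocument]::new()
$annotated.PreserveWhitespace = $true
$annotated.LoadXml(@'
<root>
  <!-- a note -->
  <item id="1">value</item>
</root>
'@)

$compact = [System.Xml.XmlDocument]::new()
$compact.PreserveWhitespace = $true
$compact.LoadXml('<root><item id="1">value</item></root>')

# A material change: different attribute value
$edited = [System.Xml.XmlDocument]::new()
$edited.PreserveWhitespace = $true
$edited.LoadXml('<root><item id="2">value</item></root>')

$hashAnnotated = Get-MaterialXmlHash $annotated.DocumentElement
$hashCompact   = Get-MaterialXmlHash $compact.DocumentElement
$hashEdited    = Get-MaterialXmlHash $edited.DocumentElement

"annotated: $hashAnnotated"
"compact:   $hashCompact"
"edited:    $hashEdited"
```

With this approach a single load with line numbers intact might be enough. A hash stored as a root attribute would still have to be excluded before hashing, for example by hashing a clone with the attribute removed, as compareHash in the question does.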