
PowerShell Script to Replace non-HTML Tags


I'm working on a PowerShell script that finds lines containing angle brackets that do not belong to HTML tags and replaces those brackets with &lt; and &gt;. However, the replacement logic in my current script is not working as intended. Note that I'm operating on Markdown files, and I need to do this with PowerShell version 5.1.22621.2428, without any external dependencies.

To simplify the script, these are the requirements I posed as necessary for a tag to be interpreted as such:

Here's the test markdown file I've been using to test these scripts:

<
text
>
<ul>
   <li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
> <
5 > 3 < 2 > 1

<a href="www.google.com">placeholder!!</a>

<hello >
< ciao>
<hello>
<hello></hello>
</b>
< /b>

The correct output should be:

&lt;
text
&gt;
<ul>
   <li>[Message processing time] &lt; [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] &lt; [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
&lt;&gt;
&gt; &lt;
5 &gt; 3 &lt; 2 &gt; 1

<a href="www.google.com">placeholder!!</a>

<hello >
&lt; ciao&gt;
<hello>
<hello></hello>
</b>
&lt; /b&gt;

What I've tried

I've tried a couple of different approaches. In the first attempt, I used a plain PowerShell script, as the circumstances require. This is the script I'm trying to fix.

It works well for standalone tags, but it breaks when tags are nested inside one another. Here is an example text for demonstration purposes:

<
text
>
<ul>
   <li>[Message processing time] < [time to send ack to Azure Service Bus] (about less than 100ms per message)</li>
   <li>[Total process time of a group of messages] < [Message Lock Time] (default: 1 min)<br><b>Strictly REQUIRED</b> to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
> <
5 > 3 < 2 > 1

<a href="www.google.com">placeholder!!</a>

This is translated into:

&lt;
text
&gt;
<ul>
   &lt;li&gt;[Message processing time] &lt; [time to send ack to Azure Service Bus] (about less than 100ms per message)&lt;/li&gt;
   &lt;li&gt;[Total process time of a group of messages] &lt; [Message Lock Time] (default: 1 min)&lt;br&gt;&lt;b&gt;Strictly REQUIRED&lt;/b&gt; to avoid lock loss and messages processed more than one time
   </li>
</ul>
<>
&gt; &lt;
5 &gt; 3 &lt; 2 &gt; 1

<a href="www.google.com">placeholder!!</a>

Here's the code:

function find-nonHTMLtags($files) {
    foreach ($file in $files) {
        try {
            # Read the content of the file
            $content = Get-Content -Path $file.FullName -Raw

            # Process each line
            $modifiedContent = foreach ($line in $content -split '\r?\n') {
                # Replace < with &lt; if it is not part of a closed HTML tag or has a space after it
                if ($line -notmatch '<\s*(?:[^>]+)?>' -or $line -match '<\s') {
                    $line = $line -replace '<', '&lt;'
                }

                # Replace > with &gt; if it is not part of a closed HTML tag
                if ($line -notmatch '<\s*(?:[^>]+)?>') {
                    $line = $line -replace '>', '&gt;'
                }

                # Output the modified line or the original line if no changes were made
                $line
            }

            # Join the modified lines into the modified content
            $modifiedContent = $modifiedContent -join "`r`n"

            # Check if both $content and modified content are non-empty before determining modification
            if (-not [string]::IsNullOrEmpty($content) -and $content -ne $modifiedContent) {
                # Write the modified content back to the file
                $modifiedContent | Set-Content -Path $file.FullName -Encoding UTF8
                Write-Host "Changed non-HTML tag(s) at: $($file.FullName)"
            }
        }
        catch {
            Write-Host "`nCouldn't change non-HTML tag(s) at: $($file.FullName). $_"
        }
    }
}

$mdFiles = Get-ChildItem -Path $path -File -Recurse -Filter '*.md'
find-nonHTMLtags $mdFiles
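
The over-escaping can be reproduced in isolation. The sketch below applies the same two conditions as the function above to a single shortened line modelled on the sample (the variable names are only for illustration). Both tests describe the line as a whole, so one `<` followed by whitespace appears to be enough to escape every `<` on that line, including those that belong to real tags:

```powershell
# One shortened line that mixes real tags with a stray '<'
$line = '<li>[a] < [b]</li>'

# Same conditions as in the script: they test the whole line, not one bracket
$tagPattern  = '<\s*(?:[^>]+)?>'
$hasTag      = $line -match $tagPattern   # True: '<li>' looks like a tag
$hasSpacedLt = $line -match '<\s'         # True: because of ' < '

if (-not $hasTag -or $hasSpacedLt) {
    # -replace rewrites every '<' on the line, not just the stray one
    $line = $line -replace '<', '&lt;'
}

# No '<' is left, so the tag test now fails and every '>' is escaped as well
if ($line -notmatch $tagPattern) {
    $line = $line -replace '>', '&gt;'
}

$line   # &lt;li&gt;[a] &lt; [b]&lt;/li&gt;
```

This matches the `<li>` lines in the translated output above, which suggests the decision has to be made per bracket rather than per line.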

The second approach I've tried uses the HtmlAgilityPack (HAP) library, loaded from its .dll file. This works great, but sadly I've been told I can't use such files since they could pose a security threat. Here's the code anyway:

param (
    $path
)

function ReplaceSymbols($files) {
    foreach ($file in $files) {
        try {
            $content = Get-Content -Path $file.FullName -Raw

            Add-Type -Path (Join-Path $PSScriptRoot 'HtmlAgilityPack.dll')

            $htmlDocument = New-Object HtmlAgilityPack.HtmlDocument
            $htmlDocument.LoadHtml($content)

            # Iterate through each HTML node
            foreach ($node in $htmlDocument.DocumentNode.DescendantsAndSelf()) {
                # Check if the node is text
                if ($node.NodeType -eq 'Text') {
                    # Replace < with &lt; and > with &gt; only in text nodes
                    $node.InnerHtml = $node.InnerHtml -replace '<', '&lt;' -replace '>', '&gt;'
                }
            }

            if (-not [string]::IsNullOrEmpty($content) -and $content -ne $htmlDocument.DocumentNode.OuterHtml) {
                $htmlDocument.DocumentNode.OuterHtml | Set-Content -Path $file.FullName -Encoding UTF8
                Write-Host "File content modified: $($file.FullName)"
            }
        }
        catch {
            Write-Host "Error modifying file content: $($file.FullName). $_"
        }
    }
}

$mdFiles = Get-ChildItem -Path $path -File -Recurse -Include '*.md'
Write-Host "Markdown Files Count $($mdFiles.Count)"
ReplaceSymbols $mdFiles

Another approach that works uses JavaScript with Node.js, but sadly I cannot use it either, since Node.js is not supported. Code:

const fs = require('fs');
const path = require('path');

function replaceNonHTMLtags(files) {
    files.forEach(filePath => {
        try {
            const content = fs.readFileSync(filePath, 'utf8');

            String.prototype.replaceAt = function (index, char) {
                let arr = this.split('');
                arr[index] = char;
                return arr.join('');
            };
            
            String.prototype.escape = function () {
                let p = /(?:<[a-zA-Z]+\s*[^>]*>)|(?:<\/[a-zA-Z]+>)|(?<lt><)|(?<gt>>)/g,
                result = this,
                match = p.exec(result);
            
                while (match !== null) {
                    if (match.groups.lt !== undefined) {
                        result = result.replaceAt(match.index, '&lt;');
                    } else if (match.groups.gt !== undefined) {
                        result = result.replaceAt(match.index, '&gt;');
                    }
                    match = p.exec(result);
                }
                return result;
            };
            
            

            // Perform modifications on the content
            const modifiedContent = content.escape();

            // Check if both content and modifiedContent aren't empty before doing anything else
            if (content !== '' && content !== modifiedContent) {
                // Write the modified content back to the file
                fs.writeFileSync(filePath, modifiedContent, 'utf8');
                console.log(`Edited: ${modifiedContent}`);
                console.log(`Edited: ${filePath}`);
            }
        } catch (error) {
            console.log(`Couldn't edit: ${filePath}. ${error.message}`);
        }
    });
}

const dynamicPath = ''; // Empty to use only __dirname
const orderFiles = fs.readdirSync(path.join(__dirname, dynamicPath)).filter(file => file.endsWith('.md')).map(file => path.join(dynamicPath, file));

console.log(`Markdown Files Count: ${orderFiles.length}`);
replaceNonHTMLtags(orderFiles);

Solution

  • As with any regex-based HTML processing, the following isn't fully robust, but may work in your case:

    $modifiedContent = 
      (Get-Content -Raw $file) `
        -replace '<(?!(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?>)',  '&lt;' `
        -replace '(?<!<(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?)>', '&gt;'
    

    The gist of the approach is to use a negative look-ahead assertion ((?!…)) to make sure that what follows < isn't the rest of an HTML tag, and, analogously, a negative look-behind assertion ((?<!…)) to ensure that what precedes > isn't the start of one.
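
To see the two assertions at work, the sketch below wraps the two `-replace` operations in a helper and runs it over a few lines modelled on the question's sample. The function name `ConvertTo-EscapedBracket` and the `$samples` list are made up for this illustration:

```powershell
# Hypothetical helper wrapping the two -replace operations shown above
function ConvertTo-EscapedBracket {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string] $Text
    )

    # 1st pass: escape '<' unless a complete tag follows it
    # 2nd pass: escape '>' unless the start of a tag precedes it
    $Text `
        -replace '<(?!(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?>)',  '&lt;' `
        -replace '(?<!<(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?)>', '&gt;'
}

$samples = '<hello >', '< ciao>', '</b>', '< /b>', '<>',
           '5 > 3 < 2 > 1', '<li>[a] < [b]</li>'

foreach ($s in $samples) {
    # Print each input next to its converted form
    '{0,-20} -> {1}' -f $s, (ConvertTo-EscapedBracket $s)
}
```

Two properties of the pattern are worth keeping in mind. `-replace` is case-insensitive by default, so `[a-z]+` also accepts upper-case tag names. On the other hand, `[a-z]+` accepts letters only, so a tag such as `<h1>` would be escaped as well; widening the name part to something like `[a-z][a-z0-9]*` in both regexes is one possible adjustment if such tags occur in your files.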

    For a detailed explanation of the regexes and the option to experiment with them, see this regex101.com page; for simplicity, the two regexes have been merged into a single one with alternation (|), with a placeholder replacement string, &[gl]t;, to symbolize the two distinct replacements in the code above, &gt; and &lt;.
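
The merged form can also be run in a single pass. In Windows PowerShell 5.1 the `-replace` operator does not accept a script block as the replacement, so the sketch below calls `[regex]::Replace()` with a script block cast to a `MatchEvaluator` instead. The group names `lt` and `gt`, the variable names and the sample string are assumptions made for this illustration; they do not come from the regex101 page:

```powershell
# Merged pattern: the first alternative is a stray '<', the second a stray '>'
$pattern = '(?<lt><(?!(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?>))' +
           '|(?<gt>(?<!<(?:/\s*)?[a-z]+(?:\s+[^>]*)?/?)>)'

# Pick the replacement depending on which named group matched
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($match)
    if ($match.Groups['lt'].Success) { '&lt;' } else { '&gt;' }
}

# [regex]::Replace() is case-sensitive by default, unlike -replace
$options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase

$text   = '< /b> and <b>5 > 3</b>'
$result = [regex]::Replace($text, $pattern, $evaluator, $options)
$result   # &lt; /b&gt; and <b>5 &gt; 3</b>
```

Because the look-behind here inspects the original text rather than the output of a first pass, the result should be the same as with the two chained `-replace` operations on the sample, but it is worth checking against your own files.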