html xml xslt transformation redaction

Applying redactions in the form of string substitutions to HTML documents using XSLT


I have a large number of HTML (and possibly other xml) documents that I need to redact.

The redactions are typically of the form "John Doe" -> "[Person A]". The text to be redacted may be in headers or paragraphs, but will almost always be in paragraphs.

Simple string substitutions really. Not very complicated things.

However, I do want to preserve document structure, and I would prefer to not reinvent any wheels. String substitution in the document text may do the job, but also may break document structure, so it will be a last option.

Right now I have stared at XSLT for an hour and tried to force "str:replace" to do my bidding. I will spare you from viewing my feeble attempts that didn't work, but I will ask this: Is there a simple and known way to apply my redactions using XSLT, and could you post it here?

Thank you in advance.

Update: at the request of Martin Honnen I'm adding my input files, as well as the command I used to get the latest error message. From this it will be apparent that I'm a complete n00b when it comes to XSLT :-)

.html file:


    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
      <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
        <title>TodaysDate</title>
        <meta name="created" content="2020-11-04T30:45:00"/>
      </head>
      <body>
        <ol start="2">
          <li><p> John Doe on 9. fux 2057 together with Henry
          Fluebottom formed the company Doe &amp; Fluebottom Widgets
          Inc. </p>
        </ol>
      </body>
    </html>

The XSLT transformation file:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            >
    <xsl:template match="p">
      <xsl:copy>
    <xsl:attribute name="matchesPattern">
      <xsl:copy-of select='str:replace("John Doe", ".*",  "[Person A]")'/>
    </xsl:attribute>
      <xsl:copy-of select='str:replace("Henry Fluebottom", ".*",  "[Person B]")'/>
      </xsl:copy>
    </xsl:template>
    </xsl:stylesheet>

The command and the output:

    $  xsltproc -html transform.xsl example.html
    xmlXPathCompOpEval: function replace bound to undefined prefix str
    xmlXPathCompiledEval: 2 objects left on the stack.
    <?xml version="1.0"?>



        TodaysDate




          <p matchesPattern=""/>

    $

Solution

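  • xsltproc is built on libxslt, which supports various EXSLT functions such as str:replace. The error "function replace bound to undefined prefix str" means the str prefix was never declared, so to use the function you need to declare the EXSLT strings namespace on the stylesheet. Note also that the arguments are str:replace(string, search, replacement), with the text to search in first. The first template below copies everything unchanged, and the second rewrites only the text nodes inside p elements: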

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:str="http://exslt.org/strings"
        exclude-result-prefixes="str"
        version="1.0">
    
        <xsl:template match="@* | node()">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()"/>
            </xsl:copy>
        </xsl:template>
    
        <xsl:template match="p//text()">
            <xsl:value-of select="str:replace(., 'John Doe', '[Person A]')"/>
        </xsl:template>
    
    </xsl:stylesheet>
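
The stylesheet above handles one name. As a minimal sketch of how several redactions might be chained, the variant below nests one str:replace call per name. It assumes libxslt's implementation of str:replace, which can be nested this way; other processors may behave differently. It also wraps the text in normalize-space(), because in the sample input "Henry" and "Fluebottom" are separated by a line break and would otherwise not match the literal search string. That call is my addition, not part of the original answer, and it also strips leading and trailing whitespace from each text node, which can matter when a paragraph contains inline elements.

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:str="http://exslt.org/strings"
        exclude-result-prefixes="str"
        version="1.0">

        <!-- Identity template: copy everything that is not handled below. -->
        <xsl:template match="@* | node()">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()"/>
            </xsl:copy>
        </xsl:template>

        <!-- Collapse whitespace runs first so that a name wrapped across
             two lines still matches, then apply one str:replace per name.
             Assumption: nesting relies on libxslt's str:replace. -->
        <xsl:template match="p//text()">
            <xsl:value-of select="str:replace(
                str:replace(normalize-space(.), 'John Doe', '[Person A]'),
                'Henry Fluebottom', '[Person B]')"/>
        </xsl:template>

    </xsl:stylesheet>

It can be run with the same command as in the question (xsltproc -html transform.xsl example.html). Only the exact strings listed are replaced, so the company name "Doe &amp; Fluebottom Widgets Inc." in the sample would stay as it is unless it gets its own str:replace call.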