I have a large number of HTML (and possibly other xml) documents that I need to redact.
The redactions are typically of the form "John Doe" -> "[Person A]". The text to be redacted may be in headers or paragraphs, but will almost always be in paragraphs.
Simple string substitutions really. Not very complicated things.
However, I do want to preserve document structure, and I would prefer to not reinvent any wheels. String substitution in the document text may do the job, but also may break document structure, so it will be a last option.
Right now I have stared at XSLT for an hour and tried to force "str:replace" to do my bidding. I will spare you from viewing me feeble attempts that didn't work, but I will ask this: Is there a simple and know way to apply my redactions using XSLT, and could you post it here?
Thank you in advance.
Update: at the request of Martin Honnen I'm adding my input files, as well as the command I used to get the latest error message. From this it will be apparent that I'm a complete n00b when it comes to XSLT :-)
.html file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <html> <head> <meta http-equiv="content-type" content="text/html; charset=utf-8"/> <title>TodaysDate</title> <meta name="created" content="2020-11-04T30:45:00"/> </head> <body> <ol start="2"> <li><p> John Doe on 9. fux 2057 together with Henry Fluebottom formed the company Doe &; Fluebottom Widgets Inc. </p> </ol> </body> </html>
The XSLT transformation file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:template match="p">
<xsl:copy>
<xsl:attribute name="matchesPattern">
<xsl:copy-of select='str:replace("John Doe", ".*", "[Person A]")'/>
</xsl:attribute>
<xsl:copy-of select='str:replace("Henry Fluebottom", ".*", "[Person B]")'/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The command and the output:
$ xsltproc -html transform.xsl example.html
xmlXPathCompOpEval: function replace bound to undefined prefix str
xmlXPathCompiledEval: 2 objects left on the stack.
<?xml version="1.0"?>
TodaysDate
<p matchesPattern=""/>
$
xsltproc is based on libxslt and that way supports various EXSLT functions like str:replace
, to use it you will need to declare the namespace
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:str="http://exslt.org/strings"
exclude-result-prefixes="str"
version="1.0">
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="p//text()">
<xsl:value-of select="str:replace(., 'John Doe', '[Person A]')"/>
</xsl:template>
</xsl:stylesheet>