eclipse, xslt-2.0, rtf, saxon, apache-tika

How would I handle RTF hyperlinks using Apache Tika in XSLT?


This question is a follow-up to: What are some methods to converting RTF text nodes in XML using XSLT 2 / Saxon HE 11.3?.

After implementing the solution from that answer, I ran the code against a large dataset. While processing that data, one item in the source RTF caused the application to fail.

The error:

Error on line 11 column 92 of urn:from-string:  SXXP0003   Error reported by XML parser: The element type "a" must be terminated by the matching end-tag "</a>".: The element type "a" must be terminated by the matching end-tag "</a>".

I took a look at the source XML, which contains several RTF HYPERLINK codes. Source:

<SPECORMETHOD>{\rtf1\ansi\deff0\uc1\ansicpg1252\deftab720{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset1 Times New Roman;}{\f2\fnil\fcharset1 WingDings;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red0\green0\blue0;\red128\green128\blue0;}\wpprheadfoot1\paperw12240\paperh15840\margl720\margr720\margt720\margb720\headery720\footery720\endnhere\sectdefaultcl{\*\generator WPTools_5.17;}{\stylesheet{\s1\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Normal;}{\s2\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20 Default Paragraph Font;}{\s3\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs20\cf3\ul\sbasedon2 Hyperlink;}}{\pard\plain\plain\f1\fs36\par\pard\plain\plain\f1\fs36\par\plain\f1\fs28\tab 10\'94Flour Tortilla\par\plain\f1\fs28\tab Caesar \f1\b\i DIP\f1\i0 : {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc"}}{\*\fldtitle{..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Dip, Caesar.doc\plain\f1\fs28\b}}}\par\plain\f1\fs28\tab Ripped Romaine\par\plain\f1\fs28\tab Blackened Salmon julienne\par\plain\f1\fs28\tab Shaved Red Onion\par\plain\f1\fs28\tab Julienne Tomato\par\plain\f1\fs28\tab Grated Parmesan\par\plain\f1\fs28\tab Blackening spice: {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SPICE\\\\Blackening Spice.doc"}}{\*\fldtitle{..\\\\..\\\\SPICE\\\\Blackening Spice.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Blackening Spice.doc\plain\f1\fs28}}}\par\pard\plain\plain\f1\fs28\par\plain\f1\fs28 Method\par\plain\f1\fs28 Procedure Text \par\pard\plain\plain\f1\fs36\par}}</SPECORMETHOD>

For my purposes, the URL does not need to be functional. But for the sake of this RTF conversion project, what would be needed to have the hyperlink codes handled correctly, or to output them as text for reference? One option would be to intercept the element in the XSLT, look for the HYPERLINK code, and replace it with regular text.

The desired output for a hyperlink from this example would be (text only):

CAESAR DIP: ..\..\SAUCES\Dips\Dip, Caesar.doc

The only modification to the original code was in XSLT to do a check for an empty element when processing the <SPECORMETHOD>.

<xsl:choose>
    <xsl:when test="string-length(SPECORMETHOD) &gt; 0">
        <rtf-as-xhtml>
            <xsl:sequence select="tika:parse-rtf(SPECORMETHOD[string-length(.) &gt; 0])"/>
        </rtf-as-xhtml>
    </xsl:when>
    <xsl:otherwise>
        <xsl:value-of select="'[EMPTY]'"/>
    </xsl:otherwise>
</xsl:choose>

I've built this project in Eclipse 2022-12 (4.26.0). It's a Maven project using Apache Tika 2.7.0, and Saxon HE 11.3, using Java SE 1.8. Special thanks to Martin H.


Solution

  • I ran your sample RTF through Tika; unfortunately, the supposed XHTML output is not well-formed:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
    <meta name="Content-Type" content="application/rtf" />
    <title></title>
    </head>
    <body><p />
    <p />
    <p> 10”Flour Tortilla</p>
    <p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
    <p><b />    Ripped Romaine</p>
    <p> Blackened Salmon julienne</p>
    <p> Shaved Red Onion</p>
    <p> Julienne Tomato</p>
    <p> Grated Parmesan</p>
    <p> Blackening spice: <a href="..\\..\\SPICE\\Blackening Spice.doc">Blackening Spice.doc</a></p>
    <p />
    <p>Method</p>
    <p>Procedure Text </p>
    <p />
    <p />
    </body></html>
    

    So the error is in the fragment <p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>: the a element is still open when </b> appears, and no </a> follows.
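
To check a diagnosis like this without Saxon or Tika in the loop, the fragment can be fed to the JDK's own XML parser. The following is only a minimal sketch: the class and method names (WellFormedCheck, parseError) are made up for illustration, the "repaired" string is a hand-edited variant rather than anything Tika produced, and the exact error text depends on the parser implementation in use.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or Saxon.
class WellFormedCheck {

    /** Returns null if the XML parses, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors but does not print them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The fragment Tika produced for the first hyperlink in the sample.
        String fromTika = "<p> Caesar <b><i>DIP</i>: "
                + "<a href=\"..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc\">Dip, Caesar.doc</b><b /></b></p>";
        System.out.println("Tika fragment: " + parseError(fromTika));

        // Hand-edited variant with properly nested tags, for comparison only.
        String repaired = "<p> Caesar <b><i>DIP</i>: <a href=\"x\">Dip, Caesar.doc</a></b></p>";
        System.out.println("Repaired fragment: " + parseError(repaired));
    }
}
```

The first call should report a parse error and the second should return null, which narrows the problem down to the tag nesting rather than anything specific to Saxon.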

    I don't know for sure whether the input is somehow not proper RTF, but it looks more like a bug in the Tika parser and ToXMLContentHandler.

    I have reported this as a potential issue: https://issues.apache.org/jira/browse/TIKA-3972

    In the end, with the help of the Saxonica guys (thanks to Michael Kay and Norm Walsh), I found what is probably a better way to use Saxon with the Tika parser. Instead of serializing with Tika's ToXMLContentHandler and feeding its toString() result to Saxon's DocumentBuilder, it is possible to pass a Saxon BuildingContentHandler directly to Tika's parser and get an XdmNode back:

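    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URISyntaxException;
    import java.nio.charset.StandardCharsets;
    import net.sf.saxon.s9api.BuildingContentHandler;
    import net.sf.saxon.s9api.DocumentBuilder;
    import net.sf.saxon.s9api.Processor;
    import net.sf.saxon.s9api.SaxonApiException;
    import net.sf.saxon.s9api.XdmNode;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.xml.sax.SAXException;

    public static XdmNode parseRtfToHTML2(String rtf, Processor processor) throws IOException, SAXException, TikaException, URISyntaxException, SaxonApiException {
        DocumentBuilder docBuilder = processor.newDocumentBuilder();
        // Saxon builds its tree from the SAX events Tika emits; no XHTML string is re-parsed.
        BuildingContentHandler handler = docBuilder.newBuildingContentHandler();

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.UTF_8))) {
            parser.parse(stream, handler, metadata);
            return handler.getDocumentNode();
        } catch (SaxonApiException e) {
            throw new RuntimeException(e);
        }
    }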
    

    Using that approach, at least in a short test, no error is thrown for the hyperlink RTF example. See the updated project https://github.com/martin-honnen/SaxonTikaRtfTest1 for the code in more context.
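
If the link targets are only needed as text, as the question suggests, one option might be to rewrite the hyperlink fields in the RTF string before handing it to Tika, so that no a element is generated at all. Below is a rough sketch using only the JDK. The class and method names (RtfHyperlinkFlattener, flattenHyperlinks, matchingBrace) are made up for illustration. It assumes the fields look like the ones in the sample ({\field{\*\fldinst{HYPERLINK "..."}}...}) and that each path separator in the target is written as four backslashes, which it reduces to two so that RTF reads them as one literal backslash. It has not been run against the full dataset, and other RTF generators may write fields differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not part of Tika or Saxon.
class RtfHyperlinkFlattener {

    private static final String FIELD_START = "{\\field";
    // Captures the quoted target of a HYPERLINK field instruction.
    private static final Pattern TARGET = Pattern.compile("HYPERLINK\\s+\"([^\"]*)\"");

    /** Replaces each HYPERLINK field group with a plain group holding the link target. */
    static String flattenHyperlinks(String rtf) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = rtf.indexOf(FIELD_START, pos);
            if (start < 0) break;
            int end = matchingBrace(rtf, start);
            if (end < 0) break; // unbalanced group: copy the rest unchanged
            String field = rtf.substring(start, end + 1);
            out.append(rtf, pos, start);
            Matcher m = TARGET.matcher(field);
            if (m.find()) {
                // Four backslashes in the field instruction become two,
                // which RTF reads as one literal backslash in running text.
                String text = m.group(1).replace("\\\\\\\\", "\\\\");
                out.append('{').append(text).append('}');
            } else {
                out.append(field); // some other kind of field: keep it as-is
            }
            pos = end + 1;
        }
        out.append(rtf, pos, rtf.length());
        return out.toString();
    }

    /** Returns the index of the '}' that closes the '{' at position open, or -1. */
    static int matchingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') { i++; continue; } // skip the character after a backslash (\{, \}, \\)
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String rtf = "Blackening spice: {\\field{\\*\\fldinst{HYPERLINK "
                + "\"..\\\\\\\\..\\\\\\\\SPICE\\\\\\\\Blackening Spice.doc\"}}"
                + "{\\fldrslt{\\ul Blackening Spice.doc\\plain}}}\\par";
        System.out.println(flattenHyperlinks(rtf));
    }
}
```

The result of flattenHyperlinks(...) would then be passed to the RTF parsing method in place of the original string. A brace-matching scan is used rather than a single regular expression because the field groups are nested.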