xml xpath xslt apache-fop hocr

How to convert Tesseract output (hOCR) into a plain txt file with FOP (generates empty output)?


The resulting output: a txt file with empty lines.

The expected output: a txt file containing the words of the text "Привет Мир! Это я, обычный неработающий текст или рыба".

What am I doing wrong? I also tried nested xsl:for-each instructions, which give the same behavior.


Solution

  • I see 2 problems in your attempt:

    1. Your instruction:

      <xsl:for-each select="//div [@class='ocr_page'] /div [@class='ocr_carea'] / p [@class='ocr_par'] / span[@class='ocr_line'] / span [@class='ocrx_word']">
      

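      selects nothing, because your input XML puts all its elements in a namespace. To solve this, declare that namespace in the stylesheet, bind it to a prefix, and use the prefix on every element name in the XPath expression.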

    2. Once you have it working, this instruction will put you in the context of span. From this context, your next instruction:

       <xsl:value-of select="normalize-space(span [@class='ocrx_word'])" disable-output-escaping="yes"/>
      

      also selects nothing, because span is not a child of itself. It should be:

      <xsl:value-of select="normalize-space(.)"/>
      

      and I doubt you want to disable output escaping in a stylesheet producing an XML result.
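
Putting both fixes together, a minimal sketch might look like the stylesheet below. It rests on several assumptions that are not confirmed by the question: that the hOCR file uses the XHTML namespace `http://www.w3.org/1999/xhtml` (check the `xmlns` attribute on the root element of your file), that the prefix `x` is free to use (any prefix works), and that the stylesheet writes plain text directly with `method="text"` rather than producing XSL-FO for FOP to render. If the result has to go through FOP, the same prefixed `select` expressions would apply inside the templates that build the FO output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="x">

  <!-- Assumption: plain-text output is wanted, so escaping is not an issue -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- Every element step carries the x: prefix bound to the input namespace -->
    <xsl:for-each select="//x:div[@class='ocr_page']/x:div[@class='ocr_carea']/x:p[@class='ocr_par']/x:span[@class='ocr_line']">
      <!-- Context is now one ocr_line; its word spans are children -->
      <xsl:for-each select="x:span[@class='ocrx_word']">
        <!-- Context is the word span itself, so "." is its text -->
        <xsl:value-of select="normalize-space(.)"/>
        <!-- Single space between words, none after the last one -->
        <xsl:if test="position() != last()">
          <xsl:text> </xsl:text>
        </xsl:if>
      </xsl:for-each>
      <!-- One output line per recognized line -->
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The nested xsl:for-each is only there to keep one output line per `ocr_line`; a single loop over the word spans works just as well if line breaks do not matter.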