xml list xslt docbook

Converting lists from Word XML to DocBook with XSLT


I need to convert a batch of Word documents (Word 2003 XML) to DocBook 5.1 (30 documents, approx. 80 pages each). I have written a stylesheet for this and it works so far, but I am stuck on the following problem:

There are many lists in the documents. The Word XML marks list items (<w:listPr>), but as far as I can see it does not indicate where a list begins and ends. There are only the individual list items.

In XSLT I can already turn each list paragraph into a <listitem>, but I don't know how to wrap a run of list items in the enclosing list element (<itemizedlist>).

One way might be to collect the list items with for-each-group or something similar and copy the text content of the nodes into my target document. But the list items contain other markup, such as <w:instrText> (DocBook: <indexterm>), which must not be lost.

How can I handle this?

Word 2003 XML source (excerpt)

<w:p>
    <w:pPr>
        <w:pStyle w:val="2Standard"/>
        <w:listPr>
            <w:ilvl w:val="0"/>
            <w:ilfo w:val="14"/>
            <wx:t wx:val="·"/>
            <wx:font wx:val="Symbol"/>
        </w:listPr>
    </w:pPr>
    <w:r>
        <w:t>die Prognose der Wirtschaft</w:t>
    </w:r>
    <w:r>
        <w:fldChar w:fldCharType="begin"/>
    </w:r>
    <w:r>
        <w:instrText> XE "Wirtschaft"</w:instrText>
    </w:r>
    <w:r>
        <w:fldChar w:fldCharType="end"/>
    </w:r>
</w:p>
<w:p>
    <w:pPr>
        <w:pStyle w:val="2Standard"/>
        <w:listPr>
            <w:ilvl w:val="0"/>
            <w:ilfo w:val="14"/>
            <wx:t wx:val="·"/>
            <wx:font wx:val="Symbol"/>
        </w:listPr>
    </w:pPr>
    <w:r>
        <w:t>die Beratung der Politik.</w:t>
    </w:r>
</w:p>
 

Desired Output


<itemizedlist>
     <listitem>
         <para>die Prognose der Wirtschaft 
            <indexterm><primary>Wirtschaft</primary></indexterm>
         </para>
      </listitem>
      <listitem>
         <para>die Beratung der Politik.</para>
      </listitem>
</itemizedlist>

First Stylesheet approach

<xsl:template match="w:p">
        <xsl:choose>
            <xsl:when test="w:pPr/w:listPr/w:ilvl/@w:val = '0'">
                <listitem>
                    <para>
                       <xsl:apply-templates select="w:r"/>
                    </para>
                </listitem>
            </xsl:when>
            <xsl:otherwise>
                <para>
                    <xsl:apply-templates/>
                </para>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <xsl:template match="w:r">
        <xsl:choose>
            <xsl:when test="w:instrText">
                <indexterm>
                    <primary>
                        <xsl:apply-templates select="*/text()"/>
                    </primary>
                </indexterm>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates select="w:t"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

Solution

  • I think it should be possible by grouping adjacent list paragraphs with xsl:for-each-group and group-adjacent, along the lines of

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xpath-default-namespace="http://example.com/"
        exclude-result-prefixes="#all"
        version="3.0">
    
      <xsl:output method="xml" indent="yes" suppress-indentation="indexterm"/>
      <xsl:strip-space elements="*"/>
    
      <xsl:template match="root">
        <xsl:for-each-group select="p" group-adjacent="boolean(self::p[pPr/listPr])">
            <xsl:choose>
              <xsl:when test="current-grouping-key()">
                <itemizedlist>
                  <xsl:apply-templates select="current-group()" mode="list"/>
                </itemizedlist>
              </xsl:when>
              <xsl:otherwise>
                <xsl:apply-templates select="current-group()"/>
              </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each-group>
      </xsl:template>
      
      <xsl:template match="p" mode="list">
        <listitem>
          <para>
            <xsl:apply-templates mode="#current"/>
          </para>
        </listitem>
      </xsl:template>
      
      <xsl:template match="instrText" mode="list">
        <indexterm>
          <primary>
            <xsl:apply-templates mode="#current"/>
          </primary>
        </indexterm>
      </xsl:template>
      
    </xsl:stylesheet>
    

    This transforms

    <w:root xmlns:w="http://example.com/" xmlns:wx="http://example.com/wx">
        <w:p>
            <w:pPr>
                <w:pStyle w:val="2Standard"/>
                <w:listPr>
                    <w:ilvl w:val="0"/>
                    <w:ilfo w:val="14"/>
                    <wx:t wx:val="·"/>
                    <wx:font wx:val="Symbol"/>
                </w:listPr>
            </w:pPr>
            <w:r>
                <w:t>die Prognose der Wirtschaft</w:t>
            </w:r>
            <w:r>
                <w:fldChar w:fldCharType="begin"/>
            </w:r>
            <w:r>
                <w:instrText> XE "Wirtschaft"</w:instrText>
            </w:r>
            <w:r>
                <w:fldChar w:fldCharType="end"/>
            </w:r>
        </w:p>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="2Standard"/>
                <w:listPr>
                    <w:ilvl w:val="0"/>
                    <w:ilfo w:val="14"/>
                    <wx:t wx:val="·"/>
                    <wx:font wx:val="Symbol"/>
                </w:listPr>
            </w:pPr>
            <w:r>
                <w:t>die Beratung der Politik.</w:t>
            </w:r>
        </w:p>
    </w:root>
    

    into

    <itemizedlist>
       <listitem>
          <para>die Prognose der Wirtschaft<indexterm><primary> XE "Wirtschaft"</primary></indexterm>
          </para>
       </listitem>
       <listitem>
          <para>die Beratung der Politik.</para>
       </listitem>
    </itemizedlist>
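
    The result above still carries the raw field code (XE "Wirtschaft") inside <primary>, whereas the desired output only has the term itself. A minimal sketch of one way to strip the field code, assuming the index entries always have the simple form XE "term" (subentries such as "main:sub" and field switches are not handled here). It is meant to be added to the stylesheet above and relies on its xpath-default-namespace:

    <!-- Hypothetical extra template: keep only the quoted term of an XE field.
         The predicate gives this pattern a higher default priority than the
         plain match="instrText" template, so both can coexist. -->
    <xsl:template match="instrText[matches(., '^\s*XE\s')]" mode="list">
      <indexterm>
        <primary>
          <!-- Capture the text between the first pair of double quotes -->
          <xsl:value-of
              select="replace(., '^\s*XE\s+&quot;([^&quot;]*)&quot;.*$', '$1', 's')"/>
        </primary>
      </indexterm>
    </xsl:template>

    For the sample input this should produce <indexterm><primary>Wirtschaft</primary></indexterm>. If the text does not match the expected shape, replace() returns it unchanged, so nothing is silently dropped.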
    

    Consider providing namespace well-formed samples/snippets next time.
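
    Because the snippet in the question declares no namespaces, the stylesheet above uses a placeholder namespace (http://example.com/). A sketch of how the same grouping might look against a real document, assuming the usual Word 2003 WordprocessingML namespace and that the list paragraphs are siblings inside some container element. As a variation, it groups on the w:ilfo value instead of a boolean, so that two different lists that directly follow each other are kept apart. Nested list levels (w:ilvl greater than 0) are not handled:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        exclude-result-prefixes="#all"
        version="3.0">
      <!-- For DocBook 5.x output, xmlns="http://docbook.org/ns/docbook"
           could be added on xsl:stylesheet; it is left out here so the
           result matches the output shown above. -->

      <xsl:output method="xml" indent="yes" suppress-indentation="indexterm"/>
      <xsl:strip-space elements="*"/>

      <!-- Any element that has list paragraphs among its children -->
      <xsl:template match="*[w:p/w:pPr/w:listPr]">
        <!-- Key: the w:ilfo value for list paragraphs, '' for anything else -->
        <xsl:for-each-group select="*"
            group-adjacent="string(self::w:p/w:pPr/w:listPr/w:ilfo/@w:val)">
          <xsl:choose>
            <xsl:when test="current-grouping-key() ne ''">
              <itemizedlist>
                <xsl:apply-templates select="current-group()" mode="list"/>
              </itemizedlist>
            </xsl:when>
            <xsl:otherwise>
              <xsl:apply-templates select="current-group()"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each-group>
      </xsl:template>

      <xsl:template match="w:p" mode="list">
        <listitem>
          <para>
            <xsl:apply-templates mode="#current"/>
          </para>
        </listitem>
      </xsl:template>

      <xsl:template match="w:instrText" mode="list">
        <indexterm>
          <primary>
            <xsl:apply-templates mode="#current"/>
          </primary>
        </indexterm>
      </xsl:template>

    </xsl:stylesheet>

    Note that attributes such as w:val are namespace-qualified and are not affected by xpath-default-namespace, which is why this variant declares the w prefix explicitly instead.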