ms-word apache-poi openxml

How to Compute Left Indentation for Numbered Paragraphs in OOXML (document.xml) from a DOCX File?


I'm working with a DOCX file and examining the document.xml (OOXML) file to determine the left indentation of paragraphs. Specifically, I'm trying to calculate the distance from the left margin to the first pixel of a numbered paragraph.

I've found this to be quite tricky, especially with numbered paragraphs. In some cases, the indentation equals indent, but this is not consistent. Sometimes it seems to be indent - hanging-indent.

I suspect there are other factors that influence the indentation calculation, as both MS Word and LibreOffice render the indentations correctly.

Can anyone explain the algorithm or rules used to compute the left indentation for numbered paragraphs in OOXML? Any insights or references to relevant documentation would be greatly appreciated.

(The data you see in the screenshots is in EDN format, but it is obtained directly from the OOXML via Apache POI.)

I'm adding more info / data, hopefully helping to understand my question. Below are two images of the same document, each with a different paragraph selected so that the ruler reflects that paragraph.

good morning paragraph

xxxxxxxxx paragraph

the corresponding ooxml file

public static double computeIndent(Indentation indentation) {
    double leftIndent = indentation.indLeftInTwip == -1 ? 0.0 : indentation.indLeftInTwip;
    double firstLineIndent = indentation.indFirstLineInTwip == -1 ? 0.0 : indentation.indFirstLineInTwip;
    double hangingIndent = indentation.indHangingInTwip == -1 ? 0.0 : indentation.indHangingInTwip;

    double computedIndent;

    if (indentation.indLeftInTwip != -1 && indentation.indHangingInTwip != -1) {
        computedIndent = leftIndent - hangingIndent;
    } else if (indentation.indLeftInTwip != -1 && indentation.indFirstLineInTwip != -1) {
        computedIndent = leftIndent + firstLineIndent;
    } else if (indentation.indLeftInTwip != -1) {
        computedIndent = leftIndent;
    } else if (indentation.indFirstLineInTwip != -1) {
        computedIndent = firstLineIndent;
    } else {
        computedIndent = 0.0;
    }

    return computedIndent;
}

Solution

  • The problem is that, although the Office Open XML file format has been published, the Microsoft Office application software is still closed source. Thus nobody outside Microsoft knows how Microsoft Word really processes Office Open XML.

    According to the ECMA-376 Office Open XML file formats specification it should be like this:

    17.3.1.12 ind (Paragraph Indentation)

    This element specifies the set of indentation properties applied to the current paragraph. Indentation settings are overriden on an individual basis - if any single attribute on this element is omitted on a given paragraph, its value is determined by the setting previously set at any level of the style hierarchy (i.e. that previous setting remains unchanged). If any single attribute on this element is never specified in the style hierarchy, then no indentation of that indentation type is applied to the paragraph.
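
To make this per-attribute fallback concrete, here is a minimal sketch. It is not Apache POI code: the class name IndFallback, the method resolve and the twip values are made up for illustration, and null stands for an omitted attribute. Each list is ordered from the most specific source to the least specific one.

```java
import java.util.Arrays;
import java.util.List;

public class IndFallback {

    // Returns the first value that is set (non-null), walking from the most
    // specific source to the least specific one; 0 if no source sets it.
    static int resolve(List<Integer> mostToLeastSpecific) {
        for (Integer value : mostToLeastSpecific) {
            if (value != null) {
                return value;
            }
        }
        return 0; // never specified anywhere: no indentation of that type
    }

    public static void main(String[] args) {
        // Hypothetical values in twips; null means "attribute omitted".
        // Order: direct formatting, numbering level, paragraph style, document defaults.
        List<Integer> left = Arrays.asList(null, 720, 360, null);
        List<Integer> hanging = Arrays.asList(null, 360, null, null);
        List<Integer> right = Arrays.asList(null, null, null, null);

        System.out.println("left=" + resolve(left));
        System.out.println("hanging=" + resolve(hanging));
        System.out.println("right=" + resolve(right));
    }
}
```

With these made-up values, left and hanging both come from the numbering level, and right falls through every source and ends up as 0. The point is that each attribute is resolved on its own, so one paragraph can take different attributes from different sources.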

    This "style hierarchy" is specified like so:

    17.7.2 Style Hierarchy

    This process can be described as follows:

    • First, the document defaults are applied to all runs and paragraphs in the document.
    • Next, the table style properties are applied to each table in the document, following the conditional formatting inclusions and
      exclusions specified per table.
    • Next, numbered item and paragraph properties are applied to each paragraph formatted with a numbering style.
    • Next, paragraph and run properties are applied to each paragraph as defined by the paragraph style.
    • Next, run properties are applied to each run with a specific character style applied.
    • Finally, we apply direct formatting (paragraph or run properties not from styles). If this direct formatting includes numbering, that
      numbering + the associated paragraph properties are applied.

    If, and only if, Microsoft Word processes Office Open XML exactly as specified, then the following should be true:

    The paragraph indentation is part of the paragraph properties. These can be stored in document.xml directly in the paragraph XML (direct formatting). If so, all indentation settings given there apply; only the ones not set there have to come from elsewhere.

    Run properties cannot set indentation settings.

    For indentation settings not set directly, the paragraph properties of the abstract numbering linked to the paragraph apply. The paragraph is linked to the numbering by a num-ID and a num-indentation-level. Together these point to an abstract numbering in numbering.xml and to a specific level within it. That level may have paragraph properties containing indentation settings.
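
The two-step lookup can be sketched without any library like this. Everything here is a hypothetical stand-in: the maps replace the content of numbering.xml, the IDs and twip values are invented, and treating a missing level as level 0 is only my assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class NumberingLookup {

    // Hypothetical stand-ins for numbering.xml content:
    // numId -> abstractNumId
    static final Map<Integer, Integer> NUM_TO_ABSTRACT = new HashMap<>();
    // abstractNumId -> (ilvl -> {left, hanging} in twips)
    static final Map<Integer, Map<Integer, int[]>> ABSTRACT_LEVELS = new HashMap<>();

    static {
        NUM_TO_ABSTRACT.put(1, 0);
        Map<Integer, int[]> levels = new HashMap<>();
        levels.put(0, new int[] {720, 360});
        levels.put(1, new int[] {1440, 360});
        ABSTRACT_LEVELS.put(0, levels);
    }

    // Returns {left, hanging} for the given numId/ilvl, or null if the chain breaks.
    static int[] levelIndent(int numId, Integer ilvl) {
        Integer abstractNumId = NUM_TO_ABSTRACT.get(numId);
        if (abstractNumId == null) {
            return null; // numId does not exist
        }
        Map<Integer, int[]> levels = ABSTRACT_LEVELS.get(abstractNumId);
        if (levels == null) {
            return null; // abstract numbering does not exist
        }
        // Assumption: a paragraph without an explicit level uses level 0.
        return levels.get(ilvl == null ? 0 : ilvl);
    }

    public static void main(String[] args) {
        int[] ind = levelIndent(1, 1);
        System.out.println("left=" + ind[0] + " hanging=" + ind[1]);
    }
}
```

The main method looks up level 1 of the numbering with num-ID 1 and prints left=1440 hanging=360. A num-ID that does not exist yields null, in which case the numbering contributes no indentation.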

    For indentation settings still not set after that, the paragraph properties of the style linked to the paragraph apply. A style is linked to the paragraph by a style-ID. This points to a style in styles.xml, which may have paragraph properties containing indentation settings.

    It is not clear to me whether table style properties may have paragraph properties too. But I don't think so, as table cells contain default paragraphs.

    The document defaults may set up paragraph properties, but I have never seen document defaults set up indentations. Nevertheless, we test them.

    So, to get the effective paragraph indentation using Apache POI:

    1. First, read the indentation from the paragraph properties of the XWPFParagraph.
    2. Second, read the indentation from the paragraph properties of the matching level of the XWPFAbstractNum and fill in only the settings that were not set in the first step.
    3. Third, read the indentation from the paragraph properties of the XWPFStyle and fill in only the settings that were not set in the first and second steps.
    4. Fourth, read the indentation from the default paragraph properties of the XWPFDefaultParagraphStyle and fill in only the settings that were not set in the first three steps.

    For me this yields the correct indentation in all my test cases. But who knows... ;-) See the first paragraph of this answer.

    Complete code to test:

    import java.io.FileInputStream;
    import org.apache.poi.xwpf.usermodel.*;
    import java.math.BigInteger;
    
    public class WordGetParagraphIndentation {
        
     static Indentation getParagraphIndentationFromNumIlvl(XWPFAbstractNum abstractNum, BigInteger numIlvl, Indentation indentation) {
      boolean found = false;
      for (org.openxmlformats.schemas.wordprocessingml.x2006.main.CTLvl lvl : abstractNum.getCTAbstractNum().getLvlList()) {
       if (numIlvl == null) { // no Ilvl, take the first lvl
        found = true;
       } else if (numIlvl.equals(lvl.getIlvl())) { // numIlvl is non-null here; find the matching lvl
        found = true;  
       }
       if (found) {
        if (lvl.getPPr() != null) {
         if (lvl.getPPr().getInd() != null) {
          if (lvl.getPPr().getInd().getLeft() != null) {
           if(indentation.indLeftInTwip == -1) indentation.indLeftInTwip = Integer.valueOf(String.valueOf(lvl.getPPr().getInd().getLeft()));
          }
          if (lvl.getPPr().getInd().getHanging() != null) {
           if(indentation.indHangingInTwip == -1) indentation.indHangingInTwip = Integer.valueOf(String.valueOf(lvl.getPPr().getInd().getHanging()));
          }
          if (lvl.getPPr().getInd().getFirstLine() != null) {
           if(indentation.indFirstLineInTwip == -1) indentation.indFirstLineInTwip = Integer.valueOf(String.valueOf(lvl.getPPr().getInd().getFirstLine()));
          }
          if (lvl.getPPr().getInd().getRight() != null) {
           if(indentation.indRightInTwip == -1) indentation.indRightInTwip = Integer.valueOf(String.valueOf(lvl.getPPr().getInd().getRight()));
          }
         }
        }
        break;  
       }
      }
      return indentation; 
     }
     
     static Indentation getParagraphIndentationFromStyle(XWPFStyle style, Indentation indentation) {
      org.openxmlformats.schemas.wordprocessingml.x2006.main.CTStyle ctStyle = style.getCTStyle();
      if (ctStyle.getPPr() != null) {
       if (ctStyle.getPPr().getInd() != null) {
        if (ctStyle.getPPr().getInd().getLeft() != null) {
         if(indentation.indLeftInTwip == -1) indentation.indLeftInTwip = Integer.valueOf(String.valueOf(ctStyle.getPPr().getInd().getLeft()));
        }
        if (ctStyle.getPPr().getInd().getHanging() != null) {
         if(indentation.indHangingInTwip == -1) indentation.indHangingInTwip = Integer.valueOf(String.valueOf(ctStyle.getPPr().getInd().getHanging()));
        }
        if (ctStyle.getPPr().getInd().getFirstLine() != null) {
         if(indentation.indFirstLineInTwip == -1) indentation.indFirstLineInTwip = Integer.valueOf(String.valueOf(ctStyle.getPPr().getInd().getFirstLine()));
        }
        if (ctStyle.getPPr().getInd().getRight() != null) {
         if(indentation.indRightInTwip == -1) indentation.indRightInTwip = Integer.valueOf(String.valueOf(ctStyle.getPPr().getInd().getRight()));
        }
       }
      }
      return indentation; 
     }
     
     static Indentation getParagraphIndentationFromDefaultParagraphStyle(XWPFDefaultParagraphStyle defaultParagraphStyle, Indentation indentation) {
      try {
       java.lang.reflect.Method getPPr = XWPFDefaultParagraphStyle.class.getDeclaredMethod("getPPr");
       getPPr.setAccessible(true);
       org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPrGeneral ctPPrGeneral = (org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPrGeneral)getPPr.invoke(defaultParagraphStyle);
       if (ctPPrGeneral != null) {
        if (ctPPrGeneral.getInd() != null) {
         if (ctPPrGeneral.getInd().getLeft() != null) {
          if(indentation.indLeftInTwip == -1) indentation.indLeftInTwip = Integer.valueOf(String.valueOf(ctPPrGeneral.getInd().getLeft()));
         }
         if (ctPPrGeneral.getInd().getHanging() != null) {
          if(indentation.indHangingInTwip == -1) indentation.indHangingInTwip = Integer.valueOf(String.valueOf(ctPPrGeneral.getInd().getHanging()));
         }
         if (ctPPrGeneral.getInd().getFirstLine() != null) {
          if(indentation.indFirstLineInTwip == -1) indentation.indFirstLineInTwip = Integer.valueOf(String.valueOf(ctPPrGeneral.getInd().getFirstLine()));
         }
         if (ctPPrGeneral.getInd().getRight() != null) {
          if(indentation.indRightInTwip == -1) indentation.indRightInTwip = Integer.valueOf(String.valueOf(ctPPrGeneral.getInd().getRight()));
         }
        }
       }
      } catch (Exception ex) {
       ex.printStackTrace();
      }
      return indentation; 
     }
                   
     static Indentation getParagraphIndentation(XWPFParagraph paragraph) {
      Indentation indentation = new Indentation();
    
      // first test indentation via paragraph properties of XWPFParagraph
      indentation.indLeftInTwip = paragraph.getIndentationLeft();
      indentation.indHangingInTwip = paragraph.getIndentationHanging();
      indentation.indFirstLineInTwip = paragraph.getIndentationFirstLine();
      indentation.indRightInTwip = paragraph.getIndentationRight();
     
      // second test indentation via paragraph properties of XWPFAbstractNum 
      if (indentation.notAll()) {
       BigInteger numID = paragraph.getNumID();
       BigInteger numIlvl = paragraph.getNumIlvl();
       if (numID != null) {
        XWPFNumbering numbering = paragraph.getDocument().getNumbering();
        if (numbering != null) {
         XWPFNum num = numbering.getNum(numID);
         if (num != null) {
          BigInteger abstractNumID = num.getCTNum().getAbstractNumId().getVal();
          XWPFAbstractNum abstractNum = numbering.getAbstractNum(abstractNumID);
          indentation = getParagraphIndentationFromNumIlvl(abstractNum, numIlvl, indentation);
         }
        }
       }   
      }
    
      // third test indentation via paragraph properties of XWPFStyle 
      if (indentation.notAll()) {
       String styleID = paragraph.getStyleID();
       if (styleID != null) {
        XWPFStyles styles = paragraph.getDocument().getStyles();
        if (styles != null) {
         XWPFStyle style = styles.getStyle(styleID);
         if (style != null) {
          indentation = getParagraphIndentationFromStyle(style, indentation);
         }
        }   
       }
      }
      
      // fourth test indentation via paragraph properties of XWPFDefaultParagraphStyle
      if (indentation.notAll()) {
       XWPFStyles styles = paragraph.getDocument().getStyles();
       if (styles != null) {
        XWPFDefaultParagraphStyle defaultParagraphStyle = styles.getDefaultParagraphStyle();
        indentation = getParagraphIndentationFromDefaultParagraphStyle(defaultParagraphStyle, indentation);    
       }
      }
     
      // fifth ???
      
      return indentation;
     }
    
     public static void main(String[] args) { 
      try ( XWPFDocument document = new XWPFDocument(new FileInputStream("./WordDocument.docx")); ) {
      
       for (XWPFParagraph paragraph : document.getParagraphs()) {
        Indentation indentation = getParagraphIndentation(paragraph);
        System.out.println(paragraph.getText());
        System.out.println(indentation);
       }
       
      } catch (Exception ex) {
       ex.printStackTrace();
      }
     }
     
     static class Indentation {
      int indLeftInTwip;
      int indHangingInTwip;
      int indFirstLineInTwip;
      int indRightInTwip;
      Indentation() {
       this.indLeftInTwip = -1;
       this.indHangingInTwip = -1;
       this.indFirstLineInTwip = -1;
       this.indRightInTwip = -1;
      }
      boolean notAll() {
       return this.indLeftInTwip == -1 || this.indHangingInTwip == -1 || this.indFirstLineInTwip == -1 || this.indRightInTwip == -1;
      }
      // Converts twips to millimeters, treating -1 (not set) as 0; does not modify the fields.
      private static double toMm(int twips) {
       return (twips == -1 ? 0 : twips) / 1440.0 * 25.4;
      }
      @Override
      public String toString() {
       return "left:=" + toMm(this.indLeftInTwip) + "mm" + "\n"
            + "hanging:=" + toMm(this.indHangingInTwip) + "mm" + "\n"
            + "first line:=" + toMm(this.indFirstLineInTwip) + "mm" + "\n"
            + "right:=" + toMm(this.indRightInTwip) + "mm" ;
      }
     }
    }
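
To come back to the computeIndent from the question: once left, hanging and first line are resolved as above, the start of the first line could be computed roughly as follows. This is only a sketch under assumptions. The class and method names are mine, -1 means "not set" as in the Indentation class, and it relies on my reading of 17.3.1.12 that firstLine and hanging are mutually exclusive, with firstLine ignored if both are specified. It covers only the ind attributes, not anything else a renderer may take into account.

```java
public class FirstLineOffset {

    static final double TWIPS_PER_INCH = 1440.0;
    static final double MM_PER_INCH = 25.4;

    // Offset of the start of the first line from the left page margin, in twips.
    // Arguments use -1 for "not set".
    static int firstLineStartTwips(int left, int hanging, int firstLine) {
        int base = (left == -1) ? 0 : left;
        if (hanging != -1) {
            return base - hanging; // hanging takes precedence over firstLine
        }
        if (firstLine != -1) {
            return base + firstLine;
        }
        return base;
    }

    // Converts twips (1/1440 inch) to millimeters.
    static double twipsToMm(int twips) {
        return twips / TWIPS_PER_INCH * MM_PER_INCH;
    }

    public static void main(String[] args) {
        // Hypothetical numbered paragraph: left 720 twips, hanging 360 twips.
        int start = firstLineStartTwips(720, 360, -1);
        System.out.println(start + " twips = " + twipsToMm(start) + " mm");
    }
}
```

The difference from the question's version is that the values passed in are the resolved ones from the whole hierarchy, not only those found directly on the paragraph. That may explain why the result sometimes looked like indent and sometimes like indent - hanging-indent.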