java ms-word xml-parsing apache-poi xwpf

Is there a way to extract the position and dimensions of text boxes from a Word document using Apache POI?


I am trying to extract information about the positions and dimensions of text boxes in a Word document using Apache POI. Aspose.Words has methods and classes to handle text boxes and other shapes. Are similar ones available in Apache POI? If there is a way to extract the position and dimensions of text boxes using Apache POI, it would be helpful.

XWPFDocument document = new XWPFDocument(OPCPackage.open(fis));
List<XWPFParagraph> paragraphs = document.getParagraphs();
for (XWPFParagraph paragraph : paragraphs) {
    // Select the content of DrawingML (wps) and legacy VML (v) text boxes in this paragraph.
    XmlObject[] textBoxObjects = paragraph.getCTP().selectPath(
        "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' " +
        "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' " +
        "declare namespace v='urn:schemas-microsoft-com:vml' " +
        ".//*/wps:txbx/w:txbxContent | .//*/v:textbox/w:txbxContent");
    for (int i = 0; i < textBoxObjects.length; i++) {
        try {
            // Each w:p child of w:txbxContent is one paragraph inside the text box.
            XmlObject[] paraObjects = textBoxObjects[i].selectChildren(
                new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p"));
            for (int j = 0; j < paraObjects.length; j++) {
                XWPFParagraph embeddedPara = new XWPFParagraph(
                    CTP.Factory.parse(paraObjects[j].xmlText()), paragraph.getBody());
                System.out.println(embeddedPara.getText());
            }
        } catch (XmlException e) {
            e.printStackTrace();
        }
    }
}

With the above code I can extract the text from the text boxes, but I also need the position and dimensions of the text boxes to proceed.
If there is a way to extract that information, it would be helpful. Example screenshot of a Word document containing text boxes

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" mc:Ignorable="w14 wp14 w15"><w:body><w:p><w:pPr><w:pStyle w:val="Normal"/><w:bidi w:val="0"/><w:jc w:val="left"/><w:rPr></w:rPr></w:pPr><w:r><w:rPr></w:rPr><mc:AlternateContent><mc:Choice Requires="wps"><w:drawing><wp:anchor behindDoc="0" distT="0" distB="0" distL="0" distR="0" simplePos="0" locked="0" layoutInCell="1" allowOverlap="1" relativeHeight="2"><wp:simplePos x="0" y="0"/><wp:positionH relativeFrom="column"><wp:posOffset>457200</wp:posOffset></wp:positionH><wp:positionV relativeFrom="paragraph"><wp:posOffset>565150</wp:posOffset></wp:positionV><wp:extent cx="1664970" cy="1148715"/><wp:effectExtent l="0" t="0" r="0" b="0"/><wp:wrapNone/><wp:docPr id="1" name="Text Frame 1"></wp:docPr><a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"><wps:wsp><wps:cNvSpPr/><wps:spPr><a:xfrm><a:off x="0" y="0"/><a:ext cx="1665000" cy="1148760"/></a:xfrm><a:prstGeom prst="rect"><a:avLst></a:avLst></a:prstGeom><a:noFill/><a:ln w="0"><a:noFill/></a:ln></wps:spPr><wps:style><a:lnRef idx="0"></a:lnRef><a:fillRef idx="0"/><a:effectRef idx="0"></a:effectRef><a:fontRef 
idx="minor"/></wps:style><wps:txbx><w:txbxContent><w:p><w:pPr><w:pStyle w:val="FrameContents"/><w:overflowPunct w:val="false"/><w:bidi w:val="0"/><w:rPr><w:color w:val="000000"/></w:rPr></w:pPr><w:r><w:rPr><w:color w:val="000000"/></w:rPr><w:t>Text 1</w:t></w:r></w:p><w:p><w:pPr><w:pStyle w:val="FrameContents"/><w:overflowPunct w:val="false"/><w:bidi w:val="0"/><w:rPr><w:color w:val="000000"/></w:rPr></w:pPr><w:r><w:rPr><w:color w:val="000000"/></w:rPr></w:r></w:p></w:txbxContent></wps:txbx><wps:bodyPr lIns="0" rIns="0" tIns="0" bIns="0" anchor="t"><a:noAutofit/></wps:bodyPr></wps:wsp></a:graphicData></a:graphic></wp:anchor></w:drawing></mc:Choice>

In the above XML there are posOffset values inside the anchor element. I was hoping I could retrieve those values to find the position of any drawing object in a Word document. The <wp:extent> element holds the dimensions of the drawing element. That is what I am trying to extract using Apache POI. Thanks in advance.


Solution

  • Apache POI XWPF does not support any kind of shapes in XWPFDocument and its body elements. That is also true for text box shapes.

    But, of course, the information is all in the XML behind XWPFDocument. Therefore, if one knows which XML elements have what special meaning, then one can get all information from this XML using XML methods.

    In "How to get the drawings from the apache POI XWPFDocument?" I have already shown how to get all the drawings from an Apache POI XWPFDocument. The next problem is how to get the contents out of the drawing elements. A method to get CTTxbxContent is in the linked answer, but other methods are needed to get the shape dimensions.

    It does not make it easier that Microsoft decided to use many different strange length measurement units. Sometimes twips (twentieths of a point, that is 1/1440 inch) are used, sometimes EMU (English Metric Units), sometimes half points, eighths of a point, and so on … Consistently, real metric units, such as parts of a meter (cm, mm, ..), are never used. Welcome to the 21st century.
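    For reference, the ratios between these units are fixed: 914400 EMU per inch, 12700 EMU per point, 360000 EMU per centimeter and 635 EMU per twip. A minimal sketch of a conversion helper could look like this. The class EmuUnits and its method names are made up for this illustration and are not part of Apache POI (POI's own org.apache.poi.util.Units class covers similar ground).

```java
// Hypothetical helper for illustration; the names are not part of Apache POI.
final class EmuUnits {

    // Fixed ratios: 1 inch = 72 points = 1440 twips = 2.54 cm = 914400 EMU.
    static final long EMU_PER_INCH = 914_400L;
    static final long EMU_PER_POINT = 12_700L;
    static final long EMU_PER_CENTIMETER = 360_000L;
    static final long EMU_PER_TWIP = 635L;

    static double toInches(long emu) {
        return (double) emu / EMU_PER_INCH;
    }

    static double toPoints(long emu) {
        return (double) emu / EMU_PER_POINT;
    }

    static double toCentimeters(long emu) {
        return (double) emu / EMU_PER_CENTIMETER;
    }

    static double toTwips(long emu) {
        return (double) emu / EMU_PER_TWIP;
    }

    public static void main(String[] args) {
        // cx and cy as found in the wp:extent element of the XML in the question.
        long cx = 1_664_970L;
        long cy = 1_148_715L;
        System.out.println("width:  " + toCentimeters(cx) + " cm, " + toPoints(cx) + " pt");
        System.out.println("height: " + toCentimeters(cy) + " cm, " + toPoints(cy) + " pt");
    }
}
```

    The values are kept as long because wp:extent stores cx and cy as whole numbers of EMU. Converting to double only at the very end avoids rounding in between.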

    The following code shows methods to get the width and height from the shape extent stored in the XML, either for inline shapes or for anchored shapes.

    import java.io.FileInputStream;
    
    import org.apache.poi.xwpf.usermodel.*;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
    import org.apache.poi.util.Units;
    
    import org.apache.xmlbeans.XmlObject;
    import org.apache.xmlbeans.XmlCursor;
    
    import java.util.List;
    import java.util.ArrayList;
    
    public class WordGetAllDrawingsFromRuns {
    
     private static List<CTDrawing> getAllDrawings(XWPFRun run) throws Exception {
      CTR ctR = run.getCTR();
      XmlCursor cursor = ctR.newCursor();
      cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//w:drawing");
      List<CTDrawing> drawings = new ArrayList<CTDrawing>();
      while (cursor.hasNextSelection()) {
       cursor.toNextSelection();
       XmlObject obj = cursor.getObject();
       CTDrawing drawing = CTDrawing.Factory.parse(obj.newInputStream());
       drawings.add(drawing);
      }
      return drawings;
     }
      
     private static String getTextBoxContent(CTDrawing drawing) {
      StringBuilder result = new StringBuilder();
      XmlCursor cursor = drawing.newCursor();
      cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//w:txbxContent");
      while (cursor.hasNextSelection()) {
       cursor.toNextSelection();
       result.append(cursor.getTextValue());
      }
      return result.toString();
     }
     
     private static Integer getAnchorExtentWidthInEMU(CTDrawing drawing) {
      Integer result = null;
      XmlCursor cursor = drawing.newCursor();
      cursor.selectPath("declare namespace wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' .//wp:anchor/wp:extent");
      while (cursor.hasNextSelection()) {
       cursor.toNextSelection();
       String cx = cursor.getAttributeText(new javax.xml.namespace.QName("cx"));
       result = Integer.valueOf(cx);
      }
      return result;
     }
     
     private static Integer getInlineExtentWidthInEMU(CTDrawing drawing) {
      Integer result = null;
      XmlCursor cursor = drawing.newCursor();
      cursor.selectPath("declare namespace wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' .//wp:inline/wp:extent");
      while (cursor.hasNextSelection()) {
       cursor.toNextSelection();
       String cx = cursor.getAttributeText(new javax.xml.namespace.QName("cx"));
       result = Integer.valueOf(cx);
      }
      return result;
     }
     
     private static Integer getAnchorExtentHeightInEMU(CTDrawing drawing) {
      Integer result = null;
      XmlCursor cursor = drawing.newCursor();
      cursor.selectPath("declare namespace wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' .//wp:anchor/wp:extent");
      while (cursor.hasNextSelection()) {
       cursor.toNextSelection();
       String cy = cursor.getAttributeText(new javax.xml.namespace.QName("cy"));
       result = Integer.valueOf(cy);
      }
      return result;
     }
     
     private static Integer getInlineExtentHeightInEMU(CTDrawing drawing) {
      Integer result = null;
      XmlCursor cursor = drawing.newCursor();
      cursor.selectPath("declare namespace wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' .//wp:inline/wp:extent");
      while (cursor.hasNextSelection()) {
       cursor.toNextSelection();
       String cy = cursor.getAttributeText(new javax.xml.namespace.QName("cy"));
       result = Integer.valueOf(cy);
      }
      return result;
     }
     
     public static void main(String[] args) throws Exception {
    
      XWPFDocument document = new XWPFDocument(new FileInputStream("WordContainingTextBoxes.docx"));
    
      for (IBodyElement bodyElement : document.getBodyElements()) {
       if (bodyElement instanceof XWPFParagraph) {
        XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
        for(IRunElement runElement : paragraph.getIRuns()) {
         if (runElement instanceof XWPFRun) {
          XWPFRun run = (XWPFRun) runElement;
          List<CTDrawing> drawings = getAllDrawings(run);
          for (CTDrawing drawing : drawings) {
           String textBoxContent = getTextBoxContent(drawing);
           System.out.println("text box content: " + textBoxContent);
           Integer width = getAnchorExtentWidthInEMU(drawing);
           if (width == null) width = getInlineExtentWidthInEMU(drawing);
           System.out.println("anchor or inline width: " + (double)width/Units.EMU_PER_INCH + " inch");
           Integer height = getAnchorExtentHeightInEMU(drawing);
           if (height == null) height = getInlineExtentHeightInEMU(drawing);
           System.out.println("anchor or inline height: " + (double)height/Units.EMU_PER_INCH + " inch");
          }
         }
        }
       }
      }
    
      document.close();
     }
    }
    

    Getting the position of the shapes is much more expensive. The first question is: position in relation to what? Inline shapes act as characters in the text flow, so their position depends on the text before the inline shape, including font size, character spacing, line spacing, automatic line wraps and so on. Anchored shapes, too, can be anchored to different text and page elements: to a paragraph, to the page borders, to the margins, … I am not able to show code for all of this here. Do you see what you pay for if you pay for Aspose.Words?
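    For the simple case of an anchored shape, the raw offsets can at least be read from the XML. As in the XML from the question, wp:positionH and wp:positionV each carry a relativeFrom attribute and usually a wp:posOffset child in EMU. The sketch below uses only the JDK's DOM parser instead of XMLBeans, and works on a reduced copy of that anchor fragment. The class AnchorPositionSketch and all of its members are hypothetical names. The assumption is that the string returned by drawing.xmlText() could be passed in the same way. This gives only the stored offset, not the rendered position on the page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; none of these names come from Apache POI.
final class AnchorPositionSketch {

    static final String WP_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing";

    // Anchor fragment reduced from the XML in the question.
    static final String SAMPLE =
        "<wp:anchor xmlns:wp='" + WP_NS + "'>"
        + "<wp:positionH relativeFrom='column'><wp:posOffset>457200</wp:posOffset></wp:positionH>"
        + "<wp:positionV relativeFrom='paragraph'><wp:posOffset>565150</wp:posOffset></wp:positionV>"
        + "<wp:extent cx='1664970' cy='1148715'/>"
        + "</wp:anchor>";

    static final class Position {
        final String relativeFrom; // e.g. "column", "paragraph", "page", "margin"
        final Long offsetEmu;      // null if there is no wp:posOffset (e.g. wp:align is used)

        Position(String relativeFrom, Long offsetEmu) {
            this.relativeFrom = relativeFrom;
            this.offsetEmu = offsetEmu;
        }
    }

    // axis is "positionH" or "positionV"; returns null if the element is absent.
    static Position read(String xml, String axis) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagNameNS(WP_NS, axis);
            if (nodes.getLength() == 0) {
                return null;
            }
            Element pos = (Element) nodes.item(0);
            NodeList offsets = pos.getElementsByTagNameNS(WP_NS, "posOffset");
            Long offset = offsets.getLength() > 0
                ? Long.valueOf(offsets.item(0).getTextContent().trim())
                : null;
            return new Position(pos.getAttribute("relativeFrom"), offset);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse drawing XML", e);
        }
    }

    public static void main(String[] args) {
        Position h = read(SAMPLE, "positionH");
        Position v = read(SAMPLE, "positionV");
        System.out.println("horizontal: " + h.offsetEmu + " EMU relative to " + h.relativeFrom);
        System.out.println("vertical:   " + v.offsetEmu + " EMU relative to " + v.relativeFrom);
    }
}
```

    Turning such an offset into an absolute page position would still require knowing where the referenced column or paragraph ends up after layout, which is exactly the expensive part.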