docx4j

Namespace disappearing on save of cloned/edited docx, leading to "Unreadable content" message in Word


We use docx4j (docx4j-JAXB-MOXy v11.4.9) to add images to a docx, then pass it through another system (only-office) to edit it. If I load the docx, clone it, clear the body, copy some of the original content back in and then save it, one of the namespace declarations (now needed for images) is removed. When you open the result in Word, you get the "Unreadable content" message.

The saved document is missing this namespace declaration, which the original has at the top of document.xml:

xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"

Excerpt from document.xml showing how the wpg prefix is used where an image with a fallback is added:

<w:r>
  <w:rPr>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t xml:space="preserve">Image :</w:t>
  <mc:AlternateContent>
    <mc:Choice Requires="wpg">
      <w:drawing>
...
      </w:drawing>
    </mc:Choice>
    <mc:Fallback>
      <w:pict>
...
      </w:pict>
    </mc:Fallback>
  </mc:AlternateContent>
</w:r>

Example code to reproduce from an original containing that namespace:

InputStream mergedDocStream = new BufferedInputStream(new FileInputStream("c:\\tmp\\original.docx"), 8 * 1024 * 1024);
WordprocessingMLPackage mergedDoc = WordprocessingMLPackage.load(mergedDocStream);
mergedDocStream.close();
List<Object> mergedDocContentList = mergedDoc.getMainDocumentPart().getContent();
    
// get the body section (contains styles, orientation, headers/footers) which needs adding to each doc
SectPr finalSectionProperties = mergedDoc.getMainDocumentPart().getJaxbElement().getBody().getSectPr();
    
// Create a blank target using the merged file
ObjectFactory objectFactory = new ObjectFactory();
mergedDoc.getMainDocumentPart().getJaxbElement().setBody(objectFactory.createBody());
WordprocessingMLPackage outputDoc = (WordprocessingMLPackage) mergedDoc.clone();
    
// loop through the original doc's contents and add
for (Object content : mergedDocContentList) {
  outputDoc.getMainDocumentPart().addObject(content);
}
    
// add body section for original styles etc
outputDoc.getMainDocumentPart().getJaxbElement().getBody().setSectPr(finalSectionProperties);
    
// save last doc    
File outputFile = new File("c:\\tmp\\output.docx");
outputDoc.save(outputFile);

Is there a way we can convince it to keep that namespace?


Solution

  • By way of background, JAXB automatically declares required namespaces.

    The prefixes listed in mc:Choice/@Requires are different: although Word needs them to be declared, they aren't required in an XML spec sense, so JAXB has no reason to declare them on its own.

    So docx4j keeps track of these as they are encountered during unmarshalling (see Docx4jUnmarshallerListener). But if the @Requires attribute isn't present in the docx at that time, this can't be done.
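
    To illustrate the general point (this is only an analogy using the JDK's StAX writer, not JAXB or docx4j internals), the sketch below writes an mc:Choice element with Requires="wpg" and lets the writer declare namespaces by itself. The class and method names are made up for the example. The writer declares the mc prefix because an element uses it, but it has no way of knowing that the text "wpg" in an attribute value is meant as a prefix.

    ```java
    import java.io.StringWriter;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamWriter;

    // Hypothetical demo class; not part of docx4j.
    public class RequiresPrefixDemo {

        static final String MC_NS =
                "http://schemas.openxmlformats.org/markup-compatibility/2006";

        /** Writes an mc:Choice element with Requires="wpg", with automatic namespace declarations. */
        static String serializeChoice() {
            try {
                XMLOutputFactory factory = XMLOutputFactory.newInstance();
                // Ask the writer to emit any namespace declarations the elements need.
                factory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);

                StringWriter out = new StringWriter();
                XMLStreamWriter writer = factory.createXMLStreamWriter(out);
                writer.writeStartElement("mc", "Choice", MC_NS);
                // To the writer this is plain text, not a reference to a namespace prefix.
                writer.writeAttribute("Requires", "wpg");
                writer.writeEndElement();
                writer.close();
                return out.toString();
            } catch (XMLStreamException e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            String xml = serializeChoice();
            System.out.println(xml);
            System.out.println("declares mc:  " + xml.contains("xmlns:mc="));
            System.out.println("declares wpg: " + xml.contains("xmlns:wpg="));
        }
    }
    ```

    That gap is what docx4j's bookkeeping, and addMcChoiceNamespace below, are there to fill.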

    JaxbXmlPart contains:

    /**
     * Specify a namespace prefix (used in mc:Choice/@Requires) which
     * docx4j should declare on the top-level element of the part
     * (otherwise Microsoft Office won't be able to open the file).  
     * 
     * Specify a prefix (eg 'wpg') as opposed to the namespace itself.
     * 
     * This is often done automatically (see further McIgnorableNamespaceDeclarator),
     * but where it isn't, you should invoke this method directly
     * from your code.
     * 
     * @param mcChoiceNamespace
     */
    public void addMcChoiceNamespace(String mcChoiceNamespace) {
        this.mcChoiceNamespaces.add(mcChoiceNamespace);
    }
    

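    In your case the body is already empty when the package is cloned, so presumably the clone never encounters the @Requires attribute. Calling

    outputDoc.getMainDocumentPart().addMcChoiceNamespace("wpg");
    

    on the output package before you save it should do the trick.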
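
If you want to confirm the result without opening Word, a docx is a zip archive and the main part is normally stored as word/document.xml, so you can inspect the root element of the saved file directly. The following is a minimal sketch using only the JDK (Java 11+ for Path.of); the class and method names are hypothetical, and it assumes the main part lives at the usual word/document.xml path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; not part of docx4j.
public class DocxNamespaceCheck {

    /** Returns true if the root element of the given XML declares xmlns:prefix. */
    static boolean rootDeclaresPrefix(InputStream xml, String prefix) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the declarations on the root element are inspected.
                        for (int i = 0; i < reader.getNamespaceCount(); i++) {
                            if (prefix.equals(reader.getNamespacePrefix(i))) {
                                return true;
                            }
                        }
                        return false;
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Opens the docx as a zip and checks the root of word/document.xml. */
    static boolean docxDeclaresPrefix(Path docx, String prefix) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml"); // assumed default location
            if (entry == null) {
                throw new IllegalArgumentException("No word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return rootDeclaresPrefix(in, prefix);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path docx = Path.of(args[0]);
        System.out.println("wpg declared on root: " + docxDeclaresPrefix(docx, "wpg"));
    }
}
```

Run it against both the original and the saved output, for example with c:\tmp\output.docx as the argument, to compare what each declares.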