
Issue loading .XML files with docx4j in Java


I cannot load even the sample word2003xml.xml that docx4j ships for tests in docx4j-samples-docx4j-8.3.1.zip, found here https://www.docx4java.org/downloads.html

I tried loading the file with two different overloads of the static load method, but the result is the same.

WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new FileInputStream(new File("C:\\Mine\\project4tests\\word2003xml.xml")));
WordprocessingMLPackage wordMLPackage2 = WordprocessingMLPackage.load(new java.io.File("C:\\Mine\\project4tests\\word2003xml.xml"));

Here is the exception that I am getting:

Exception in thread "main" org.docx4j.openpackaging.exceptions.Docx4JException: Couldn't load xml from stream 
    at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:641)
    at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:418)
    at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:376)
    at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:341)
    at org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(WordprocessingMLPackage.java:182)
    at Main.main(Main.java:13)
Caused by: javax.xml.bind.UnmarshalException
  with linked exception:
[com.sun.istack.internal.SAXParseException2; lineNumber: 3; columnNumber: 827; unexpected element (uri:"http://schemas.microsoft.com/office/word/2003/wordml", local:"wordDocument"). Expected elements are <{http://schemas.microsoft.com/office/2006/xmlPackage}package>,<{http://schemas.microsoft.com/office/2006/xmlPackage}part>,<{http://schemas.microsoft.com/office/2006/xmlPackage}xmlData>]
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.handleStreamException(UnmarshallerImpl.java:468)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:402)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:371)
    at org.docx4j.convert.in.FlatOpcXmlImporter.<init>(FlatOpcXmlImporter.java:132)
    at org.docx4j.openpackaging.packages.OpcPackage.load(OpcPackage.java:638)
    ... 5 more
Caused by: com.sun.istack.internal.SAXParseException2; lineNumber: 3; columnNumber: 827; unexpected element (uri:"http://schemas.microsoft.com/office/word/2003/wordml", local:"wordDocument"). Expected elements are <{http://schemas.microsoft.com/office/2006/xmlPackage}package>,<{http://schemas.microsoft.com/office/2006/xmlPackage}part>,<{http://schemas.microsoft.com/office/2006/xmlPackage}xmlData>
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.handleEvent(UnmarshallingContext.java:726)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.Loader.reportError(Loader.java:247)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.Loader.reportError(Loader.java:242)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.Loader.reportUnexpectedChildElement(Loader.java:109)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext$DefaultRootLoader.childElement(UnmarshallingContext.java:1131)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:556)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:538)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:60)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXStreamConnector.handleStartElement(StAXStreamConnector.java:231)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXStreamConnector.bridge(StAXStreamConnector.java:165)
    at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:400)
    ... 8 more
Caused by: javax.xml.bind.UnmarshalException: unexpected element (uri:"http://schemas.microsoft.com/office/word/2003/wordml", local:"wordDocument"). Expected elements are <{http://schemas.microsoft.com/office/2006/xmlPackage}package>,<{http://schemas.microsoft.com/office/2006/xmlPackage}part>,<{http://schemas.microsoft.com/office/2006/xmlPackage}xmlData>
    ... 19 more

There is no issue loading a .DOCX file. However, what I need the docx4j library for is to convert an old .DOC file (really WordprocessingML, so more like an .XML file) into a .DOCX, similar to what is done here https://coderanch.com/t/721499/java/Word-XML-DOCX

Does anybody know why I cannot load the file properly?


Solution

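  • See https://github.com/plutext/docx4j/blob/master/docx4j-core/src/main/java/org/docx4j/convert/in/word2003xml/Word2003XmlConverter.java for 2003 XML files. When the input is plain XML, WordprocessingMLPackage.load treats it as Flat OPC (note FlatOpcXmlImporter in the stack trace), which is why the wordDocument root element is rejected.

    Note that .doc is the old binary format; it's not XML, it is something different again.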
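
If it helps to confirm which kind of XML a given file is before choosing a code path, here is a minimal sketch that reads only the root element with the JDK's StAX parser. The class name RootElementCheck and its Format enum are made up for illustration and are not part of docx4j; the two namespace URIs are copied from the exception message above. It does not call Word2003XmlConverter itself, so check the linked source for that class's actual constructor and methods.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of docx4j): classifies XML by its root element namespace. */
public class RootElementCheck {

    public enum Format { WORD_2003_XML, FLAT_OPC, OTHER }

    // Namespace URIs taken from the UnmarshalException message in the question.
    static final String WORD_2003_NS = "http://schemas.microsoft.com/office/word/2003/wordml";
    static final String FLAT_OPC_NS = "http://schemas.microsoft.com/office/2006/xmlPackage";

    public static Format classify(InputStream in) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                // Skip the XML declaration, processing instructions and comments
                // until the first (root) element is reached.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String ns = reader.getNamespaceURI();
                        if (WORD_2003_NS.equals(ns)) return Format.WORD_2003_XML;
                        if (FLAT_OPC_NS.equals(ns)) return Format.FLAT_OPC;
                        return Format.OTHER;
                    }
                }
                return Format.OTHER;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java RootElementCheck <file.xml>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(classify(in));
        }
    }
}
```

A file that classifies as WORD_2003_XML is the case the failing load call cannot handle and the one the converter linked above is meant for; FLAT_OPC is the format the exception says was expected.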