java ms-word apache-poi

Convert Microsoft Word docx to html using Apache POI


I have to convert Word docx files to HTML, keeping the formatting exactly as if Word itself had done the conversion. The code will run on a production server where no office suite (Microsoft Office, OpenOffice, LibreOffice, ...) may be installed.

I'm using Apache POI.

So far I can get all the elements of the docx document. The XWPFParagraph exposes the style used in the Microsoft Word document. I can read the font family, font size and other values from the paragraph object, but some definitions in the style appear neither on the paragraph nor on its runs.

I searched the internet and this site, and the suggestions are to use the properties of the paragraph and run objects. But I cannot find, for example, bold formatting that is defined in the style and not on the paragraph or the run.

I tried several libraries (DocConv, iText, Docx4j, JRTF, OpenSagres, PDFtoHTML, ...), but the conversion does not have the quality we need.

I'm also working on unzipping the docx to XML and parsing it with SAX, but I think it would be better if I could do it all with POI.

How can I get the list of styles and their properties (name, alignment, font family, font size, ...) using POI?

Thanks for any help.

My code so far is this:

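import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class Start2 {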

    public static void main(String[] args) throws InvalidFormatException, IOException {
        String inputDoc = "D:/TestOutput/OriginalDoc.docx";
        String styleP, styleR, auxTexto;
        
        boolean div = false;
        
        String htmlStart = "<html>\r\n";
        // The declared charset must match the UTF-8 encoding used by the writer below.
        String htmlHead = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>\r\n" +
                        "<meta name='viewport' content='width=device-width, initial-scale=1.0'>\r\n";       
        String htmlFilename = "D:/TestOutput/TestOutput.html";
        File htmlFile = new File(htmlFilename);
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(htmlFile, false), StandardCharsets.UTF_8));
        
        FileInputStream fis = new FileInputStream(inputDoc);
        XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
        List<XWPFParagraph> paragraphs = xdoc.getParagraphs();
        
        bw.write(htmlStart + "\r\n");
        bw.write("<head>\r\n");
        bw.write(htmlHead);
        bw.write("<style>\r\n");
        bw.write("html {font-family:Arial;}\r\n");  // margin-left: 30px; margin-right: 30px; 
        
        // ???
        
        bw.write("</style>\r\n");
        bw.write("</head>\r\n");
        bw.write("<body>\r\n");
        
        for(XWPFParagraph para : paragraphs) {
            if (para.getStyle() != null) {
                bw.write("<p class='" + para.getStyle() + "'");
                div = true;
            } else {
                bw.write("<p");
                div = false;
            }
            styleP = "style='";
            if (para.getFirstLineIndent() > 0) {
                // POI reports the indent in twentieths of a point; CSS needs an explicit unit.
                styleP+= "text-indent:" + (para.getFirstLineIndent() / 20.0) + "pt;";
            }
            
            if(para.getAlignment() != null) {
                auxTexto = para.getAlignment().name().toLowerCase();
                if (auxTexto.equalsIgnoreCase("both")) {
                    auxTexto = "justify";
                }
                styleP+= "text-align: " + auxTexto + ";";
            }
            
            if (!"style='".equals(styleP)) {
                bw.write(" " + styleP + "'>");
            } else {
                bw.write(">");
            }
            
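            for(XWPFRun run : para.getRuns()) {
                
                // Collect the run-level formatting that POI exposes directly.
                styleR = "";
                if (run.getFontFamily() != null) {
                    styleR+= "font-family:" + run.getFontFamily() + ";";
                }
                
                // Escape HTML metacharacters so document text cannot break the markup.
                auxTexto = run.text().replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
                if (styleR.isEmpty()) {
                    bw.write(auxTexto);
                } else {
                    // Previously styleR was built but never written; wrap the run in a span.
                    bw.write("<span style='" + styleR + "'>" + auxTexto + "</span>");
                }
//                System.out.println(run.text() + " Images: " + run.getEmbeddedPictures().size());
            }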
            if (div) {
                bw.write("</p>\r\n");
            } else {
                bw.write("</p>");
            }
        }
        bw.write("</body>\r\n");
        bw.write("</html>");
        bw.close();
        System.out.println("END PGM");
    }
    
}

Solution

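  • You can access Word styles in Apache POI through the XWPFStyles object returned by XWPFDocument.getStyles(). Look up a paragraph's style with getStyle(styleId), then follow getBasisStyleID() to reach properties (such as bold) that are inherited from a parent style and therefore do not show up on the paragraph or its runs.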
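
A minimal sketch of this approach, assuming Apache POI (poi-ooxml) is on the classpath. To run without a .docx file it builds two hypothetical styles in memory (DemoBase and DemoChild, which are not from the original document), then resolves bold, font size, font family and alignment by walking the basedOn chain and reading the raw style XML with XPath. The class and method names (StyleToCss, chain, lookup, toCss, demo) are illustrative only, and document defaults (w:docDefaults) and Word's toggle-property rules are not handled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.namespace.QName;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFStyle;
import org.apache.poi.xwpf.usermodel.XWPFStyles;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTStyle;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STJc;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STStyleType;

public class StyleToCss {
    private static final String NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    private static final String DECL = "declare namespace w='" + NS + "' ";

    /** The style itself followed by its ancestors, following w:basedOn. */
    static List<XWPFStyle> chain(XWPFStyles styles, String styleId) {
        List<XWPFStyle> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();               // guards against cyclic basedOn references
        XWPFStyle style = styleId == null ? null : styles.getStyle(styleId);
        while (style != null && seen.add(style.getStyleId())) {
            result.add(style);
            String parentId = style.getBasisStyleID();
            style = parentId == null ? null : styles.getStyle(parentId);
        }
        return result;
    }

    /** Attribute of the nearest element matching path: "" if the attribute is missing, null if no style has the element. */
    static String lookup(List<XWPFStyle> chain, String path, String attribute) {
        for (XWPFStyle style : chain) {
            XmlObject[] hits = style.getCTStyle().selectPath(DECL + path);
            if (hits.length > 0) {
                XmlCursor cursor = hits[0].newCursor();
                try {
                    String value = cursor.getAttributeText(new QName(NS, attribute));
                    return value == null ? "" : value;
                } finally {
                    cursor.dispose();
                }
            }
        }
        return null;
    }

    /** Builds CSS declarations for one style ID, including inherited properties. */
    static String toCss(XWPFStyles styles, String styleId) {
        List<XWPFStyle> chain = chain(styles, styleId);
        StringBuilder css = new StringBuilder();
        String bold = lookup(chain, "./w:rPr/w:b", "val");
        // <w:b/> without w:val means "on"; an explicit false/0/off switches it off.
        if (bold != null && !bold.equals("false") && !bold.equals("0") && !bold.equals("off")) {
            css.append("font-weight:bold;");
        }
        String size = lookup(chain, "./w:rPr/w:sz", "val");   // stored in half-points
        if (size != null && !size.isEmpty()) {
            css.append("font-size:").append(Integer.parseInt(size) / 2.0).append("pt;");
        }
        String font = lookup(chain, "./w:rPr/w:rFonts", "ascii");
        if (font != null && !font.isEmpty()) {
            css.append("font-family:'").append(font).append("';");
        }
        String align = lookup(chain, "./w:pPr/w:jc", "val");
        if (align != null && !align.isEmpty()) {
            css.append("text-align:").append(align.equals("both") ? "justify" : align).append(";");
        }
        return css.toString();
    }

    /** Hypothetical sample: DemoChild sets only the size and inherits bold and alignment from DemoBase. */
    static String demo() {
        try (XWPFDocument doc = new XWPFDocument()) {
            XWPFStyles styles = doc.createStyles();
            CTStyle base = CTStyle.Factory.newInstance();
            base.setStyleId("DemoBase");
            base.setType(STStyleType.PARAGRAPH);
            base.addNewName().setVal("Demo Base");
            base.addNewPPr().addNewJc().setVal(STJc.BOTH);
            base.addNewRPr().addNewB();
            styles.addStyle(new XWPFStyle(base));
            CTStyle child = CTStyle.Factory.newInstance();
            child.setStyleId("DemoChild");
            child.setType(STStyleType.PARAGRAPH);
            child.addNewName().setVal("Demo Child");
            child.addNewBasedOn().setVal("DemoBase");
            child.addNewRPr().addNewSz().setVal(BigInteger.valueOf(28));
            styles.addStyle(new XWPFStyle(child));
            XWPFParagraph para = doc.createParagraph();
            para.setStyle("DemoChild");
            return "p." + para.getStyleID() + " {" + toCss(styles, para.getStyleID()) + "}";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a real document, a call such as toCss(xdoc.getStyles(), para.getStyle()) could fill the // ??? placeholder inside the <style> block of the question's code, emitting one CSS rule per style ID in use.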