
How to preserve the HTML when creating a .docx with docx4j?


I started using docx4j today.

I've successfully created a document with a table, populated with content from an external source.

This content contains simple HTML. For example, a cell may contain a String like:

String content = "Hello&nbsp;<strong>Word</strong><br>";

If I put this String in the cell with the createParagraphOfText() method:

Tc tableCell = factory.createTc();    
tableCell.getContent().add(
    wordMLPackage.getMainDocumentPart().createParagraphOfText(content)
);
tableRow.getContent().add(tableCell);

it is rendered as-is in the Word document (as expected):

Hello&nbsp;<strong>Word</strong><br>

What I'm trying to achieve is to have the HTML rendered in the document, so that I get the following output (with "Word" in bold):

Hello Word


I've searched Stack Overflow and the Web and tried almost all of the examples I found, but the information is quite fragmented, and before digging deeper I would like to know at least whether I'm heading in the right direction.

I've added the docx4j-ImportXHTML jar to Maven, but the docs state that the content must be well-formed XHTML, while I only have a bunch of text and HTML mixed together.

Also, many of the (few) examples using it take an existing XML file and convert it to docx, while I'm fine with creating the docx manually and only need to render a single String containing HTML. Is that possible with this module?

I've also seen that there are other docx4j modules (e.g. xhtmlrenderer), but I'm not sure which is the right one.

Does anyone know the right procedure for adding chunks of HTML to a table cell during an iteration?


Solution

  • You have a choice to make: convert the HTML yourself, or let Word do it.

    Doing it yourself gives you greater control, and means downstream processing (e.g. converting to PDF) will work without having to open the docx in Word first.

    Letting Word do it is the AlternativeFormatInputPart (altChunk) approach.

    My advice would be to do it yourself if you can, and I'd suggest you use docx4j-ImportXHTML for that.

    "I've added the docx4j-ImportXHTML jar to Maven, but in the docs it states that the content must be a well-formed XHTML, while I have only a bunch of text and HTML mixed together."

    You can use one of the "tidy" libraries to convert your HTML to XHTML. Since there are quite a few of these, we leave the choice of library and its configuration up to you.
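
    To make "well-formed XHTML" concrete, here is a minimal sketch that uses only the JDK. It is not a tidy library and not part of docx4j: the class XhtmlFragmentSketch and its methods are hypothetical names made up for illustration, and the regex only copes with trivial fragments like the one in the question (an &nbsp; entity and unclosed void elements such as <br>). For anything messier, a real tidy library is the safer route.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only; not a docx4j API.
class XhtmlFragmentSketch {

    // Void elements that simple HTML often leaves unclosed (assumed list).
    private static final String VOID_TAGS = "br|hr|img";

    /** Turns a trivial HTML fragment into well-formed XHTML wrapped in a div. */
    static String toXhtml(String html) {
        String fixed = html
            // &nbsp; is not a predefined XML entity; use the numeric reference
            .replace("&nbsp;", "&#160;")
            // self-close void elements: <br> becomes <br/>
            .replaceAll("(?i)<(" + VOID_TAGS + ")(\\s[^<>]*?)?\\s*/?>", "<$1$2/>");
        // a single root element keeps the fragment parseable as XML
        return "<div>" + fixed + "</div>";
    }

    /** Returns true if the string parses as XML, i.e. is well-formed. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String content = "Hello&nbsp;<strong>Word</strong><br>";
        String xhtml = toXhtml(content);
        // prints <div>Hello&#160;<strong>Word</strong><br/></div>
        System.out.println(xhtml);
        System.out.println(isWellFormed(xhtml)); // prints true
    }
}
```

    The raw String from the question fails the isWellFormed check for two reasons: the undeclared &nbsp; entity and the unclosed <br>. Those are the kinds of things a tidy step has to repair before the importer sees the content.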

    "only need to render a single String containing HTML. Is it possible with this module ?"

    Yes. The ConvertInXHTMLFragment.java sample is an example.
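
    Applied to the table cell from the question, the same idea might look like the sketch below. It assumes docx4j and docx4j-ImportXHTML are on the classpath and that the input is already well-formed XHTML. The class HtmlCellSketch and the method htmlCell are hypothetical names; XHTMLImporterImpl and its convert(String, String) method come from docx4j-ImportXHTML.

```java
import java.util.List;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.ObjectFactory;
import org.docx4j.wml.Tc;

// Hypothetical helper, for illustration only.
class HtmlCellSketch {

    /**
     * Builds a table cell whose content is the rendered XHTML fragment.
     * The xhtml argument must be well-formed, e.g.
     * "<div>Hello&#160;<strong>Word</strong><br/></div>".
     */
    static Tc htmlCell(WordprocessingMLPackage wordMLPackage,
                       ObjectFactory factory,
                       String xhtml) throws Exception {
        // The importer is bound to the package the content will live in
        XHTMLImporterImpl importer = new XHTMLImporterImpl(wordMLPackage);

        // Second argument is a base URL for resolving relative links;
        // null is used here because the fragment has none
        List<Object> converted = importer.convert(xhtml, null);

        Tc tableCell = factory.createTc();
        tableCell.getContent().addAll(converted);

        // A cell needs at least one paragraph; add an empty one
        // if the fragment produced no content
        if (tableCell.getContent().isEmpty()) {
            tableCell.getContent().add(factory.createP());
        }
        return tableCell;
    }
}
```

    Inside the loop, tableRow.getContent().add(htmlCell(wordMLPackage, factory, xhtml)) would then take the place of the createParagraphOfText() call.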

    "I've also seen that there are other docx4j modules (eg. xhtmlrenderer), but I'm not sure about which is the good one."

    docx4j-ImportXHTML depends on xhtmlrenderer, so it is a dependency rather than an alternative.