java xhtml docx docx4j wordml

docx4j: replacing a placeholder with (X)HTML converted to WordML shows raw WordML markup in the resulting document


I have code that replaces placeholders like ${NAME} with plain text. I use docx4j and docx4j-search-and-replace-util to replace the placeholders.

It works fine, but now, for one of the fields, "APP_ADDITIONAL_INFO", I need to replace the placeholder with simply formatted HTML from the Quill editor, such as: <p><strong>Header</strong></p><p><strong>Text string</strong></p> The resulting .docx document contains this HTML verbatim instead of formatted text for this field.

I looked into this and concluded that I need docx4j-ImportXHTML and JTidy. With JTidy I can convert my HTML to XHTML, and ImportXHTML then converts the XHTML into WordML.

But now the resulting .docx shows the full WordML markup instead of formatted text, starting with

<w:document xmlns:dsp="http://schemas.microsoft.com/office/drawing/2008/diagram" and so on, like <w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/><w:b/><w:i w:val="false"/><w:color w:val="000000"/><w:sz w:val="22"/></w:rPr><w:t>Formatted text string</w:t></w:r>

So, where am I going wrong?

My code is:

    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
    XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    
    // JTidy reads the raw HTML from this reader and writes well-formed XML to the writer
    BufferedReader br = new BufferedReader(new StringReader(doc.getDescription()));

    StringWriter sw = new StringWriter();
    Tidy t = new Tidy();
    t.setDropEmptyParas(true);
    t.setShowWarnings(false); // suppress warning messages
    t.setQuiet(true); // suppress JTidy's summary output
    t.setUpperCaseAttrs(false);
    t.setUpperCaseTags(false);
    t.setInputEncoding("UTF-8");
    t.setOutputEncoding("UTF-8");
    t.setXmlOut(true); // emit well-formed XML
    t.parse(br, sw);
    String strClean = sw.toString();
    br.close();
    sw.close();

    wordMLPackage.getMainDocumentPart().getContent().addAll(XHTMLImporter.convert( strClean, null) );
    // the variable that should contain WordML markup
    String description = XmlUtils.marshaltoString(wordMLPackage.getMainDocumentPart().getJaxbElement(), true, true);
    
    // map with placeholders and replacing data
    Map<String, String> replaceMap = new HashMap<String, String>() {{
        put("${APP_EMPLOYEE}",              doc.getEmployeeName());
        put("${APP_JOB_TITLE}",             doc.getJobtitle());
        put("${APP_ADDITIONAL_INFO}",       description);
    }};

    byte[] cos = gt.generateDocXDocument(filePath, replaceMap, masterId);

    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=\"test.docx\"").body(cos);

The generateDocXDocument method:

    byte[] generateDocXDocument(String filePath, Map<String, String> replaceMap) throws Exception {
      byte[] decryptedBytesOfFile = storageService.loadFile(filePath);
      WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new ByteArrayInputStream(decryptedBytesOfFile));
      // replaces each placeholder with the mapped string, inserted as plain text
      Docx4JSRUtil.searchAndReplace(wordMLPackage, replaceMap);
      ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
      Save saver = new Save(wordMLPackage);
      saver.save(outputStream);
      return outputStream.toByteArray();
    }

I even tried skipping docx4j-ImportXHTML and JTidy and simply replacing the placeholder with WordML markup, like:

    put("${APP_ADDITIONAL_INFO}", "<w:r><w:rPr><w:rFonts w:ascii=\"Times New Roman\" w:hAnsi=\"Times New Roman\"/><w:b/><w:i w:val=\"false\"/><w:color w:val=\"000000\"/><w:sz w:val=\"22\"/></w:rPr><w:t>Formatted text</w:t></w:r>");

but the result is the same: the resulting .docx file contains this markup.


Solution

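  • As you've found, that's not going to work. You can't replace text with markup that way: the search-and-replace utility inserts the replacement as text, so any markup in it shows up literally in the document.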

    A better approach is to use content control data binding; see "Replace a content control with an HTML value while generate document using docx4J".

    https://github.com/plutext/docx4j/blob/VERSION_11_5_0/docx4j-core/src/main/java/org/docx4j/model/datastorage/migration/FromVariableReplacement.java can be used to convert placeholders/variables to content controls.
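
To illustrate why the string replacement shows raw tags, here is a minimal sketch that uses only the JDK's DOM API, not docx4j. It assumes the search-and-replace utility ends up setting the replacement as the text value of a w:t element, which matches the behaviour described in the question. The class and method names (MarkupVersusText, replaceAsText, replaceAsNode) are hypothetical and exist only for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: plain JDK DOM, standing in for the WordML tree.
class MarkupVersusText {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A paragraph whose only run holds the placeholder.
    static final String PARAGRAPH = "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t>${APP_ADDITIONAL_INFO}</w:t></w:r></w:p>";

    // The formatted run we would like to end up in the document.
    static final String RUN = "<w:r xmlns:w=\"" + W_NS + "\">"
            + "<w:rPr><w:b/></w:rPr><w:t>Formatted text</w:t></w:r>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    static String serialize(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    // Text replacement: the markup becomes the character content of w:t,
    // so the serializer escapes it and Word displays the tags literally.
    static String replaceAsText() {
        Document doc = parse(PARAGRAPH);
        Element text = (Element) doc.getElementsByTagNameNS(W_NS, "t").item(0);
        text.setTextContent(RUN);
        return serialize(doc);
    }

    // Structural replacement: the markup is parsed into nodes and swapped in
    // for the run that held the placeholder, so it stays real markup.
    static String replaceAsNode() {
        Document doc = parse(PARAGRAPH);
        Node oldRun = doc.getElementsByTagNameNS(W_NS, "r").item(0);
        Node newRun = doc.importNode(parse(RUN).getDocumentElement(), true);
        oldRun.getParentNode().replaceChild(newRun, oldRun);
        return serialize(doc);
    }

    public static void main(String[] args) {
        System.out.println(replaceAsText());
        System.out.println(replaceAsNode());
    }
}
```

In the first result the run markup is escaped (it contains &lt;w:r), which is what appears as visible tags in Word. In the second, the formatted run is part of the tree. With docx4j, the equivalent of the second variant is to insert the objects returned by XHTMLImporterImpl.convert(...) into the document content instead of marshalling them to a string, which is what the content control approach does for you.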