Tags: java, algorithm, syntax, docx4j, arabic-support

How to convert HTML to well-formed DOCX with styling attributes intact


I am trying to convert an HTML5 file to DOCX using docx4j. The HTML contains both Arabic and English text, and I have set styling on its elements. The page renders correctly in Chrome, but after conversion with docx4j the Arabic formatting is lost: MS Word reports that the Arabic text has the bold style set, yet it is not rendered bold. The RTL direction is lost as well, and tables are flipped from RTL to LTR.

As a workaround, I used a BufferedWriter to write the HTML to a .doc file. That output matches my HTML, styling included, but the Base64 image embedded in the HTML does not appear in the .doc file. Hence the need to convert to .docx. My requirement is an editable document generated from my HTML. Please guide me, as I have been scratching my head over this; none of the sample code I found works either.

Here is the code I am using to convert HTML to docx.

public boolean convertHTMLToDocx(String inputFilePath, String outputFilePath, boolean headerFlag,
        boolean footerFlag,String orientation, String logoPath, String margin, JSONObject json,boolean isArabic) {
    boolean conversionFlag;
    boolean orientationFlag = false;
    try {
        if(!orientation.equalsIgnoreCase("Y")){
            orientationFlag = true;
        }
        String stringFromFile = FileUtils.readFileToString(new File(inputFilePath), "UTF-8");
        String unescaped = stringFromFile;
        WordprocessingMLPackage wordMLPackage  = WordprocessingMLPackage.createPackage();
        NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
        wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
        ndp.unmarshalDefaultNumbering();

        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.Bidi.Heuristic", true);
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.Element.Heading.MapToStyle", true);
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.serif", "Frutiger LT Arabic 45 Light");
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.sans-serif", "Frutiger LT Arabic 45 Light");
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.monospace", "Frutiger LT Arabic 45 Light");

        XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
        xHTMLImporter.setHyperlinkStyle("Hyperlink");
        xHTMLImporter.setParagraphFormatting(FormattingOption.CLASS_PLUS_OTHER);
        xHTMLImporter.setTableFormatting(FormattingOption.CLASS_PLUS_OTHER);
        xHTMLImporter.setRunFormatting(FormattingOption.CLASS_PLUS_OTHER);

        wordMLPackage.getMainDocumentPart().getContent().addAll(xHTMLImporter.convert(unescaped, ""));

        XmlUtils.marshaltoString(wordMLPackage.getMainDocumentPart().getJaxbElement(),true,true);
        File output = new File(outputFilePath);

        wordMLPackage.save(output);

        Console.log("file path where it is stored is" + " " + output.getAbsolutePath());
        if (headerFlag || footerFlag) {
            File file = new File(outputFilePath);

            // Reload the saved package; try-with-resources closes the stream afterwards
            try (InputStream in = new FileInputStream(file)) {
                wordMLPackage = WordprocessingMLPackage.load(in);
            }
            if (headerFlag) {
                // set Header 
            }
            if (footerFlag) {
                // set Footer
            }

            wordMLPackage.save(file);
            Console.log("Finished editing the word document");
        }
        conversionFlag = true;
    } catch (InvalidFormatException e) {
        Error.log("Invalid format found:-" + getStackTrace(e));
        conversionFlag = false;
    } catch (Exception e) {
        Error.log("Error while converting:-" + getStackTrace(e));
        conversionFlag = false;
    }

    return conversionFlag;
}

Solution

  • Here is how I approached it. It is not the best approach, but I have seen it used in organizations that deploy WAR files on application servers to serve static and dynamic content over HTTP.

    I wrote the HTML as a plain byte array to a .doc file instead of a .docx. That way, the final Word document looks exactly the same as the HTML. The only issue I faced was that the binary (Base64) images were not displayed; only an empty box appeared in place of each image.

    So, I wrote two files:

    1st- Reads all the binary img tags from the HTML file and decodes the images with a Base64 decoder. It saves each decoded image to disk on the server host, builds the path to that file, and replaces the src attribute of each such img tag in the HTML with that location. (The new location is prefixed with http://{remote_server}:{remote_port}/{war_deployment_descriptor}/images/<disk_path_where_image_was_stored>.)

    2nd- A simple servlet in the WAR file deployed on the server. It listens for GET requests on /images and, on receiving a request with a path name, writes the image to the response OutputStream.

    Voila, the images started showing up.
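The first step above can be sketched with the JDK alone. This is a minimal sketch under stated assumptions, not the listing further down: the class name DataUriImages, its method names, and the imageDir and urlPrefix parameters are hypothetical, and it uses java.util.Base64 instead of the decoder used in File1.java. It finds each img src that holds a Base64 data URI, decodes the payload, writes the bytes to a directory, and swaps the src for a URL.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only
final class DataUriImages {
    // Matches src="data:image/<subtype>...;base64,<payload>" inside an <img> tag.
    // Group 1: the whole data URI, group 2: the subtype (used as file extension).
    private static final Pattern DATA_URI_IMG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc=[\"'](data:image/([a-zA-Z0-9]+)[^,\"']*;base64,[^\"']*)[\"']");

    /** Decodes the Base64 payload that follows the first comma of a data URI. */
    static byte[] decode(String dataUri) {
        String payload = dataUri.substring(dataUri.indexOf(',') + 1);
        // The MIME decoder ignores line breaks that may appear in long payloads
        return Base64.getMimeDecoder().decode(payload);
    }

    /**
     * Writes every embedded image to imageDir as Image_1.<ext>, Image_2.<ext>, ...
     * and returns the HTML with each data URI replaced by urlPrefix + file name.
     */
    static String externalizeImages(String html, Path imageDir, String urlPrefix) {
        String result = html;
        int counter = 0;
        try {
            Files.createDirectories(imageDir);
            Matcher m = DATA_URI_IMG.matcher(html);
            while (m.find()) {
                counter++;
                String fileName = "Image_" + counter + "." + m.group(2).toLowerCase();
                Files.write(imageDir.resolve(fileName), decode(m.group(1)));
                result = result.replace(m.group(1), urlPrefix + fileName);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

The counter here restarts at 1 on every call, which is enough for a single document. If several documents share one image directory, the starting value would have to come from the files already present, as the listing below does.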

    Disclaimer- These images will not be visible outside your network. I was lucky that the documents only had to work inside the customer's network. To make them available elsewhere, ask your IT team to expose the image-serving path either on the open network or on whichever network needs access.

    Edit - You can create a new WAR file for hosting these images or use the one that generates them.


    My experience- For English documents, go for .docx conversion using docx4j. For Arabic, Hebrew, or other RTL languages, go for .doc conversion as above. All such .doc documents can then easily be converted to .docx from MS Word.

    Listing the two files; please change them as per your need:

    File1.java

            public static void writeHTMLDatatoDoc(String content, String inputHTMLFile,String outputDocFile,String uniqueName) throws Exception {
                String baseTag = getRemoteServerURL()+"/{war_deployment_descriptor}/images?image=";
                String tag = "Image_";
                String ext = ".png";
                String srcTag = "";
                String pathOnServer = getDiskPath() + File.separator + "TemplateGeneration"
                        + File.separator + "generatedTemplates" + File.separator + uniqueName + File.separator + "images" + File.separator;
        
                int i = 0;

                // Make sure the target directory exists before images are written into it
                new File(pathOnServer).mkdirs();

                // Group 1 captures the value of the src attribute of each <img> tag
                Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
                Matcher m = p.matcher(content);
                while (m.find()) {
                    // srcTag holds the whole src value, e.g. data:image/png;base64,.........
                    // It is replaced below with the URL of the image saved on local disk
                    srcTag = m.group(1);

                    // Check each image on its own, so one base64 image
                    // does not cause every later image to be treated as binary
                    if (srcTag.contains("base64")) {

                        // Derive the file extension from the MIME type, e.g. image/png -> .png
                        String mimeType = extractMimeType(srcTag);
                        if (mimeType.indexOf('/') != -1)
                            ext = "." + mimeType.substring(mimeType.indexOf('/') + 1);
                        else
                            ext = ".png";

                        // Files already created for this unique entity are named Image_{i}.{image_extension}.
                        // Read the highest counter in use and increase it
                        // to generate the next image as Image_{incremented_i}.{image_extension}
                        i = findiDynamicallyFromFilesCreatedForWI(pathOnServer);
                        i++; // Increase count for next image

                        // Remove "data:image/png;base64," from srcTag, so only the encoded image data is left.
                        // Decode this using the Base64 decoder.
                        String encoded = srcTag.substring(srcTag.indexOf(",") + 1);
                        byte[] imageByteArray = decodeImage(encoded);

                        // Construct the replacement URL
                        String replacement = baseTag+pathOnServer+tag+i+ext;
                        replacement = replacement.replace("\\", "/");

                        // Write the image into the local directory on the server;
                        // try-with-resources closes the stream even if write() fails
                        try (FileOutputStream imageOutFile = new FileOutputStream(pathOnServer+tag+i+ext)) {
                            imageOutFile.write(imageByteArray);
                        }
                        content = content.replace(srcTag, replacement);
                    }
                }
                
                //Re write HTML file
                writeHTMLData(content,inputHTMLFile);
        
                // write content to doc file
                writeHTMLData(content,outputDocFile);
            }
        
            public static int findiDynamicallyFromFilesCreatedForWI(String pathOnServer) {
                int maxFileCount = 0;
                String[] dirListing = new File(pathOnServer).list();
                // list() returns null when the directory does not exist or cannot be read
                if (dirListing == null) {
                    return maxFileCount;
                }
                for (String name : dirListing) {
                    // File names look like Image_{i}.{image_extension}
                    int underscore = name.indexOf('_');
                    int dot = name.lastIndexOf('.');
                    if (underscore == -1 || dot <= underscore) {
                        continue;
                    }
                    try {
                        // Compare numerically; a string sort would rank Image_10 before Image_2
                        maxFileCount = Math.max(maxFileCount, Integer.parseInt(name.substring(underscore + 1, dot)));
                    } catch (NumberFormatException e) {
                        // Not one of the generated image files; skip it
                    }
                }
                return maxFileCount;
            }
        
            private static String extractMimeType(final String encoded) {
                final Pattern mime = Pattern.compile("^data:([a-zA-Z0-9]+/[a-zA-Z0-9]+).*,.*");
                final Matcher matcher = mime.matcher(encoded);
                if (!matcher.find())
                    return "";
                return matcher.group(1).toLowerCase();
            }
        
            private static void writeHTMLData(String inputData, String outputFilepath) {
                // try-with-resources closes the writer even if write() fails
                try (BufferedWriter writer = new BufferedWriter(
                        new OutputStreamWriter(new FileOutputStream(new File(outputFilepath)), StandardCharsets.UTF_8))) {
                    writer.write(inputData);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        
            public static byte[] decodeImage(String imageDataString) {
                return Base64.decodeBase64(imageDataString);
            }
        
            private static String readHTMLData(String inputFile) {
                String data = "";
                String str = "";
        
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(new File(inputFile)), StandardCharsets.UTF_8))) {
                    while ((str = reader.readLine()) != null) {
                        data += str;
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
                return data;
            }
    

    File2.java

     import java.io.File;
     import java.io.IOException;
     import java.nio.file.Files;
     
     import javax.servlet.ServletException;
     import javax.servlet.http.HttpServlet;
     import javax.servlet.http.HttpServletRequest;
     import javax.servlet.http.HttpServletResponse;
     import com.newgen.clos.logging.consoleLogger.Console;
     public class ImageServlet extends HttpServlet {
         public ImageServlet() {
             super();
         }
     
         @Override
         protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
             String param = request.getParameter("image");
             Console.log("Image Servlet executed");
             Console.log("File Name Requested: " + param);
             if (param == null) {
                 response.sendError(HttpServletResponse.SC_BAD_REQUEST, "Missing image parameter");
                 return;
             }
             // Strings are immutable, so the result of replace() must be assigned back
             param = param.replace("\"", "");
             param = param.replace("%20"," ");
             File file = new File(param);
             if (!file.isFile()) {
                 response.sendError(HttpServletResponse.SC_NOT_FOUND);
                 return;
             }
             response.setHeader("Content-Type", getServletContext().getMimeType(param));
             response.setHeader("Content-Length", String.valueOf(file.length()));
             response.setHeader("Content-Disposition", "inline; filename=\"" + file.getName() + "\"");
             Files.copy(file.toPath(), response.getOutputStream());
         }
     }
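
The servlet above opens whatever path arrives in the image parameter, so a request can name any file the server process is able to read. One possible guard, sketched here with the JDK only, is to treat the parameter as a name relative to a fixed image directory and reject anything that resolves outside it. The class name SafeImagePath, the method resolveInside, and the idea of a single base directory are assumptions for illustration and are not part of the original servlet.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, for illustration only
final class SafeImagePath {
    /**
     * Resolves requested against baseDir and returns the result only if it
     * still lies inside baseDir after normalization.
     */
    static Optional<Path> resolveInside(Path baseDir, String requested) {
        if (requested == null || requested.isEmpty()) {
            return Optional.empty();
        }
        Path base = baseDir.toAbsolutePath().normalize();
        Path candidate;
        try {
            // normalize() collapses "." and ".." segments before the check below
            candidate = base.resolve(requested).normalize();
        } catch (InvalidPathException e) {
            return Optional.empty();
        }
        // resolve() returns the argument itself when it is absolute, so an
        // absolute request fails the startsWith check and is rejected too
        return candidate.startsWith(base) && !candidate.equals(base)
                ? Optional.of(candidate)
                : Optional.empty();
    }
}
```

In doGet, the servlet could call SafeImagePath.resolveInside(imageDir, param) and answer with a 404 when the result is empty. That would also mean putting only the file name, not the full disk path, into the rewritten src URLs.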