pdf-generation itext tagged-pdf

How to create a tagged PDF from a "complex" XML file


I have a complex XML document and am using the iText library to create a tagged PDF from it. I have followed the examples in chapter 15 of the iText in Action book, but they are confined to a simple XML file whose hierarchy is only one level deep.

How can I extend my algorithm, which works with the flat structure, so that it can handle hierarchical XML such as the example below?

Sample "complex" XML document:

<?xml version="1.0" encoding="UTF-8" ?>
<movies>
    <movie duration="141" imdb="0062622" year="1968">
        <title>2001: A Space Odyssey</title>
        <directors>
            <director>Kubrick, Stanley</director>
        </directors>
        <countries>
            <country>United Kingdom</country>
            <country>United States</country>
        </countries>
    </movie>
</movies>

Solution

  • My teammate came up with a solution to this problem. The idea is to build a tree of DefaultMutableTreeNode objects, each holding a PdfStructureElement as its user object. The tree mirrors the XML hierarchy. Take the XML snippet in the question: the first DefaultMutableTreeNode holds a PdfStructureElement whose PdfName is movies and whose parent is writer.getStructureTreeRoot(). Its child node holds another PdfStructureElement, with the PdfName movie, whose parent is the PdfStructureElement named 'movies', and so on down the hierarchy.

    Once the steps above are complete (this is essentially structure parsing), we have a tree of PdfStructureElement objects. Next, we parse the content, visiting the corresponding tree node for each element as we go. If the node is a leaf, we take the PdfStructureElement stored in that node and begin the marked-content sequence with it. If the node is not a leaf, we only need the PdfName of its PdfStructureElement, so we can simply use the qName variable.

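    if (node.isLeaf()) {
        // Leaf: begin the sequence with the structure element stored in the node
        PdfStructureElement element = (PdfStructureElement) node.getUserObject();
        canvas.beginMarkedContentSequence(element);
    } else {
        // Non-leaf: only the tag name is needed (qName is a String, so wrap it)
        canvas.beginMarkedContentSequence(new PdfName(qName));
    }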
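
For anyone who wants to try the structure-parsing step without iText on the classpath, here is a minimal sketch of it using only the JDK. It is not our production code. StructStub is a hypothetical stand-in for PdfStructureElement (it keeps only a tag name and a parent, with a null parent playing the role of writer.getStructureTreeRoot()), and the class and method names StructureTreeSketch, TreeBuilder, parse and plan are made up for this illustration. The plan method does not write any PDF content. It only records, per node, whether the content pass would use the stored element ("element:") or just the tag name ("tag:").

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StructureTreeSketch {

    /** Hypothetical stand-in for PdfStructureElement: just a tag name and a parent. */
    static final class StructStub {
        final StructStub parent; // null plays the role of writer.getStructureTreeRoot()
        final String tag;

        StructStub(StructStub parent, String tag) {
            this.parent = parent;
            this.tag = tag;
        }
    }

    /** Structure pass: mirrors the XML hierarchy as a tree of DefaultMutableTreeNode. */
    static final class TreeBuilder extends DefaultHandler {
        private final Deque<DefaultMutableTreeNode> open = new ArrayDeque<>();
        DefaultMutableTreeNode root;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            DefaultMutableTreeNode parentNode = open.peek(); // null for the document element
            StructStub parentElem =
                    parentNode == null ? null : (StructStub) parentNode.getUserObject();
            DefaultMutableTreeNode node =
                    new DefaultMutableTreeNode(new StructStub(parentElem, qName));
            if (parentNode == null) {
                root = node;
            } else {
                parentNode.add(node);
            }
            open.push(node);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            open.pop();
        }
    }

    /** Parses the XML text and returns the root node of the mirrored tree. */
    public static DefaultMutableTreeNode parse(String xml) throws Exception {
        TreeBuilder builder = new TreeBuilder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), builder);
        return builder.root;
    }

    /** Preorder walk that records which kind of marked-content call each node would get. */
    public static void plan(DefaultMutableTreeNode node, List<String> out) {
        StructStub elem = (StructStub) node.getUserObject();
        out.add((node.isLeaf() ? "element:" : "tag:") + elem.tag);
        for (int i = 0; i < node.getChildCount(); i++) {
            plan((DefaultMutableTreeNode) node.getChildAt(i), out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<movies><movie year=\"1968\"><title>2001: A Space Odyssey</title>"
                + "<directors><director>Kubrick, Stanley</director></directors>"
                + "<countries><country>United Kingdom</country>"
                + "<country>United States</country></countries></movie></movies>";
        List<String> calls = new ArrayList<>();
        plan(parse(xml), calls);
        calls.forEach(System.out::println);
    }
}
```

The reason for doing the structure pass first is that a streaming parser cannot tell at startElement time whether an element will have children. With the tree already built, node.isLeaf() answers that question during the content pass. In a real implementation the StructStub objects would be replaced by PdfStructureElement instances created against the writer's structure tree root.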