javascript html web-scraping

Scrape parent page HTML from an iframe


I have an iframe which is used to generate a PDF from its parent page. The PDF maker (ABCpdf) requires an HTML file which it then converts.

What I do at present is scrape the parent's HTML using:

var temp;
temp=parent.document.body.parentNode.innerHTML;

Then I use the form in the iframe to submit it to the server, where it is massaged to remove things like the iframe sections before being saved as a temporary HTML file for the PDF maker.

However, the resulting HTML is mangled: tags come back as <BODY> instead of <body>, the quotes around IDs are removed, and so on.

Is there a better way to grab the HTML?

The reason I don't just regenerate the page as HTML is that the parent page is a complex report. It contains various controls to allow the user to show/hide sections or sort rows in tables. So the HTML I get has to reflect the user customisations.

thanks


Solution

  • As David mentioned, with innerHTML you're pretty much at the browser's mercy. If you want control over serialization, you could walk the DOM of the parent document yourself, appending a string representation of each node to a buffer. This takes longer and involves more code, but gives you full control over the output.

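    Something like this (a rough sketch; comments and the doctype are not emitted):

    // Elements that have no closing tag in HTML.
    var VOID_TAGS = /^(area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)$/;
    // Elements whose text content must be written out unescaped.
    var RAW_TEXT_TAGS = /^(script|style)$/;

    function escapeText(text) {
      return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
    }

    function serializeAttributes(node, buffer) {
      for (var i = 0; i < node.attributes.length; i++) {
        var attribute = node.attributes[i];
        // Old IE lists every possible attribute; keep only the ones actually set.
        if (attribute.specified === false) continue;
        buffer.push(' ' + attribute.name + '="' +
            escapeText(attribute.value).replace(/"/g, '&quot;') + '"');
      }
    }

    function serializeChildren(node, buffer) {
      var raw = RAW_TEXT_TAGS.test(node.tagName.toLowerCase());
      for (var child = node.firstChild; child; child = child.nextSibling) {
        if (child.nodeType === 3) {          // text node
          buffer.push(raw ? child.nodeValue : escapeText(child.nodeValue));
        } else if (child.nodeType === 1) {   // element
          // You can also add checks here to avoid going into iframes, etc.
          serializeElement(child, buffer);
        }
      }
    }

    function serializeElement(node, buffer) {
      var tagName = node.tagName.toLowerCase();
      buffer.push('<' + tagName);
      serializeAttributes(node, buffer);
      buffer.push('>');
      if (!VOID_TAGS.test(tagName)) {
        serializeChildren(node, buffer);
        buffer.push('</' + tagName + '>');
      }
    }

    // Start at the <html> element of the parent page and join the pieces.
    var buffer = [];
    serializeElement(window.parent.document.documentElement, buffer);
    var html = buffer.join('');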
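
The comment about avoiding iframes could be fleshed out along these lines. This is only a minimal sketch: `SKIPPED_TAGS`, `shouldSkip` and `childrenToSerialize` are hypothetical names, and which tags belong on the skip list (here `iframe`, plus `script` as a guess at what a PDF converter does not need) is an assumption to adjust for your report.

```javascript
// Hypothetical skip list; the "script" entry is an assumption, not a requirement.
var SKIPPED_TAGS = { iframe: true, script: true };

// True for element nodes (nodeType 1) whose tag name is on the skip list.
function shouldSkip(node) {
  return node.nodeType === 1 &&
      SKIPPED_TAGS.hasOwnProperty(node.tagName.toLowerCase());
}

// Returns the children of `node` that the serializer should still visit.
function childrenToSerialize(node) {
  var kept = [];
  for (var i = 0; i < node.childNodes.length; i++) {
    var child = node.childNodes[i];
    if (!shouldSkip(child)) {
      kept.push(child);
    }
  }
  return kept;
}
```

With something like this in place, serializeChildren could loop over childrenToSerialize(node) rather than over node.childNodes directly, so the iframe sections never reach the buffer and the server no longer has to strip them out afterwards.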