Tags: javascript, node.js, iframe, puppeteer

How to recursively get iframe content with Puppeteer


I'm trying to get multi-depth iframe content using Puppeteer.

Here's an example of nested iframes:

top.html:

<html>
  <title>top</title>
  <body>
    <p>top text</p>

    <iframe src="1.html"></iframe>

    <hr />

    <iframe src="2.html"></iframe>
  </body>
</html>

1.html:

<html>
  <title>1</title>
  <body>
    <p>1 text</p>

    <iframe src="1-1.html"></iframe>

  </body>
</html>

1-1.html:

<html>
  <title>1-1</title>
  <body>
    <p>1-1 text</p>
  </body>
</html>

2.html:

<html>
  <title>2</title>
  <body>
    <p>2 text</p>
  </body>
</html>

My end goal is to get an HTML string like this:

<html>
  <title>top</title>
  <body>
    <p>top text</p>

    <iframe>

      <html>
        <title>1</title>
        <body>
          <p>1 text</p>

          <iframe>
            <html>
              <title>1-1</title>
              <body>
                <p>1-1 text</p>
              </body>
            </html>
          </iframe>

        </body>
      </html>

    </iframe>

    <hr />

    <iframe>
      <html>
        <title>2</title>
        <body>
          <p>2 text</p>
        </body>
      </html>
    </iframe>

  </body>
</html>

The presence or location of the iframe, html, and body tags isn't very important, so the following is also fine for me:

<p>top text</p>

  <p>1 text</p>

    <p>1-1 text</p>

  <p>2 text</p>

After a lot of trial and error, I had some success at a single depth:

import { launch } from 'puppeteer';

(async () => {
  const browser = await launch({
    headless: 'new',
    args: [
      '--disable-web-security',
      '--disable-features=IsolateOrigins,site-per-process'
    ]
  });
  const page = await browser.newPage();
  await page.goto('file:///C:/test/src/top.html', { waitUntil: 'networkidle0' });

  const iframes = await page.$$("iframe");
  for (const iframe of iframes) {
    const frame = await iframe.contentFrame();
    if (!frame) continue;
    // frame.executionContext() is deprecated in recent Puppeteer; evaluate directly on the frame
    const res = await frame.evaluate(() => document.documentElement.outerHTML);
    if (res) {
      await iframe.evaluate((a, res) => {
        a.insertAdjacentHTML('afterend', res);
        a.remove();
      }, res);
    }
  }

  const htmlContent = await page.content();
  console.log(htmlContent);

})();

This only worked at a single depth.

I've been unsuccessful in trying to make it work recursively.

In particular, I don't fully understand the difference between the code that runs inside evaluate and the code that runs outside it.

I also expect there's an easier approach, completely different from what I tried.

I'd think capturing all of the content of a given URL, without omission, is a fairly common use case.


Solution

  • After writing hundreds of Puppeteer scripts, one thing I've come to realize is that it's easier to write browser console code than it is to work with element handles. If you don't need trusted events, you can treat Puppeteer as a thin wrapper that lets you programmatically run native, vanilla console code.

    Your goal can be achieved with handles (see the handle-based sketch further below), but I'd just do it in the browser, where you can work directly with the DOM synchronously.
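
    To your point about evaluate: the callback you pass to evaluate is serialized and run inside the page, where you have synchronous access to the DOM but no access to Node or Puppeteer variables, other than the serializable arguments (or handles) you pass in. Everything outside evaluate runs in Node and can only reach the page asynchronously, through handles and protocol calls. A minimal sketch, assuming any page with a <title>:

    // Runs in Node: `page` is a Puppeteer object; `document` doesn't exist here.
    const title = await page.evaluate(sel => {
      // Runs in the browser: `document` exists; `page` and Node APIs do not.
      return document.querySelector(sel).textContent;
    }, "title"); // extra arguments must be serializable (or be handles)
    console.log(title); // back in Node, with a plain serialized string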

    Here's a sketch of the in-browser approach. It's not perfect, but it should give you a reasonable starting point to tweak for your use case. You can use outerHTML in place of innerHTML, or try to add the <html> root using techniques from other threads, as needed.

    const puppeteer = require("puppeteer"); // ^20.2.0
    
    // url for testing; I ran `python -m http.server 8002`
    const url = "http://localhost:8002/top.html";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
    
      // networkidle0 waits for the iframes to load
      await page.goto(url, {waitUntil: "networkidle0"});
      const html = await page.evaluate(() => {
        const walk = doc => {
          const iframeHTML = [...doc.querySelectorAll("iframe")].map(
            e => walk(e.contentDocument)
          );
          const dom = new DOMParser().parseFromString(
            doc.body.innerHTML,
            "text/html"
          );
          dom
            .querySelectorAll("iframe")
            .forEach((e, i) => (e.innerHTML = iframeHTML[i]));
          return dom.documentElement.innerHTML;
        };
        return walk(document);
      });
      console.log(html);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    Output (run through Prettier with --parser html if it's hard to read):

    <head></head><body><p>top text</p>
    
        <iframe src="1.html"><head></head><body><p>1 text</p>
    
        <iframe src="1-1.html"><head></head><body><p>1-1 text</p>
      
    
    </body></iframe>
    
      
    
    </body></iframe>
    
        <hr>
    
        <iframe src="2.html"><head></head><body><p>2 text</p>
      
    
    </body></iframe>
      
    
    </body>
    

    Here's the above algorithm in pseudocode:
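
    walk(doc):
        childHTML = [walk(iframe.contentDocument) for each iframe in doc]
        copy = parse doc's body HTML into a fresh, detached document
        for each iframe in copy, at index i:
            set that iframe's innerHTML to childHTML[i]
        return copy's HTML as a string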

    The base case is a document with no iframes. It just passes its HTML up directly for its parent to fill into the corresponding iframe tag.
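
    Since your attempt used element handles, here's a rough handle-based version of the same recursion for comparison. It's only a sketch, under the same assumptions as above (same local test server, same Puppeteer version). Note that insertAdjacentHTML drops the nested <html>/<body> tags when a full document string is inserted into the parent's body, which you said is acceptable.

    const puppeteer = require("puppeteer"); // ^20.2.0

    const url = "http://localhost:8002/top.html";

    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      await page.goto(url, {waitUntil: "networkidle0"});

      // Inline each child frame's HTML into its parent, deepest frames
      // first, then return the frame's own serialized HTML.
      const inline = async frame => {
        for (const handle of await frame.$$("iframe")) {
          const child = await handle.contentFrame();
          if (!child) continue; // e.g. a detached or inaccessible frame
          const html = await inline(child);
          // Replace the <iframe> element with the child's content
          await handle.evaluate((el, html) => {
            el.insertAdjacentHTML("afterend", html);
            el.remove();
          }, html);
        }
        return frame.evaluate(() => document.documentElement.outerHTML);
      };

      console.log(await inline(page.mainFrame()));
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());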


    Caveat emptor: It's pretty seldom that one's goal in web scraping is to get all of the HTML content, so if you're using this as a sub-step you assume must be necessary to achieve a larger goal, be careful not to fall into an XY problem. 99.9% of the time, this shouldn't be necessary to do in a typical web scraping or testing situation.

    Also, there are no silver bullets in web scraping, so I imagine this will break on a good deal of sites for various surprising reasons (cross-origin iframes are one example: their contentDocument is null unless you relax security with flags like the --disable-web-security ones in your attempt).

    Disclosure: I'm the author of the linked blog post.