web-scraping google-chrome-extension browser-extension

Clean way to scrape web pages from a Manifest V3 Chrome extension


My Chrome extension scrapes a variety of web pages, but I haven't found an approach that fully works yet. Here is what I've tried that comes close:

  1. From the background script, I can fetch the page and run the HTML through htmlparser2 (I can't get a document object there, but for simple extraction this is OK). This is fine for static sites, but it doesn't work for sites that render content with JavaScript.

  2. I can create a tab with extension-supplied HTML and, in that tab, load the targets I'm trying to scrape in an iframe (after using declarativeNetRequest to remove X-Frame-Options and related headers). Unfortunately, I then run into the same-origin policy, which means I can't access the content of the iframe; specifically, iframe.contentDocument ends up as null. I tried injecting a script into the iframe using chrome.scripting.executeScript, thinking I could post a message and get it to respond, but I don't have permission to inject scripts into chrome-extension:// tabs, even though it's my own tab! (This seems dumb, but maybe it's by design.)

I know I could create a new tab per URL I want to scrape. To do that, however, I'd need a lax content script policy (I have dozens of URLs), and I really don't want to inject a content script into the user's regular browsing tabs (although I will if I find no other solution). Also, the distraction of tabs showing and hiding, or of the favicon and title on the tab changing, is pretty poor UX.

Firefox has hidden tabs, which would be nice, but they're not supported in Chrome.

Is there a cleaner approach?


Solution

    1. Use the chrome.offscreen API to create a hidden document with access to the DOM.
    2. Add a declarativeNetRequest rule to strip the X-Frame-Options response header.
    3. For each site:
      1. Register a content script that matches the site's URL, using chrome.scripting.registerContentScripts with allFrames: true and persistAcrossSessions: false.
      2. In the offscreen document, create an iframe pointing to the site.
      3. Process the site's DOM inside your content script.
      4. Send the results back via messaging.
      5. In the offscreen document, remove the iframe.
      6. Unregister the content script.
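
The background side of these steps could look roughly like the sketch below. It is only an illustration under several assumptions: the file names `offscreen.html` and `scraper.js`, the rule id, and the `{ target, cmd, url }` message shape are all hypothetical, and the manifest is assumed to declare the "offscreen", "scripting" and "declarativeNetRequest" permissions plus host permissions for the target sites. The offscreen page itself (creating and removing the iframe, relaying the result) is not shown.

```javascript
// Background service worker sketch for a Manifest V3 extension.
// Hypothetical names: OFFSCREEN_PAGE, CONTENT_SCRIPT_FILE, FRAME_RULE_ID,
// and the message shape sent to the offscreen document.
const OFFSCREEN_PAGE = 'offscreen.html';
const CONTENT_SCRIPT_FILE = 'scraper.js';
const FRAME_RULE_ID = 1;

// Step 1: create the hidden offscreen document if it does not exist yet.
async function ensureOffscreenDocument() {
  const existing = await chrome.runtime.getContexts({
    contextTypes: ['OFFSCREEN_DOCUMENT'],
  });
  if (existing.length > 0) return;
  await chrome.offscreen.createDocument({
    url: OFFSCREEN_PAGE,
    reasons: ['DOM_SCRAPING'],
    justification: 'Load target pages in an iframe and read their DOM',
  });
}

// Step 2: session rule that removes X-Frame-Options from sub-frame responses.
// initiatorDomains is meant to limit the rule to frames requested by this
// extension (assumption: the extension id is accepted as an initiator domain).
async function allowFraming() {
  await chrome.declarativeNetRequest.updateSessionRules({
    removeRuleIds: [FRAME_RULE_ID],
    addRules: [{
      id: FRAME_RULE_ID,
      priority: 1,
      action: {
        type: 'modifyHeaders',
        responseHeaders: [{ header: 'x-frame-options', operation: 'remove' }],
      },
      condition: {
        initiatorDomains: [chrome.runtime.id],
        resourceTypes: ['sub_frame'],
      },
    }],
  });
}

// Step 3: register the content script, scrape one site, then unregister.
async function scrapeSite(url, matchPattern) {
  const scriptId = 'scrape-' + Math.random().toString(36).slice(2);
  await chrome.scripting.registerContentScripts([{
    id: scriptId,
    matches: [matchPattern],
    js: [CONTENT_SCRIPT_FILE],
    allFrames: true,
    persistAcrossSessions: false,
  }]);
  try {
    // The offscreen page is assumed to create the iframe, wait for the content
    // script's result, remove the iframe, and reply with the scraped data.
    return await chrome.runtime.sendMessage({ target: 'offscreen', cmd: 'scrape', url });
  } finally {
    // Unregister even if the scrape failed, so no stale script stays behind.
    await chrome.scripting.unregisterContentScripts({ ids: [scriptId] });
  }
}
```

A caller would typically run `ensureOffscreenDocument()` and `allowFraming()` once, then call `scrapeSite(url, matchPattern)` for each target. Using a session rule and `persistAcrossSessions: false` means nothing outlives the browser session if the worker is interrupted mid-scrape.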

    To make the content script run only inside your iframe:

    1. Add a dummy random parameter to the URL, and use the resulting URL when registering the content script, so that the script matches only that URL:

      // Tag the URL with a random, otherwise meaningless query parameter
      const u = new URL(url);
      u.searchParams.set(String(Math.random()), '');
      url = u.href;
      

      In theory, a site could reject an unknown query parameter, but that is unlikely.

    2. Wrap the entire content script in a condition:

      if (location.ancestorOrigins.contains(chrome.runtime.getURL('').slice(0, -1))) {
         .....
      }
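
Both safeguards can be sketched as small helpers, shown together here for brevity even though the first would run in the background script and the rest in the content script. The names `tagUrl`, `isInsideExtensionFrame` and `scrapePage`, the `scrape-` prefix, the extracted fields, and the message shape are all hypothetical. The match pattern is built without the fragment, and with `hostname` rather than `host` on the assumption that a port is best left out of a match pattern.

```javascript
// Background side: tag a URL so a content script can be registered for it alone.
function tagUrl(url) {
  const u = new URL(url);
  const tag = 'scrape-' + Math.random().toString(36).slice(2);
  u.searchParams.set(tag, '');
  // Scheme, host, path and query only; the fragment is deliberately left out.
  const matchPattern = `${u.protocol}//${u.hostname}${u.pathname}${u.search}`;
  return { taggedUrl: u.href, matchPattern, tag };
}

// Content script side: true only when one of the frame's ancestors is the
// extension's own origin. ancestorOrigins is a DOMStringList in the browser;
// Array.from also accepts a plain array, which keeps the helper testable.
function isInsideExtensionFrame(ancestorOrigins, extensionOrigin) {
  return Array.from(ancestorOrigins).includes(extensionOrigin);
}

// Placeholder extraction logic: page title plus the text of every <h1>.
function scrapePage(doc) {
  return {
    title: doc.title,
    headings: Array.from(doc.querySelectorAll('h1'), (h) => h.textContent.trim()),
  };
}

// The typeof checks only let this file load outside a browser (for example in
// a unit test); inside a content script both globals are always defined.
if (typeof chrome !== 'undefined' && typeof location !== 'undefined') {
  const extensionOrigin = chrome.runtime.getURL('').slice(0, -1);
  if (isInsideExtensionFrame(location.ancestorOrigins, extensionOrigin)) {
    chrome.runtime.sendMessage({
      cmd: 'scraped',
      url: location.href,
      data: scrapePage(document),
    });
  }
}
```

With this split, `tagUrl(url).taggedUrl` would be the iframe's `src`, and `tagUrl(url).matchPattern` the single entry in `matches` when registering the content script. The origin check is a second line of defence in case the user happens to open the same tagged URL in a regular tab.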