My Chrome extension scrapes a variety of web pages. I haven't found an approach that fully works yet. Here is what I've tried that comes close:
From the background script, I can fetch the page and then run the HTML through htmlparser2 to parse it (I can't get a document, but for simple extraction this is OK). This is fine for static sites, but doesn't work for sites that render content with JavaScript.
I can create a tab with extension-supplied HTML, and in that tab load the targets I'm attempting to scrape in an iframe (after using declarativeNetRequest to remove X-Frame-Options and related headers). Unfortunately, I then run into the same-origin policy, which means that I can't access the content of the iframe: specifically, iframe.contentDocument ends up as null. I tried injecting a script into the iframe using chrome.scripting.executeScript, thinking I could post a message and get it to respond, but I don't have permission to inject scripts on chrome-extension:// tabs, even though it's my own tab! (This seems dumb, but maybe it's by design.)
I know I could create a new tab per URL I want to scrape; however, to do that I'd need a lax contentScripts policy (I have dozens of URLs), and I really don't want to inject a content script into the user's regular browsing tabs (although I will if I find no other solution). Also, the distraction of tabs showing and hiding, or the favicon and title on the tab changing, is pretty poor UX.
Firefox has hidden tabs, which would be nice, but they're not supported in Chrome.
Is there a cleaner approach?
allFrames: true and persistAcrossSessions: false
To make the content script run only inside your iframe:
Add a dummy random ID to the URL and use it when registering the content script:
const id = Math.random().toString(); // dummy ID, reused when registering the content script
let u = new URL(url);
u.searchParams.set(id, '');
url = u.href;
Theoretically, some sites may reject an unknown parameter, but it's unlikely.
Wrap the entire content script in a condition:
if (location.ancestorOrigins.contains(chrome.runtime.getURL('').slice(0, -1))) {
.....
}
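Putting the two steps together, a minimal sketch might look like the following. The helper names (tagUrl, buildRegistration, isFramedByExtension, loadInFrame) and the content.js file name are hypothetical and only serve the illustration. The sketch assumes the extension has host permissions for the target sites, and that Chrome matches the path part of a match pattern against the URL's path plus query string, so a marker in the query can be used to select the frame.

```javascript
// Sketch only: helper names and 'content.js' are hypothetical.

// Append a random, value-less query parameter so this URL is unique to the scrape.
function tagUrl(url) {
  const marker = 'scrape' + Math.random().toString(36).slice(2);
  const u = new URL(url);
  u.searchParams.set(marker, '');
  return { href: u.href, marker };
}

// Options for chrome.scripting.registerContentScripts, limited to URLs
// that contain the marker (assumption: the pattern's path also covers the query).
function buildRegistration(marker) {
  return {
    id: marker,
    js: ['content.js'],
    matches: [`*://*/*${marker}*`],
    allFrames: true,              // run inside iframes, not only top-level pages
    persistAcrossSessions: false, // do not keep the registration after the session
  };
}

// True when one of the frame's ancestors is the extension's own page.
// ancestorOrigins: location.ancestorOrigins (or any object with contains()).
// extensionRootUrl: chrome.runtime.getURL(''), which ends with "/".
function isFramedByExtension(ancestorOrigins, extensionRootUrl) {
  const extensionOrigin = extensionRootUrl.slice(0, -1); // strip the trailing "/"
  return ancestorOrigins.contains(extensionOrigin);
}

// Extension page: register the script first, then load the tagged URL in the iframe.
async function loadInFrame(iframe, url) {
  const { href, marker } = tagUrl(url);
  await chrome.scripting.registerContentScripts([buildRegistration(marker)]);
  iframe.src = href;
}

// content.js would then start with the guard:
// if (isFramedByExtension(location.ancestorOrigins, chrome.runtime.getURL(''))) {
//   ...scrape the page and report back, e.g. via chrome.runtime.sendMessage...
// }
```

Either check alone may be enough; using both simply narrows the chance that the script runs in a regular browsing tab.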