python, selenium, web-scraping, wget, httrack

Retrieving a complete webpage including dynamically loaded links/images


Problem

Downloading a complete working offline copy of a website that loads links/images dynamically

Research

There are existing questions (e.g. [1], [2], [3]) on Stack Overflow addressing this issue, and most of their top answers recommend wget or httrack. Both fail miserably (please do correct me if I am wrong) on pages that dynamically load links, use srcset instead of src for img tags, or load anything via JS. The obvious alternative is Selenium; however, if you have ever used Selenium in production, you quickly run into the issues such a decision brings (it is resource heavy, quite complex to run with a head-full driver, and simply not built for this purpose). That said, there are people claiming to have used it easily in production for years.
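To make the failure mode concrete, here is a minimal comparison (my own sketch, not taken from the linked questions; the URL is a placeholder) between the raw HTML a static downloader such as wget sees and the DOM after a headless browser has executed the page's JavaScript:

    import requests
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    url = "https://example.com"  # placeholder, not one of the sites in question

    # What wget/httrack effectively work with: the raw server response.
    static_html = requests.get(url, timeout=30).text

    # What the browser actually renders after scripts have run.
    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        rendered_html = driver.page_source
    finally:
        driver.quit()

    # Links and images injected by JS (including lazy-loaded srcset candidates)
    # typically show up only in rendered_html, which is why a static mirror
    # of the same page ends up incomplete.
    print(len(static_html), len(rendered_html))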

Expected Solution

A script (preferably in Python) that parses the page for links and loads them separately. I cannot seem to find any existing scripts that do this. If your answer is "then implement your own", it is pointless to ask the question in the first place; I am looking for an existing implementation.

Examples

  1. Shopify.com
  2. Websites built using Wix

Solution

  • Now there are headless versions of Selenium, as well as alternatives such as PhantomJS; either can be used with a small script (sketched below) to scrape any dynamically loaded website.

    I have implemented a generic scraper here, and explained more about the topic here
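For reference, here is a minimal sketch of such a script (my own illustration, not the generic scraper linked above): it renders the page with headless Chrome, collects image URLs from both src and srcset, and saves them next to the rendered HTML. The URL and output directory are placeholders, and recursion over linked pages is deliberately left out.

    import os
    import requests
    from urllib.parse import urljoin, urlparse
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    def save_page_offline(url, out_dir="offline_copy"):
        os.makedirs(out_dir, exist_ok=True)

        opts = Options()
        opts.add_argument("--headless")
        driver = webdriver.Chrome(options=opts)
        try:
            driver.get(url)
            html = driver.page_source

            # Gather candidate image URLs from both src and srcset attributes.
            asset_urls = set()
            for img in driver.find_elements(By.TAG_NAME, "img"):
                src = img.get_attribute("src")
                if src:
                    asset_urls.add(urljoin(url, src))
                srcset = img.get_attribute("srcset")
                if srcset:
                    for candidate in srcset.split(","):
                        asset_urls.add(urljoin(url, candidate.strip().split(" ")[0]))
        finally:
            driver.quit()

        # Save the rendered HTML and each discovered asset.
        with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
            f.write(html)
        for asset in asset_urls:
            name = os.path.basename(urlparse(asset).path) or "asset"
            try:
                resp = requests.get(asset, timeout=30)
                resp.raise_for_status()
                with open(os.path.join(out_dir, name), "wb") as f:
                    f.write(resp.content)
            except requests.RequestException:
                pass  # skip assets that fail to download

    if __name__ == "__main__":
        save_page_offline("https://example.com")  # placeholder URL

This only captures images on the rendered page; a full offline mirror would also need to rewrite the saved HTML to point at the local files and repeat the process for each followed link.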