
How to find all (possibly relative) urls on a website?


As a programming exercise, I'm writing a small Python tool to download a whole website locally. To be able to browse the site offline, I need to translate all URLs to relative URLs. Otherwise, resource files (.js, .css) would be fetched from the original website instead of the locally downloaded copies. And since I need to rewrite URLs anyway, I figured I could also change the file hierarchy. This leads to a slightly more general question:

How can I find all URLs in a website? A regex based on http://domain.tld/path won't cut it because an href attribute might contain a relative URL.

So far, I have identified the following places where URLs can appear:

HTML

CSS

JS

[EDIT] See also this post for some regexes to find URLs. It is incomplete, as it does not cover srcset.


Solution

  • Maybe a good start?

    mech-dump --links 'https://stackoverflow.com/questions/62313765'
    

    Retrieve:

    This command is installed with the Perl module WWW::Mechanize.

    On Debian-based distros, it is provided by the libwww-mechanize-perl package.
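
Since the question is about a Python tool, here is a rough standard-library sketch of the same idea: collect URL-bearing attributes from the HTML and resolve relative ones against the page URL with `urljoin`. The names `LinkCollector`, `extract_urls` and `URL_ATTRS` are made up for this example, and the attribute list is an assumption, not an exhaustive one. It only covers HTML attributes (including `srcset`); URLs inside CSS or JS would need separate handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes assumed to hold a single URL (illustrative, not exhaustive).
URL_ATTRS = {"href", "src", "poster", "action", "data"}


class LinkCollector(HTMLParser):
    """Collect URLs from HTML attributes, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name in URL_ATTRS:
                self.urls.append(urljoin(self.base_url, value.strip()))
            elif name == "srcset":
                # srcset is a comma-separated list of "URL [descriptor]".
                # Naive split: breaks on URLs that themselves contain commas.
                for candidate in value.split(","):
                    parts = candidate.split()
                    if parts:
                        self.urls.append(urljoin(self.base_url, parts[0]))


def extract_urls(html, base_url):
    """Return absolute URLs found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Calling `extract_urls(page_source, page_url)` returns absolute URLs, which can then be mapped to local paths. Note that this sketch ignores a `<base href>` element, which would change how relative URLs resolve.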