As a programming exercise, I'm crafting a small Python tool to download a whole website locally. To be able to browse the website locally, I need to translate all URLs to relative URLs; otherwise, resource files (.js, .css) would be downloaded from the original website instead of using the locally downloaded versions. And since I need to rewrite URLs anyway, I figured I could also change the file hierarchy. This leads to this slightly more general question:
How can I find all URLs in a website? A regex based on http://domain.tld/path won't cut it, because an href attribute might contain a relative URL.
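For the rewriting step itself, here is a minimal sketch of turning an absolute URL into one relative to the referencing page, standard library only, and assuming the remote path hierarchy is mirrored as-is on disk (the function name is just for illustration):

    # Sketch: make target_url relative to the page that references it.
    # Assumes both URLs are on the same host and that the remote path
    # hierarchy maps directly onto the local file layout.
    import posixpath
    from urllib.parse import urlparse

    def make_relative(target_url, page_url):
        target, page = urlparse(target_url), urlparse(page_url)
        if target.netloc != page.netloc:
            return target_url  # external resource: leave it untouched
        page_dir = posixpath.dirname(page.path) or "/"
        return posixpath.relpath(target.path, start=page_dir)

    # make_relative("http://example.tld/css/main.css",
    #               "http://example.tld/blog/post.html")  ->  "../css/main.css"

The harder part is finding the URLs in the first place.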
So far, I have identified the following (see the extraction sketch after the list):
HTML:

    href=<url> (quoted)
    src=<url> (quoted)
    srcset=<list>
    action=<url> (quoted)
    background=<url> (quoted)

CSS:

    url('<url>') or url(<url>) (can be quoted or not)
    @import(<url>)

JS:

    http://example.tld/path
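Based on that list, a minimal extraction sketch, assuming BeautifulSoup (bs4) is available; the attribute set is the one above and is surely not exhaustive (inline style="..." attributes, meta refresh, URLs built in JS strings, ...):

    # Sketch: pull URL candidates out of one HTML page, following the list above.
    import re
    from bs4 import BeautifulSoup

    URL_ATTRS = ("href", "src", "action", "background")
    CSS_URL_RE = re.compile(
        r"""url\(\s*['"]?([^'")]+)['"]?\s*\)|@import\s+['"]([^'"]+)['"]""")

    def urls_in_html(html):
        soup = BeautifulSoup(html, "html.parser")
        urls = []
        for tag in soup.find_all(True):
            for attr in URL_ATTRS:
                if tag.has_attr(attr):
                    urls.append(tag[attr])
            if tag.has_attr("srcset"):
                # srcset is a comma-separated list of "<url> <descriptor>" pairs
                for candidate in tag["srcset"].split(","):
                    parts = candidate.strip().split()
                    if parts:
                        urls.append(parts[0])
        for style in soup.find_all("style"):
            # url(...) and @import '...' inside embedded stylesheets
            for m in CSS_URL_RE.finditer(style.get_text()):
                urls.append(m.group(1) or m.group(2))
        return urls

Relative URLs collected this way would still need to be resolved against the page URL (urllib.parse.urljoin) before being rewritten.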
[EDIT] See also this post for some regexes to find URLs. It is incomplete, as srcset is not covered there.
Maybe a good start?

    mech-dump --links 'https://stackoverflow.com/questions/62313765'
This retrieves all the links from the page. The command comes with the Perl module WWW::Mechanize (package libwww-mechanize-perl on Debian-based distros).
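If you want to stay in Python rather than shell out, a rough standard-library counterpart of mech-dump --links might look like the sketch below; it only collects href/src values, its output format differs from mech-dump's, and the real site may require a User-Agent header:

    # Rough Python counterpart of `mech-dump --links`, standard library only.
    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    url = "https://stackoverflow.com/questions/62313765"
    with urllib.request.urlopen(url) as resp:
        parser = LinkCollector()
        parser.feed(resp.read().decode("utf-8", errors="replace"))
    print("\n".join(parser.links))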