As a programming exercise, I'm crafting a small Python tool to download a whole website locally. To be able to browse the website locally, I need to translate all URLs to relative URLs; otherwise, resource files (.js, .css) would be downloaded from the original website instead of using the locally downloaded versions. And since I need to rewrite URLs anyway, I figured I could also change the file hierarchy. This leads to this slightly more general question:
How can I find all URLs in a website? A regex based on http://domain.tld/path won't cut it, because an href attribute might contain a relative URL.
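For the rewriting step itself, here is a minimal sketch of turning an absolute URL into one relative to the referencing page, standard library only, and assuming the remote path hierarchy is mirrored as-is on disk (the function name is just for illustration):

    # Sketch: make target_url relative to the page that references it.
    # Assumes both URLs are on the same host and that the remote path
    # hierarchy maps directly onto the local file layout.
    import posixpath
    from urllib.parse import urlparse

    def make_relative(target_url, page_url):
        target, page = urlparse(target_url), urlparse(page_url)
        if target.netloc != page.netloc:
            return target_url  # external resource: leave it untouched
        page_dir = posixpath.dirname(page.path) or "/"
        return posixpath.relpath(target.path, start=page_dir)

    # make_relative("http://example.tld/css/main.css",
    #               "http://example.tld/blog/post.html")  ->  "../css/main.css"

The harder part is finding the URLs in the first place.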
So far, I have identified the following (see the extraction sketch after the list):
HTML:

    href=<url> (quoted)
    src=<url> (quoted)
    srcset=<list>
    action=<url> (quoted)
    background=<url> (quoted)

CSS:

    url('<url>') or url(<url>) (can be quoted or not)
    @import(<url>)

JS:

    http://example.tld/path
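Based on that list, a minimal extraction sketch, assuming BeautifulSoup (bs4) is available; the attribute set is the one above and is surely not exhaustive (inline style="..." attributes, meta refresh, URLs built in JS strings, ...):

    # Sketch: pull URL candidates out of one HTML page, following the list above.
    import re
    from bs4 import BeautifulSoup

    URL_ATTRS = ("href", "src", "action", "background")
    CSS_URL_RE = re.compile(
        r"""url\(\s*['"]?([^'")]+)['"]?\s*\)|@import\s+['"]([^'"]+)['"]""")

    def urls_in_html(html):
        soup = BeautifulSoup(html, "html.parser")
        urls = []
        for tag in soup.find_all(True):
            for attr in URL_ATTRS:
                if tag.has_attr(attr):
                    urls.append(tag[attr])
            if tag.has_attr("srcset"):
                # srcset is a comma-separated list of "<url> <descriptor>" pairs
                for candidate in tag["srcset"].split(","):
                    parts = candidate.strip().split()
                    if parts:
                        urls.append(parts[0])
        for style in soup.find_all("style"):
            # url(...) and @import '...' inside embedded stylesheets
            for m in CSS_URL_RE.finditer(style.get_text()):
                urls.append(m.group(1) or m.group(2))
        return urls

Relative URLs collected this way would still need to be resolved against the page URL (urllib.parse.urljoin) before being rewritten.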
[EDIT] See also this post for some regexes to find URLs. It is incomplete, as srcset is not covered there.
Maybe a good start?

    mech-dump --links 'https://stackoverflow.com/questions/62313765'
This retrieves all the links from the page. The command comes with the Perl module WWW::Mechanize (package libwww-mechanize-perl on Debian-based distros).
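If you want to stay in Python rather than shell out, a rough standard-library counterpart of mech-dump --links might look like the sketch below; it only collects href/src values, its output format differs from mech-dump's, and the real site may require a User-Agent header:

    # Rough Python counterpart of `mech-dump --links`, standard library only.
    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    url = "https://stackoverflow.com/questions/62313765"
    with urllib.request.urlopen(url) as resp:
        parser = LinkCollector()
        parser.feed(resp.read().decode("utf-8", errors="replace"))
    print("\n".join(parser.links))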