I'm trying to use Wget to download an HTML mirror of a GitHub repo (for example, this repo: https://github.com/seanh/oatcake).
In case it matters, I'm on macOS and using wget from Homebrew.
I don't want to download every URL under github.com/seanh/oatcake/*: that would be way too much data. I only want to download the pull request pages, including the paginated index pages for browsing through open and closed PRs and the individual pages for each PR.
I've tried to use Wget's --accept-regex argument to limit which pages are downloaded, but it's not working. Specifically:
When I test the script's accept regex on https://regex101.com/, it does match the URLs that Wget is not downloading.
Also, looking at the files that Wget has downloaded, those URLs for the second page and for individual PR pages are in the HTML that was downloaded, so Wget should be discovering those URLs.
Reading man wget, googling, asking an LLM, and just fiddling with the regex have gotten me nowhere. I'm at a loss and would appreciate any help, thanks.
Here's my script:
#!/usr/bin/env python3
import subprocess

accept_regex = "|".join([
    # Match the first page of the list of open PRs.
    r'\/pulls$',
    # Match the first page of the list of closed PRs.
    r'\/pulls\?q=is%3Apr\+is%3Aclosed$',
    # Match subsequent (paginated) pages of the list of closed PRs.
    r'\/pulls\?page=\d+&q=is%3Apr\+is%3Aclosed$',
    # Match individual PR pages.
    r'\/pull\/\d+$',
])

subprocess.run(
    [
        "wget",
        "--verbose",
        "--mirror",
        f"--accept-regex={accept_regex}",
        "--wait=1",
        "--random-wait",
        "--tries=inf",
        "--waitretry=3600",
        "--retry-connrefused",
        "--retry-on-host-error",
        "--retry-on-http-error=429",
        "--convert-links",
        "--adjust-extension",
        "--page-requisites",
        "--xattr",
        "--directory-prefix=site",
        "--append-output=wget_logfile",
        "--backups=99",
        "https://github.com/seanh/oatcake",
    ]
)
So specifying your OS "just in case" was the right move, apparently: the \d shorthand isn't supported by the POSIX regular expressions that Wget uses by default on Unix-like systems (you can find more information on another question or in this article). You just have to replace it with [0-9].
So your new accept_regex would be:
accept_regex = "|".join([
    # Match the first page of the list of open PRs.
    r'\/pulls$',
    # Match the first page of the list of closed PRs.
    r'\/pulls\?q=is%3Apr\+is%3Aclosed$',
    # Match subsequent (paginated) pages of the list of closed PRs.
    r'\/pulls\?page=[0-9]+&q=is%3Apr\+is%3Aclosed$',
    # Match individual PR pages.
    r'\/pull\/[0-9]+$',
])
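If you want a quick sanity check before kicking off a long mirror run, something like the sketch below might help. It runs the rewritten pattern against a few sample URLs with Python's re module. Note the assumptions: the sample URLs and the is_accepted helper are made up for illustration, and Python's regex engine is not the POSIX engine Wget uses, so this only shows that the pattern itself is sound, not that Wget will behave identically. The pattern sticks to constructs that both engines understand.

```python
import re

# Same pattern as above, using [0-9] instead of \d.
accept_regex = "|".join([
    r'\/pulls$',
    r'\/pulls\?q=is%3Apr\+is%3Aclosed$',
    r'\/pulls\?page=[0-9]+&q=is%3Apr\+is%3Aclosed$',
    r'\/pull\/[0-9]+$',
])

# Hypothetical sample URLs, only for illustration.
BASE = "https://github.com/seanh/oatcake"
should_match = [
    f"{BASE}/pulls",
    f"{BASE}/pulls?q=is%3Apr+is%3Aclosed",
    f"{BASE}/pulls?page=2&q=is%3Apr+is%3Aclosed",
    f"{BASE}/pull/1",
]
should_not_match = [
    BASE,
    f"{BASE}/issues",
    f"{BASE}/pull/1/files",
]


def is_accepted(url):
    """Return True if the pattern matches anywhere in the full URL."""
    return re.search(accept_regex, url) is not None


for url in should_match + should_not_match:
    print("accept" if is_accepted(url) else "reject", url)
```

As an alternative to rewriting the pattern, Wget also has a --regex-type option, and passing --regex-type=pcre should let you keep \d, but only if your Wget build was compiled with PCRE support, which I haven't checked for the Homebrew build.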