[SOLVED] How to use wget with --accept-regex to mirror GitHub issues and PRs

I'm trying to use Wget to download an HTML mirror of a GitHub repo (for example, this repo: https://github.com/seanh/oatcake).

In case it matters, I'm on macOS and using wget from Homebrew.

I don't want to download every URL under github.com/seanh/oatcake/*: that would be way too much data. I only want to download the pull request pages, including the paginated index pages for browsing through open and closed PRs and the individual pages for each PR.

I've tried to use Wget's --accept-regex argument to limit which pages are downloaded, but it's not working. Specifically:

It does download the first page of the list of open PRs: https://github.com/seanh/oatcake/pulls
It does download the first page of the list of closed PRs: https://github.com/seanh/oatcake/pulls?q=is%3Apr+is%3Aclosed
It does not download the second page or further pages of the list of closed PRs, for example: https://github.com/seanh/oatcake/pulls?page=2&q=is%3Apr+is%3Aclosed
It does not download the individual pages for any of the PRs, for example: https://github.com/seanh/oatcake/pull/81

When I test the accept regex that the script is using on https://regex101.com/, it seems that the regex should match the URLs that Wget is not downloading.

Also, looking at the files that Wget has downloaded, those URLs for the second page and for individual PR pages are in the HTML that was downloaded, so Wget should be discovering those URLs.

Reading man wget, googling, asking an LLM, and just fiddling with the regex has gotten me nowhere. I'm at a loss and would appreciate any help, thanks.

Here's my script:

#!/usr/bin/env python3
import subprocess

accept_regex = "|".join([
    # Match the first page of the list of open PRs.
    r'\/pulls$',
    # Match the first page of the list of closed PRs.
    r'\/pulls\?q=is%3Apr\+is%3Aclosed$',
    # Match subsequent (paginated) pages of the list of closed PRs.
    r'\/pulls\?page=\d+&q=is%3Apr\+is%3Aclosed$',
    # Match individual PR pages.
    r'\/pull\/\d+$',
])

subprocess.run(
    [
        "wget",
        "--verbose",
        "--mirror",
        f"--accept-regex={accept_regex}",
        "--wait=1",
        "--random-wait",
        "--tries=inf",
        "--waitretry=3600",
        "--retry-connrefused",
        "--retry-on-host-error",
        "--retry-on-http-error=429",
        "--convert-links",
        "--adjust-extension",
        "--page-requisites",
        "--xattr",
        "--directory-prefix=site",
        "--append-output=wget_logfile",
        "--backups=99",
        "https://github.com/seanh/oatcake",
    ]
)

Solution

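The problem is the \d shorthand. By default, Wget's --accept-regex uses POSIX regular expressions (see --regex-type in man wget), and POSIX syntax does not define \d as "any digit". That is why the pattern matches on regex101, which defaults to a PCRE flavor, but not in Wget. It also explains why only the two URLs whose patterns contain \d were skipped. You can find more information on another question or in this article. The fix is to replace \d with [0-9].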

So your new accept_regex would be:

accept_regex = "|".join([
    # Match the first page of the list of open PRs.
    r'\/pulls$',
    # Match the first page of the list of closed PRs.
    r'\/pulls\?q=is%3Apr\+is%3Aclosed$',
    # Match subsequent (paginated) pages of the list of closed PRs.
    r'\/pulls\?page=[0-9]+&q=is%3Apr\+is%3Aclosed$',
    # Match individual PR pages.
    r'\/pull\/[0-9]+$',
])
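
As a quick sanity check, here is a minimal sketch that runs the corrected pattern against the four URLs from the question. It uses Python's re module, which is not the POSIX engine Wget uses, so it only confirms that the [0-9] version is logically right (for a pattern this simple the two engines should agree). The helper name is_accepted is made up for illustration.

```python
import re

# Corrected pattern from the answer: [0-9] instead of \d.
accept_regex = "|".join([
    r'\/pulls$',
    r'\/pulls\?q=is%3Apr\+is%3Aclosed$',
    r'\/pulls\?page=[0-9]+&q=is%3Apr\+is%3Aclosed$',
    r'\/pull\/[0-9]+$',
])

# The four URLs listed in the question.
urls = [
    "https://github.com/seanh/oatcake/pulls",
    "https://github.com/seanh/oatcake/pulls?q=is%3Apr+is%3Aclosed",
    "https://github.com/seanh/oatcake/pulls?page=2&q=is%3Apr+is%3Aclosed",
    "https://github.com/seanh/oatcake/pull/81",
]


def is_accepted(url):
    """Hypothetical helper: True if the pattern matches anywhere in the full URL."""
    return re.search(accept_regex, url) is not None


for url in urls:
    print(is_accepted(url), url)
```

If you would rather keep \d, Wget also has a --regex-type=pcre option, but it only works when your wget binary was compiled with PCRE support, which I have not verified for the Homebrew build.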