I would like to download the HTML of all of the pages of a Google Site that can only be accessed by logging into Google. Google does not provide an API for the new Google Sites (source). To complicate matters, my Google login mandates 2SV.
I tried authenticating in Firefox, saving my cookies via the Firefox extension cookies.txt, and then using wget:
wget \
--load-cookies=cookies.txt \
--no-host-directories \
--no-directories \
--recursive \
--accept '*.html' \
https://sites.google.com/a/example.com/the-website-i-need/
The result was just a Google login page.
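As a sanity check on the cookie export itself (a hedged sketch, not part of the original attempt), the same cookies.txt can be fed to curl against the site root to see whether the session is accepted or the request bounces to a login page. The site URL is the placeholder from the question; everything else is standard curl:

# Assumes cookies.txt is the Netscape-format file exported from Firefox.
# If the cookies are honored, the final URL should stay on sites.google.com
# and the status should be 200; a bounce to accounts.google.com means the
# session cookies were not accepted for this request.
curl --silent --output /dev/null \
     --cookie cookies.txt \
     --location --max-redirs 10 \
     --write-out 'status: %{http_code}\nfinal URL: %{url_effective}\n' \
     'https://sites.google.com/a/example.com/the-website-i-need/'

If the final URL lands on accounts.google.com, the problem is with the exported cookies (or how they are sent), not with wget's recursion options.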
I also tried from within Firefox via the cliget plugin, which can generate a wget command equivalent to what Firefox does for downloads. My idea was to add the recursive options to the generated command. However, the plugin just reported "No downloads for this session", even after saving the root page of the Google Site as an .html file. I then initiated downloading a PDF file from the Google Site, which did trigger the cliget plugin. However, the resulting wget command received a "302 Moved Temporarily" response, which wget faithfully followed, but the process repeated until wget finally gave up with "20 redirections exceeded".
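To see where that redirect loop actually goes (another hedged sketch; the PDF path below is hypothetical and stands in for the URL cliget captured), each 302 hop can be printed one at a time with curl instead of being followed blindly:

# Hypothetical PDF URL standing in for the one cliget generated.
url='https://sites.google.com/a/example.com/the-website-i-need/some-file.pdf'
# Follow at most 10 hops manually, printing each redirect target, to see
# whether the chain keeps bouncing between the site and a Google login
# endpoint (which would explain the "20 redirections exceeded" failure).
for i in $(seq 1 10); do
    next=$(curl --silent --output /dev/null \
                --cookie cookies.txt \
                --write-out '%{redirect_url}' \
                "$url")
    [ -z "$next" ] && { echo "final: $url"; break; }
    echo "302 -> $next"
    url=$next
done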
Can this be done with OAuth or some other method of authentication?
Related: Accessing a non-Public Google Sites page using curl + Bearer Token
I finally found a way to do this. Google Takeout allows you (in theory) to download all of your Google data, including Google Sites.
There are some limitations:
The short version:
The detailed version: