I'm trying to automate downloading the holdings of Vanguard funds from the web. The links resolve through JavaScript, so I'm using Pyppeteer, but I'm not getting the file. Note: the link says CSV, but it actually serves an Excel file.
From my browser the process works like this: open the portfolio page, click the "See count total holdings" link, then click "Export to CSV" to download the file.
My attempt to replicate this in Python follows. The first click seems to work, because I get different HTML back, but the second click gives me the same page again, not a download.
import asyncio
from pyppeteer import launch
import os


async def get_page(browser, url):
    page = await browser.newPage()
    await page.goto(url)
    return page


async def fetch(url):
    browser = await launch(options={'args': ['--no-sandbox']})  # headless=True,
    page = await get_page(browser, url)
    await page.waitFor(2000)

    # save the page so we can see the source
    wkg_dir = 'vanguard'
    t_file = os.path.join(wkg_dir, '8225.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    accept = await page.xpath('//a[contains(., "See count total holdings")]')
    print(f'Found {len(accept)} "See count total holdings" links')
    if accept:
        await accept[0].click()
        await page.waitFor(2000)
    else:
        print('DID NOT FIND THE LINK')
        return False

    # save the pop-up page for debug
    t_file = os.path.join(wkg_dir, 'first_page.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    links = await page.xpath('//a[contains(., "Export to CSV")]')
    print(f'Found {len(links)} "Export to CSV" links')  # 3 of these
    for i, link in enumerate(links):
        print(f'Trying link {i}')
        await link.click()
        await page.waitFor(2000)
        t_file = os.path.join(wkg_dir, f'csv_page{i}.csv')
        with open(t_file, 'w', encoding="utf-8") as ef:
            ef.write(await page.content())
    return True


# ---------- Main ------------
# Set constants and global variables
url = 'https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio'
loop = asyncio.get_event_loop()
status = loop.run_until_complete(fetch(url))
I'd love to hear suggestions from anyone who knows Puppeteer / Pyppeteer well.
First of all, page.waitFor(2000) should be a last resort. It's a race condition that can lead to a false negative at worst, and slows your scrape down at best. I recommend page.waitForXPath, which spawns a tight polling loop so your code continues as soon as the XPath becomes available.
Also, on the topic of element selection, I'd use text() in your XPath instead of ., which is more precise.
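For example, a minimal sketch of that approach (assuming the link text stays stable across page loads):

# Wait for the link itself instead of sleeping a fixed two seconds.
xp = '//a[contains(text(), "See count total holdings")]'
await page.waitForXPath(xp)     # polls until the element exists (30 s default timeout)
accept, = await page.xpath(xp)  # safe to grab the handle now
await accept.click()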
I'm not sure how ef.write(await page.content()) is working for you -- that should only give page HTML, not the XLSX download. The link click triggers downloads via a dialog. Accepting this download involves enabling Chrome downloads with:

await page._client.send("Page.setDownloadBehavior", {
    "behavior": "allow",
    "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
})
The next hurdle is bypassing or suppressing the "multiple downloads" permission prompt Chrome displays when you try to download multiple files on the same page. I wasn't able to figure out how to stop this, so my code just navigates to the page for each link as a workaround. I'll leave my solution as sub-optimal but functional and let others (or my future self) improve on it.
By the way, two of the XLSX files at indices 1 and 2 seem to be identical. This code downloads all 3 anyway, but you can probably skip the last depending on whether the page changes or not over time -- I'm not familiar with it.
I'm also using a trick for clicking non-visible elements: triggering the click from the browser context rather than through Puppeteer's mouse emulation, with page.evaluate("e => e.click()", csv_link).
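In other words, something like this (a quick sketch using the first matching link):

# Click via the DOM rather than Puppeteer's mouse emulation, so the element
# doesn't need to be visible or scrolled into view.
csv_links = await page.xpath('//a[contains(text(), "Export to CSV")]')
await page.evaluate("e => e.click()", csv_links[0])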
Here's my attempt:
import asyncio
from pyppeteer import launch


async def get_csv_links(page, url):
    await page.goto(url)
    xp = '//a[contains(text(), "See count total holdings")]'
    await page.waitForXPath(xp)
    accept, = await page.xpath(xp)
    await accept.click()
    xp = '//a[contains(text(), "Export to CSV")]'
    await page.waitForXPath(xp)
    return await page.xpath(xp)


async def fetch(url):
    browser = await launch(headless=False)
    page, = await browser.pages()
    await page._client.send("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
    })
    csv_links = await get_csv_links(page, url)

    for i in range(len(csv_links)):
        # reload the page each time as a hack to avoid the multiple-downloads prompt
        csv_link = (await get_csv_links(page, url))[i]
        await page.evaluate("e => e.click()", csv_link)
        # let the download finish; this is a race condition
        await page.waitFor(3000)


if __name__ == "__main__":
    url = "https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio"
    asyncio.run(fetch(url))
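If you want to avoid that final waitFor(3000), here's a rough, untested sketch of one way: poll the download directory until the files show up. It assumes the files land directly in the downloadPath set above and that Chrome writes .crdownload temp files while a download is in progress.

import asyncio
import glob
import os
import time


async def wait_for_downloads(download_dir, pattern="*.xlsx", timeout=30):
    # Poll until at least one file matching `pattern` exists and no Chrome
    # ".crdownload" temp files remain, or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        finished = glob.glob(os.path.join(download_dir, pattern))
        in_progress = glob.glob(os.path.join(download_dir, "*.crdownload"))
        if finished and not in_progress:
            return finished
        await asyncio.sleep(0.5)
    raise TimeoutError(f"no completed download matching {pattern} in {download_dir}")

You'd call await wait_for_downloads(r"C:\Users\you\Desktop") right after the page.evaluate(...) click, in place of the fixed sleep.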
Notes for improvement:
- Try --enable-parallel-downloading or a setting like 'profile.default_content_setting_values.automatic_downloads': 1 to suppress the "multiple downloads" warning.
- waitFor(3000) can be removed; one option is polling for the files you expect to appear (a rough sketch of this is shown above), and you can visit the linked thread for other ideas.
Other resources for posterity: