python-3.x, python-asyncio, pyppeteer

How to fetch a URL asynchronously with pyppeteer (one browser, many tabs)


I want my script to

  1. Open, say, 3 tabs

  2. Asynchronously fetch a URL (the same one in each tab)

  3. Save the response

  4. Sleep for 4 seconds

  5. Parse the response with a regex (I tried BeautifulSoup but it's too slow) and return a token

  6. Repeat the fetch-and-parse cycle several times within the 3 tabs
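For step 5, this is roughly the kind of extraction I mean; the `csrf_token` field name and the pattern here are made up for illustration — the real page's markup would dictate the actual regex:

```python
import re

# Hypothetical example: the token is assumed to live in a hidden <input>
# field named "csrf_token" -- adjust the pattern to the real page markup.
TOKEN_RE = re.compile(r'name="csrf_token"\s+value="([^"]+)"')

def extract_token(html):
    """Return the first token found in the HTML, or None."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None

sample = '<input type="hidden" name="csrf_token" value="abc123">'
print(extract_token(sample))  # -> abc123
```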

My problem is with step 2. I have an example script, but it fetches the URL synchronously. I would like to make it asynchronous.

import asyncio
from pyppeteer import launch

urls = ['https://www.example.com']


async def main():
    browser = await launch(headless=False)
    for url in urls:
        page1 = await browser.newPage()
        page2 = await browser.newPage()
        page3 = await browser.newPage()

        await page1.goto(url)
        await page2.goto(url)
        await page3.goto(url)

        title1= await page1.title()
        title2= await page2.title()
        title3= await page3.title()

        print(title1)
        print(title2)
        print(title3)

    #await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Also, as you can see, the code is not very concise. How do I go about making it asynchronous?

Also, if it helps, I have other pyppeteer scripts which don't fit my needs, in case it would be easier to convert one of those:

import asyncio
from pyppeteer import launch

url = 'http://www.example.com'
browser = None

async def fetchUrl(url):
    # Define browser as a global variable to ensure that the browser window is only created once in the entire process
    global browser
    if browser is None:
        browser = await launch(headless=False)

    page = await browser.newPage()

    await page.goto(url)
    #await asyncio.wait([page.waitForNavigation()])
    #str = await page.content()
    #print(str)

# Execute this function multiple times for testing
asyncio.get_event_loop().run_until_complete(fetchUrl(url))
asyncio.get_event_loop().run_until_complete(fetchUrl(url))

The script is asynchronous, but each `run_until_complete` call runs to completion before the next one starts, so it's effectively synchronous.
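The difference is easier to see without pyppeteer at all: a single `asyncio.run` (or `run_until_complete`) call can drive several coroutines concurrently if they are gathered into one awaitable, whereas two separate calls run back to back. A minimal sketch, using `asyncio.sleep` as a stand-in for the page fetch:

```python
import asyncio
import time

async def fake_fetch(url):
    # Stand-in for page.goto(): sleeps instead of doing network I/O.
    await asyncio.sleep(0.2)
    return url

async def main():
    urls = ["https://a.example", "https://b.example", "https://c.example"]
    # gather() schedules all three coroutines on the same loop run,
    # so their waits overlap instead of running one after another.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
print(results)
print(f"{elapsed:.2f}s")  # ~0.2s total, not ~0.6s
```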

# cat test.py
import asyncio
import time
from pyppeteer import launch

WEBSITE_LIST = [
    'http://envato.com',
    'http://amazon.co.uk',
    'http://example.com',
]

start = time.time()

async def fetch(url):
    browser = await launch(headless=False, args=['--no-sandbox'])
    page = await browser.newPage()
    await page.goto(f'{url}', {'waitUntil': 'load'})
    print(f'{url}')
    await asyncio.sleep(1)
    await page.close()
    #await browser.close()

async def run():
    tasks = []

    for url in WEBSITE_LIST:
        task = asyncio.ensure_future(fetch(url))
        tasks.append(task)

    responses = await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)

print(f'It took {time.time()-start} seconds.')

The script is asynchronous, but it launches a separate browser for each URL, which consumes far too many resources.


Solution

  • This will open each URL in a separate tab:

    import asyncio
    import traceback
    
    from pyppeteer import launch
    
    URLS = [
        "http://envato.com",
        "http://amazon.co.uk",
        "http://example.com",
    ]
    
    
    async def fetch(browser, url):
        page = await browser.newPage()
    
        try:
            await page.goto(f"{url}", {"waitUntil": "load"})
        except Exception:
            traceback.print_exc()
            return (url, "")
        else:
            html = await page.content()
            return (url, html)
        finally:
            await page.close()
    
    
    async def main():
        tasks = []
        browser = await launch(headless=True, args=["--no-sandbox"])
    
        for url in URLS:
            tasks.append(asyncio.create_task(fetch(browser, url)))
    
        for coro in asyncio.as_completed(tasks):
            url, html = await coro
            print(f"{url}: ({len(html)})")
    
        await browser.close()
    
    
    if __name__ == "__main__":
        asyncio.run(main())
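To cover step 6 of the question (reusing a fixed number of tabs across many fetches), the same idea extends to a small worker pool: open N pages once, then feed URLs through an `asyncio.Queue`. Below is a sketch of just the pattern, with the browser interaction stubbed out by a sleep — in real use, each worker would hold one `page` and replace the stub with the `goto`/`content`/regex calls:

```python
import asyncio

async def worker(name, queue, results):
    # Each worker plays the role of one long-lived tab: it stays open
    # and keeps pulling URLs until it receives the shutdown sentinel.
    while True:
        url = await queue.get()
        if url is None:          # sentinel: no more work
            queue.task_done()
            return
        # Stand-in for: await page.goto(url); html = await page.content(); ...
        await asyncio.sleep(0.01)
        results.append((name, url))
        queue.task_done()

async def main(urls, n_tabs=3):
    queue = asyncio.Queue()
    results = []
    workers = [asyncio.ensure_future(worker(f"tab{i}", queue, results))
               for i in range(n_tabs)]
    for url in urls:
        queue.put_nowait(url)
    for _ in workers:
        queue.put_nowait(None)   # one sentinel per worker
    await asyncio.gather(*workers)
    return results

urls = [f"https://example.com/page/{i}" for i in range(7)]
results = asyncio.run(main(urls))
print(len(results))  # 7 fetches handled by 3 "tabs"
```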