python, python-3.x, python-asyncio, aiohttp, pytest-aiohttp

Aiohttp async session requests


So I've been scraping a website's (www.cardsphere.com) protected pages with requests, using a Session, like so:

import requests

payload = {
    'email': <enter-email-here>,
    'password': <enter-site-password-here>
}

with requests.Session() as request:
    request.get(<site-login-page>)   # go through the session, not the requests module, so cookies persist
    request.post(<site-login-here>, data=payload)
    request.get(<site-protected-page1>)
    # save stuff from page 1
    request.get(<site-protected-page2>)
    # save stuff from page 2
    # ...
    request.get(<site-protected-pageN>)
    # save stuff from page N

Now, since it's quite a few pages, I wanted to speed things up with aiohttp + asyncio... but I'm missing something. I've been able to more or less use it to scrape unprotected pages, like so:

import asyncio
import aiohttp

async def get_cards(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.text()
            # <do-stuff-with-data>

urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    # ...
    'https://www.<urlN>.com'
]

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(get_cards(url) for url in urls)
    )
)

That gave some results, but how do I do it for pages that require login? I tried adding session.post(<login-url>, data=payload) inside the async function, but that obviously didn't work out well; it just kept logging in over and over. Is there a way to "set up" an aiohttp ClientSession before the gather call? I need to log in first and then, on the same session, get data from a bunch of protected links with asyncio + aiohttp.

I'm still rather new to Python, and even newer to async, so I'm missing some key concept here. If anybody could point me in the right direction, I'd greatly appreciate it.


Solution

  • This is the simplest I can come up with. Depending on what you do in <do-stuff-with-data> you may run into other trouble regarding concurrency; down the rabbit hole you go... Just kidding, it's a little more complicated to wrap your head around coroutines, futures, and tasks, but once you get it, it's as simple as sequential programming.

    import asyncio
    import aiohttp
    
    
    async def get_cards(url, session, sem):
        # The semaphore caps how many requests run at once; the shared
        # session carries the login cookies set in main().
        async with sem, session.get(url) as resp:
            data = await resp.text()
            # <do-stuff-with-data>
    
    
    urls = [
        'https://www.<url1>.com',
        'https://www.<url2>.com',
        'https://www.<urlN>.com'
    ]
    
    
    async def main():
        sem = asyncio.Semaphore(100)  # at most 100 requests in flight at a time
        async with aiohttp.ClientSession() as session:
            # Log in once; the session stores the auth cookies and sends them
            # with every later request made through it.
            await session.get('auth_url')
            await session.post('auth_url', data={'user': None, 'pass': None})
            tasks = [asyncio.create_task(get_cards(url, session, sem)) for url in urls]
            results = await asyncio.gather(*tasks)
            return results
    
    
    asyncio.run(main())
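
If <do-stuff-with-data> writes to some shared structure, the cleanest way around the concurrency trouble mentioned above is to have each coroutine return its result and let asyncio.gather do the aggregation. Here's a minimal sketch of that variant, assuming a hypothetical parse_cards helper and placeholder login URL and payload keys (none of these are from the original post):

    import asyncio
    import aiohttp


    def parse_cards(html):
        # Hypothetical parser; stands in for <do-stuff-with-data>.
        return len(html)


    async def get_cards(url, session, sem):
        async with sem, session.get(url) as resp:
            resp.raise_for_status()       # fail loudly on 4xx/5xx
            html = await resp.text()
            return parse_cards(html)      # return a value instead of mutating shared state


    async def main(urls):
        sem = asyncio.Semaphore(100)
        async with aiohttp.ClientSession() as session:
            # Log in once; 'auth_url' and the payload keys are placeholders.
            async with session.post('auth_url', data={'email': '...', 'password': '...'}) as login:
                login.raise_for_status()
            # gather preserves argument order, so results[i] belongs to urls[i].
            return await asyncio.gather(*(get_cards(u, session, sem) for u in urls))


    results = asyncio.run(main(['https://www.<url1>.com', 'https://www.<url2>.com']))

Since each task returns its data instead of sharing state, no locks are needed, and a failed page surfaces as an exception from gather rather than silently corrupting your results.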