python, asynchronous, python-requests, threadpoolexecutor

Getting the status codes of 2000 URLs and storing them in a dictionary quickly


I want to get the status code of 2000 URLs and store the results in a dictionary, with the status code as the key and the URLs that returned it as the value. I also want to do this as quickly as possible. I've seen material about async and ThreadPoolExecutor, but I don't know how to use either of them yet. How could I solve this problem efficiently?

Here's what I have tried:

import requests 


def check_urls(list_of_urls):
    
    result = {"200": [], "404": [], "anything_else": []}
    
    for url in list_of_urls:
        try:
            response = requests.get(url)
            if response.status_code == 200:
                result["200"].append(url)
            elif response.status_code == 404:
                result["404"].append(url)
            else:
                result["anything_else"].append((url, f"HTTP Error {response.status_code}"))
        except requests.exceptions.RequestException as e:
            result["anything_else"] = ((url, e))
    
    return result 

Is there any way to improve this code by making it faster to process 2000 URLs? I have tried requests.head but it's not accurate.


Solution

  • Let's assume that you have all URLs stored in a list:

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://nonexistant-subdomain.python.org/']
    

    Then you can use either of these two solutions:

    Solution 1 - ThreadPoolExecutor (multithreading)

    You can use the concurrent.futures library for multi-threaded execution. I also recommend checking the library documentation - it has a very neat example that is very close to your case (https://docs.python.org/3/library/concurrent.futures.html)

    import concurrent.futures
    from multiprocessing import cpu_count
    import requests
    
    def get_url_status_code(url):
        # Retrieve a single page and return its status code
        try:
            response = requests.get(url, timeout=10)
            return response.status_code
        except requests.exceptions.RequestException:
            # Treat any connection error or timeout as a 404 for simplicity
            return 404
    
    
    # ThreadPoolExecutor needs the number of parallel threads to create
    # A common rule of thumb is 2 times the number of cores
    n_threads = 2 * cpu_count()
    print(f"Count of threads to use - {n_threads}")
    
    
    # Use a 'with' statement to ensure threads are cleaned up promptly after the jobs finish
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
    
        # Start the load operations and mark each future with its URL
        # - executor.submit sends a job into the thread pool
        # - to tell the jobs apart, each future is mapped back to its input parameter (`url`)
        future_to_url = {executor.submit(get_url_status_code, url): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            status_code = future.result()
            print(url, status_code)
    
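    If you want the dictionary output from your question rather than printed lines, here is a minimal sketch of the same ThreadPoolExecutor approach that collects the results into the `{"200": [], "404": [], "anything_else": []}` structure from your question (the helper name `check_urls_threaded` and the thread count of 32 are just for illustration):

    import concurrent.futures
    import requests
    
    
    def check_urls_threaded(list_of_urls, n_threads=32):
        # Same output shape as in the question
        result = {"200": [], "404": [], "anything_else": []}
    
        def get_status(url):
            # Return the status code, or the exception if the request failed
            try:
                return requests.get(url, timeout=10).status_code
            except requests.exceptions.RequestException as e:
                return e
    
        with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
            future_to_url = {executor.submit(get_status, url): url for url in list_of_urls}
            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future]
                status = future.result()
                if status == 200:
                    result["200"].append(url)
                elif status == 404:
                    result["404"].append(url)
                else:
                    result["anything_else"].append((url, status))
    
        return result
    
    
    # result = check_urls_threaded(URLS)
    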

    Solution 2 - async

    Unfortunately, the requests library doesn't support async calls, so you need a different HTTP client such as aiohttp.

    import asyncio
    import aiohttp
    
    
    async def fetch_url(session, url: str):
        try:
            # Use the response as a context manager so the connection is released
            async with session.get(url, timeout=10) as response:
                return response.status
        except Exception:
            # Treat any connection error or timeout as a 404, as above
            return 404
        
    
    async def async_aiohttp_get_all(urls, cookies):
    
        # The session context manager ensures connections are closed correctly, even on errors
        async with aiohttp.ClientSession(cookies=cookies) as session:
            result = await asyncio.gather(*[
                fetch_url(session, url) for url in urls
            ])
            return result
    
    
    # asyncio.run starts the event loop and runs the coroutine to completion
    results = asyncio.run(async_aiohttp_get_all(URLS, None))
    
    # print response status codes
    for i, url in enumerate(URLS):
        print(url, results[i])
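    
    If you also want the dictionary-shaped output from your question here, one possible sketch (relying on the fact that asyncio.gather preserves the input order, so results lines up with URLS):
    
    # Group the async results into the same dictionary shape as in the question
    result = {"200": [], "404": [], "anything_else": []}
    for url, status in zip(URLS, results):
        if status == 200:
            result["200"].append(url)
        elif status == 404:
            result["404"].append(url)
        else:
            result["anything_else"].append((url, f"HTTP Error {status}"))
    
    print(result)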