python, asynchronous, python-requests, threadpoolexecutor

Getting the status codes of 2000 URLs and storing them in a dictionary quickly


I want to get the status code of 2000 URLs and store the results in a dictionary, with the status code as the key and the URLs that returned it as the value. I also want to do this as quickly as possible. I've seen material about async and ThreadPoolExecutor, but I don't know how to use either of them yet. How could I solve this problem efficiently?

Here's what I have tried:

import requests 


def check_urls(list_of_urls):
    
    result = {"200": [], "404": [], "anything_else": []}
    
    for url in list_of_urls:
        try:
            response = requests.get(url)
            if response.status_code == 200:
                result["200"].append(url)
            elif response.status_code == 404:
                result["404"].append(url)
            else:
                result["anything_else"].append((url, f"HTTP Error {response.status_code}"))
        except requests.exceptions.RequestException as e:
            result["anything_else"] = ((url, e))
    
    return result 

Is there any way to improve this code by making it faster to process 2000 URLs? I have tried requests.head but it's not accurate.


Solution

  • Let's assume that you have all URLs stored in a list:

    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://nonexistant-subdomain.python.org/']
    

    Then you can use either of these two solutions:

    Solution 1 - ThreadPoolExecutor (multithreading)

    You can use the concurrent.futures library for multi-threaded execution. I also recommend checking the library documentation - it has a very neat example that is very close to your case (https://docs.python.org/3/library/concurrent.futures.html)

    import concurrent.futures
    from multiprocessing import cpu_count
    import requests
    
    def get_url_status_code(url):
        # Retrieve a single page and return its status code
        try:
            response = requests.get(url, timeout=10)
            return response.status_code
        except requests.exceptions.RequestException:
            # Treat any connection error or timeout as a 404 for simplicity
            return 404
    
    
    # ThreadPoolExecutor needs the number of parallel threads to create
    # A common rule of thumb is 2 times the number of cores
    n_threads = 2 * cpu_count()
    print(f"Count of threads to use - {n_threads}")
    
    
    # Use a 'with' statement to ensure threads are cleaned up promptly after the jobs finish
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
    
        # Start the load operations and mark each future with its URL
        # - executor.submit sends a job into the thread pool
        # - to tell the jobs apart, each future is mapped back to its input parameter (`url`)
        future_to_url = {executor.submit(get_url_status_code, url): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            status_code = future.result()
            print(url, status_code)
    
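    If you want the dictionary output from your question rather than printed lines, here is a minimal sketch of the same ThreadPoolExecutor approach that collects the results into the `{"200": [], "404": [], "anything_else": []}` structure from your question (the helper name `check_urls_threaded` and the thread count of 32 are just for illustration):

    import concurrent.futures
    import requests
    
    
    def check_urls_threaded(list_of_urls, n_threads=32):
        # Same output shape as in the question
        result = {"200": [], "404": [], "anything_else": []}
    
        def get_status(url):
            # Return the status code, or the exception if the request failed
            try:
                return requests.get(url, timeout=10).status_code
            except requests.exceptions.RequestException as e:
                return e
    
        with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
            future_to_url = {executor.submit(get_status, url): url for url in list_of_urls}
            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future]
                status = future.result()
                if status == 200:
                    result["200"].append(url)
                elif status == 404:
                    result["404"].append(url)
                else:
                    result["anything_else"].append((url, status))
    
        return result
    
    
    # result = check_urls_threaded(URLS)
    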

    Solution 2 - async

    Unfortunately, the requests library doesn't support async calls, so you need a different HTTP client such as aiohttp.

    import asyncio
    import aiohttp
    
    
    async def fetch_url(session, url: str):
        try:
            # Use the response as a context manager so the connection is released
            async with session.get(url, timeout=10) as response:
                return response.status
        except Exception:
            # Treat any connection error or timeout as a 404, as above
            return 404
        
    
    async def async_aiohttp_get_all(urls, cookies):
    
        # The session context manager ensures connections are closed correctly, even on errors
        async with aiohttp.ClientSession(cookies=cookies) as session:
            result = await asyncio.gather(*[
                fetch_url(session, url) for url in urls
            ])
            return result
    
    
    # asyncio.run starts the event loop and runs the coroutine to completion
    results = asyncio.run(async_aiohttp_get_all(URLS, None))
    
    # print response status codes
    for i, url in enumerate(URLS):
        print(url, results[i])
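    
    If you also want the dictionary-shaped output from your question here, one possible sketch (relying on the fact that asyncio.gather preserves the input order, so results lines up with URLS):
    
    # Group the async results into the same dictionary shape as in the question
    result = {"200": [], "404": [], "anything_else": []}
    for url, status in zip(URLS, results):
        if status == 200:
            result["200"].append(url)
        elif status == 404:
            result["404"].append(url)
        else:
            result["anything_else"].append((url, f"HTTP Error {status}"))
    
    print(result)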