I want to get the status codes of 2,000 URLs. I want to store each status code as a dictionary key, with the URLs that returned it as the value. I also want to do this as quickly as possible. I've seen stuff about async and ThreadPoolExecutor, but I don't know how to use either of them yet. How could I solve this problem efficiently?
Here's what I have tried:
import requests

def check_urls(list_of_urls):
    result = {"200": [], "404": [], "anything_else": []}
    for url in list_of_urls:
        try:
            response = requests.get(url)
            if response.status_code == 200:
                result["200"].append(url)
            elif response.status_code == 404:
                result["404"].append(url)
            else:
                result["anything_else"].append((url, f"HTTP Error {response.status_code}"))
        except requests.exceptions.RequestException as e:
            result["anything_else"].append((url, e))
    return result
Is there any way to improve this code by making it faster to process 2,000 URLs? I have tried requests.head, but it's not accurate.
Let's assume that you have all the URLs stored in a list:
URLS = ['http://www.foxnews.com/',
'http://www.cnn.com/',
'http://europe.wsj.com/',
'http://www.bbc.co.uk/',
'http://nonexistant-subdomain.python.org/']
Then you can use either of these two solutions:
Solution 1 - multithreading with ThreadPoolExecutor
You can use the concurrent.futures library for multi-threaded execution. I would also recommend checking the library documentation - it has a very neat example that is very close to your case (https://docs.python.org/3/library/concurrent.futures.html)
import concurrent.futures
from multiprocessing import cpu_count

import requests

def get_url_status_code(url):
    # Retrieve a single page and return its status code
    try:
        response = requests.get(url, timeout=10)
        return response.status_code
    except requests.exceptions.RequestException:
        # Treat connection errors, timeouts, etc. as 404 for simplicity
        return 404

# The thread pool needs to know how many parallel threads to create.
# A common starting point is 2 times the number of cores; for I/O-bound
# work like this you can often go higher.
n_threads = 2 * cpu_count()
print(f"Using {n_threads} threads")

# Use a 'with' statement to ensure threads are cleaned up promptly after finishing their jobs
with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
    # Start the load operations and mark each future with its URL
    # - to send a job into the thread pool we use executor.submit
    # - to tell the jobs apart, map each future back to its input parameter (`url`)
    future_to_url = {executor.submit(get_url_status_code, url): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        status_code = future.result()
        print(url, status_code)
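If you want the grouped dictionary from your question rather than printed output, here is a minimal sketch built on the same idea (it reuses get_url_status_code and URLS from above; the helper name check_urls_threaded and the worker count of 32 are just illustrative choices, not part of the original answer):

from collections import defaultdict

import concurrent.futures

def check_urls_threaded(urls, n_workers=32):
    # Map each status code to the list of URLs that returned it,
    # e.g. {200: [...], 404: [...], 500: [...]}
    result = defaultdict(list)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        future_to_url = {executor.submit(get_url_status_code, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            result[future.result()].append(future_to_url[future])
    return dict(result)

statuses = check_urls_threaded(URLS)
print(statuses.get(200, []))

Since the work is I/O-bound, you can usually raise the worker count well above 2 * cpu_count() when processing 2,000 URLs.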
Solution 2 - async
Unfortunately, the requests library doesn't support async calls, so you need to improvise a bit and use aiohttp instead.
import asyncio

import aiohttp

async def fetch_url(session, url: str):
    try:
        # Use a context manager so the connection is released back to the pool
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            return response.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        # Treat connection errors and timeouts as 404 for simplicity
        return 404

async def async_aiohttp_get_all(urls, cookies):
    # The session context manager is required to correctly close async connections on errors
    async with aiohttp.ClientSession(cookies=cookies) as session:
        return await asyncio.gather(*[
            fetch_url(session, url) for url in urls
        ])

# asyncio.run triggers execution of the async coroutines
results = asyncio.run(async_aiohttp_get_all(URLS, None))

# print the response status codes
for i, url in enumerate(URLS):
    print(url, results[i])
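For 2,000 URLs you may also want to cap how many requests run at once and collect the results into the status-code dictionary you asked for. Here is a rough sketch (the check_urls_async name, the connection limit of 100, and the grouping step are my own additions on top of the fetch_url helper above):

from collections import defaultdict

import asyncio
import aiohttp

async def check_urls_async(urls, max_connections=100):
    # TCPConnector(limit=...) caps the number of simultaneous connections
    connector = aiohttp.TCPConnector(limit=max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        statuses = await asyncio.gather(*[fetch_url(session, url) for url in urls])
    # Group URLs by the status code they returned, e.g. {200: [...], 404: [...]}
    result = defaultdict(list)
    for url, status in zip(urls, statuses):
        result[status].append(url)
    return dict(result)

grouped = asyncio.run(check_urls_async(URLS))
print(grouped.get(200, []))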