I have a list of 100 websites that I want to ping to see whether they are live. I want to record the status code returned for each one in a new field called 'status'. I have stored them in a dataframe and was hoping to use the apply function to parallelize the exercise, taking advantage of up to 8 cores on my laptop. At the moment it takes about 3m 30s, and I was naively hoping to get that down to less than 30 seconds. I've tried swifter but without success. I'd prefer some sort of apply function but am open to using the multiprocessing / multithreading modules. I am not a programmer, so this is really the limit of my abilities at the moment. I'd appreciate any ideas / advice.
import pandas as pd
import requests
from urllib.parse import urlparse
import urllib3
import swifter
#Load data to dataframe
#List of sites
siteList = [[1, 'https://www.facebook.com'], [2, 'https://www.instagram.com'], [3, 'https://www.mail.com'], [4, 'https://www.thegrumpyscarecrow.com/']]
df = pd.DataFrame(siteList, columns=['id','site'])
#functions
def getStatusCode(url):
    try:
        r = requests.head(url, verify=False, timeout=5)
        return r.status_code
    except requests.RequestException:
        return -1
#Run the script
df['status'] = df.swifter.allow_dask_on_strings(enable=True).apply(lambda x: getStatusCode(x['site']), axis=1, result_type='expand')
Instead of swifter, you can use ThreadPoolExecutor from concurrent.futures:
from concurrent.futures import ThreadPoolExecutor
import urllib3

urllib3.disable_warnings()  # silence the InsecureRequestWarning triggered by verify=False

def getStatusCode(url):
    try:
        r = requests.head(url, verify=False, timeout=5)
        status = r.status_code
    except requests.RequestException:  # covers connection errors, timeouts, etc.
        status = -1
    return status
with ThreadPoolExecutor() as executor:
    status = executor.map(getStatusCode, df['site'])
df['status'] = list(status)
Output:
>>> df
id site status
0 1 https://www.facebook.com 200
1 2 https://www.instagram.com 200
2 3 https://www.mail.com 200
3 4 https://www.thegrumpyscarecrow.com/ -1
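This is fast because the work is network-bound, not CPU-bound: threads spend almost all their time waiting on responses, so the useful thread count is limited by latency, not by your 8 cores. ThreadPoolExecutor's default max_workers may be conservative for hundreds of URLs, and raising it can cut the wall-clock time further. Here is a minimal sketch of the effect, using a hypothetical fake_request function with time.sleep standing in for network latency (no real requests are made):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_request(url):
    # Stand-in for requests.head: pretend each check takes ~0.1 s of waiting
    time.sleep(0.1)
    return 200

# Hypothetical list of 100 sites
urls = [f'https://example-{i}.com' for i in range(100)]

start = time.perf_counter()
# 32 threads -> 100 tasks run in ~4 "waves" of 0.1 s instead of 10 s serially
with ThreadPoolExecutor(max_workers=32) as executor:
    statuses = list(executor.map(fake_request, urls))
elapsed = time.perf_counter() - start

print(f'{len(statuses)} checks in {elapsed:.2f}s')
```

With real requests the speedup is bounded by the slowest responses and your 5-second timeout, so a handful of dead sites sets the floor; still, 100 sites in well under 30 seconds is realistic with 20-plus threads.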