Tags: pandas, parallel-processing, apply, ping, urllib3

Pandas - Parallelize the pinging of a list of websites


I have a list of 100 websites that I want to ping to check whether they are live. I want to record the status returned for each site in a new column called 'status'. I have stored the sites in a dataframe and was hoping to use the apply function to parallelize the work, taking advantage of up to 8 cores on my laptop. At the moment the run takes about 3m 30s, and I was naively hoping to get that down to under 30 seconds. I've tried swifter, but without success. I'd prefer some sort of apply function but am open to using the multiprocessing / multithreading modules. I am not a programmer, so this is really the limit of my abilities at the moment. I'd appreciate any ideas or advice.

import pandas as pd
import requests
from urllib.parse import urlparse
import urllib3
import swifter

#Load data to dataframe
#List of sites
siteList = [[1, 'https://www.facebook.com'], [2, 'https://www.instagram.com'], [3, 'https://www.mail.com'], [4, 'https://www.thegrumpyscarecrow.com/']]

df = pd.DataFrame(siteList, columns=['id','site'])

#functions
def getStatusCode(url):
    try:
        r = requests.head(url, verify=False, timeout=5)
        return r.status_code
    except requests.exceptions.RequestException:
        return -1

#Run the script
df['status'] = df.swifter.allow_dask_on_strings(enable=True).apply(lambda x: getStatusCode(x['site']), axis=1, result_type='expand')


Solution

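  • Pinging sites is I/O-bound: almost all of the time is spent waiting on the network, so threads help more than extra CPU cores. Instead of swifter, you can use ThreadPoolExecutor: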

    import requests
    from concurrent.futures import ThreadPoolExecutor
    from requests.exceptions import RequestException

    # verify=False raises an InsecureRequestWarning on every request; silence it
    requests.urllib3.disable_warnings()

    def getStatusCode(url):
        try:
            r = requests.head(url, verify=False, timeout=5)
            status = r.status_code
        except RequestException:
            # covers connection errors and timeouts alike
            status = -1
        return status

    # map() yields results in the same order as df['site']
    with ThreadPoolExecutor() as executor:
        status = executor.map(getStatusCode, df['site'])
    df['status'] = list(status)
    

    Output:

    >>> df
       id                                 site  status
    0   1             https://www.facebook.com     200
    1   2            https://www.instagram.com     200
    2   3                 https://www.mail.com     200
    3   4  https://www.thegrumpyscarecrow.com/      -1
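
If you want to try the thread-pool pattern without hitting the network, here is a minimal sketch using only the standard library. `fake_status`, `check_all`, the sleep times and `max_workers=8` are all made up for illustration; `fake_status` simply stands in for `getStatusCode`. It shows that `executor.map` returns results in input order even when the first request is the slowest, which is why the list can be assigned straight back to `df['status']`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://www.facebook.com",
    "https://www.instagram.com",
    "https://www.mail.com",
    "https://www.thegrumpyscarecrow.com/",
]


def fake_status(url):
    """Hypothetical stand-in for getStatusCode: sleeps instead of making a request."""
    # Make the first site the slowest to show that result order is still preserved.
    time.sleep(0.3 if "facebook" in url else 0.1)
    return -1 if "scarecrow" in url else 200


def check_all(urls, checker, max_workers=8):
    """Run checker over urls in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(checker, urls))


start = time.perf_counter()
statuses = check_all(urls, fake_status)
elapsed = time.perf_counter() - start

print(statuses)
print(f"elapsed: {elapsed:.2f}s (the sleeps add up to 0.60s if run one by one)")
```

With real requests the total time should be closer to the slowest batch of responses than to the sum of all of them, so for 100 sites it may be worth raising `max_workers` above the default. Treat the right number as something to experiment with rather than a fixed rule.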