python pandas loops apply swifter

How to implement this iterrows case using an apply function in pandas?


I have the following code for getting IP information:

import requests
import json
import pandas as pd
import swifter  

def get_ip(ip):
    response = requests.get ("http://ip-api.com/json/" + ip.rstrip())
    geo = response.json()
    location = {'lat': geo.get('lat', ''),
                'lon': geo.get('lon', ''),
                'region': geo.get('regionName', ''),
                'city': geo.get('city', ''),
                'org': geo.get('org', ''),
                'country': geo.get('countryCode', ''),
                'query': geo.get('query', '')
                }
    return(location)

To apply it to an entire DataFrame of IPs (df), I am using the following:

df=pd.DataFrame(['85.56.19.4','188.85.165.103','81.61.223.131'])

for lab,row in df.iterrows():
    dip = get_ip(df.iloc[lab][0])
    try:
        ip.append(dip["query"])
        private.append('no')
        country.append(dip["country"])
        city.append(dip["city"])
        region.append(dip["region"])
        organization.append(dip["org"])
        latitude.append(dip["lat"])
        longitude.append(dip["lon"])
    except:
        ip.append(df.iloc[lab][0])
        private.append("yes")

However, since iterrows is very slow and I need better performance, I want to use swiftapply, which is an extension of the apply function. I have tried this:

def ip(x):
    dip = get_ip(x)
    if (dip['ip']=='private')==True:
        ip.append(x)
        private.append("yes")
    else:
        ip.append(dip["ip"])
        private.append('no')
        country.append(dip["country"])
        city.append(dip["city"])
        region.append(dip["region"])
        organization.append(dip["org"])
        latitude.append(dip["lat"])
        longitude.append(dip["lon"])

df.swifter.apply(ip)

And I get the following error: AttributeError: ("'Series' object has no attribute 'rstrip'", 'occurred at index 0')

How could I fix it?


Solution

  • rstrip is a string method. To apply a string method to a pandas Series, you first have to go through the Series' .str accessor, which performs vectorized string operations on the whole Series.

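    The error appears because apply on a DataFrame passes each column to your function as a Series, so ip inside get_ip is a Series rather than a single string. Specifically, changing ip.rstrip() to ip.str.rstrip() in your code should resolve the AttributeError, although requests.get would then be handed a Series of URLs instead of one URL string.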
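
    As a rough illustration of the difference, the sketch below uses plain pandas apply as a stand-in for swifter (an assumption: swifter's apply is taken to pass the same objects to the function). describe is a hypothetical helper that only reports the type it receives, and the trailing whitespace in the sample IPs is added for the demonstration.

```python
import pandas as pd

# Sample IPs; the trailing space and newline are added only for illustration.
df = pd.DataFrame(['85.56.19.4 ', '188.85.165.103', '81.61.223.131\n'])


def describe(x):
    """Hypothetical helper: report the type name of whatever apply passes in."""
    return type(x).__name__


# DataFrame.apply passes each whole column to the function as a Series.
col_types = df.apply(describe)

# Series.apply passes one element at a time, so plain str methods work there.
elem_types = df[0].apply(describe)

# The .str accessor applies rstrip to every element of the Series at once.
stripped = df[0].str.rstrip()

print(col_types.iloc[0])
print(elem_types.tolist())
print(stripped.tolist())
```

    Selecting the column first (df[0]) is therefore one way to make a function written for a single string usable with apply.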

    After digging around a little, it turns out that the requests.get call you're trying to perform cannot be vectorized within pandas (see Using Python Requests for several URLS in a dataframe). I hacked up the following, which should be a little more efficient than using iterrows. It uses np.vectorize to run get_ip once for each IP address; note that np.vectorize is essentially a convenience wrapper around a Python loop, so the time spent waiting on the HTTP requests will still dominate. The location output is saved as new columns in a new DataFrame.

    First, I changed the last line of your get_ip function to return location instead of return(location). The two behave identically; the parentheses are simply unnecessary.

    Next, I created a vectorization function using np.vectorize:

    import numpy as np
    vec_func = np.vectorize(get_ip)
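
    To see what np.vectorize does here without touching the network, here is a minimal sketch with a hypothetical fake_get_ip that stands in for get_ip and records every call. Passing otypes=[object] is my addition, not part of the code above: it tells NumPy the output type up front, so it does not need an extra probing call on the first element (which, with the real get_ip, would mean one extra HTTP request).

```python
import numpy as np

calls = []  # records every IP the stand-in function is called with


def fake_get_ip(ip):
    """Hypothetical stand-in for get_ip: no HTTP request, fixed fake data."""
    calls.append(ip)
    return {'query': ip.rstrip(), 'country': 'ES'}


# otypes=[object] declares that each result is an arbitrary Python object
# (here a dict), so NumPy does not have to infer the output type.
vec_func = np.vectorize(fake_get_ip, otypes=[object])

result = vec_func(['85.56.19.4', '188.85.165.103 ', '81.61.223.131'])

print(len(calls))        # one call per input element
print(result.dtype)      # object array holding the dicts
print(result[1])
```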
    

    Finally, vec_func is applied to df[0], the column holding your IP addresses, and the resulting location dictionaries are expanded into columns and concatenated with df to create a new DataFrame:

    new_df = pd.concat([df, pd.DataFrame(vec_func(df[0]), columns=["response"])["response"].apply(pd.Series)], axis=1)
    

    The code above retrieves the API response in the form of a dictionary for each row in your DataFrame then maps the dictionary to columns in the DataFrame. In the end your new DataFrame would look like this:

                    0      lat     lon     region      city             org country           query
    0      85.56.19.4  37.3824 -5.9761  Andalusia   Seville   Orange Espana      ES      85.56.19.4
    1  188.85.165.103  41.6561 -0.8773     Aragon  Zaragoza  Vodafone Spain      ES  188.85.165.103
    2   81.61.223.131  40.3272 -3.7635     Madrid   Leganés    Vodafone Ono      ES   81.61.223.131
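
    The same dictionary-to-columns step can also be sketched without np.vectorize, using Series.apply and building a DataFrame directly from the list of dictionaries. This is an alternative I am suggesting, not the code above; fake_get_ip is a hypothetical stand-in for get_ip and returns made-up values so the example runs offline.

```python
import pandas as pd


def fake_get_ip(ip):
    """Hypothetical stand-in for get_ip: returns made-up data, no HTTP call."""
    return {'lat': 0.0, 'lon': 0.0, 'country': 'ES', 'query': ip.rstrip()}


df = pd.DataFrame(['85.56.19.4', '188.85.165.103', '81.61.223.131'])

# Series.apply calls the function once per element and yields a Series of dicts.
records = df[0].apply(fake_get_ip)

# A list of dicts becomes one row per dict, one column per key.
details = pd.DataFrame(records.tolist(), index=df.index)

# Put the original column and the expanded columns side by side.
new_df = pd.concat([df, details], axis=1)
print(new_df)
```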
    

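    Hopefully this resolves the AttributeError, as well as the InvalidSchema error that requests raises when it is handed a Series instead of a single URL string, and gets you a little better performance than iterrows().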
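
    Since the time is mostly spent waiting on HTTP responses, another option worth trying is to issue the requests concurrently with threads. The sketch below is only an illustration under assumptions: fake_get_ip is a hypothetical stand-in for get_ip, max_workers=4 is an arbitrary choice, and you would need to check that the service's usage limits allow parallel requests.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fake_get_ip(ip):
    """Hypothetical stand-in for get_ip: returns made-up data, no HTTP call."""
    return {'country': 'ES', 'query': ip.rstrip()}


ips = ['85.56.19.4', '188.85.165.103', '81.61.223.131']

# pool.map runs the lookups in worker threads and returns the results
# in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(fake_get_ip, ips))

# One row per lookup, one column per dictionary key.
new_df = pd.DataFrame(records)
print(new_df)
```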