I have the following code for getting IP information:
import requests
import json
import pandas as pd
import swifter

def get_ip(ip):
    response = requests.get("http://ip-api.com/json/" + ip.rstrip())
    geo = response.json()
    location = {'lat': geo.get('lat', ''),
                'lon': geo.get('lon', ''),
                'region': geo.get('regionName', ''),
                'city': geo.get('city', ''),
                'org': geo.get('org', ''),
                'country': geo.get('countryCode', ''),
                'query': geo.get('query', '')
                }
    return(location)
To apply it to an entire DataFrame of IPs (df), I am using the following:
df = pd.DataFrame(['85.56.19.4', '188.85.165.103', '81.61.223.131'])

# ip, private, country, city, region, organization, latitude, longitude are lists defined beforehand
for lab, row in df.iterrows():
    dip = get_ip(df.iloc[lab][0])
    try:
        ip.append(dip["query"])
        private.append('no')
        country.append(dip["country"])
        city.append(dip["city"])
        region.append(dip["region"])
        organization.append(dip["org"])
        latitude.append(dip["lat"])
        longitude.append(dip["lon"])
    except:
        ip.append(df.iloc[lab][0])
        private.append("yes")
However, since iterrows is very slow and I need more performance, I want to use swifter.apply, which is an extension of the pandas apply function. I have tried this:
def ip(x):
    dip = get_ip(x)
    if (dip['ip'] == 'private') == True:
        ip.append(x)
        private.append("yes")
    else:
        ip.append(dip["ip"])
        private.append('no')
        country.append(dip["country"])
        city.append(dip["city"])
        region.append(dip["region"])
        organization.append(dip["org"])
        latitude.append(dip["lat"])
        longitude.append(dip["lon"])

df.swifter.apply(ip)
And I get the following error: AttributeError: ("'Series' object has no attribute 'rstrip'", 'occurred at index 0')
How could I fix it?
rstrip is a string operation. To apply a string operation to a pandas Series, you first have to go through the str accessor, which allows vectorized string operations to be performed on the Series. Specifically, changing ip.rstrip() to ip.str.rstrip() in your code should resolve the AttributeError.
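As a quick illustration, here is a minimal standalone sketch of the difference (using a throwaway Series, not your df):

s = pd.Series([' 85.56.19.4 ', ' 188.85.165.103 '])
# s.rstrip()              -> AttributeError: 'Series' object has no attribute 'rstrip'
cleaned = s.str.rstrip()  # vectorized: strips trailing whitespace from every element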
After digging around a little, it turns out the requests.get operation you're trying to perform cannot be vectorized within pandas (see Using Python Requests for several URLS in a dataframe). I hacked up the following, which should be a little more efficient than using iterrows. It uses np.vectorize to run the lookup function on each IP address, and the location output is saved as new columns in a new DataFrame.
First, I altered your get_ip function to end with return location rather than return(location); either way it returns the location dictionary, the parentheses are just redundant.
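For reference, the adjusted function is your original with only the return statement changed:

def get_ip(ip):
    # look up a single IP address (a plain string) against ip-api.com
    response = requests.get("http://ip-api.com/json/" + ip.rstrip())
    geo = response.json()
    location = {'lat': geo.get('lat', ''),
                'lon': geo.get('lon', ''),
                'region': geo.get('regionName', ''),
                'city': geo.get('city', ''),
                'org': geo.get('org', ''),
                'country': geo.get('countryCode', ''),
                'query': geo.get('query', '')}
    return location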
Next, I created a vectorized wrapper using np.vectorize (note that this needs import numpy as np):

vec_func = np.vectorize(lambda url: get_ip(url))
Finally, vec_func is applied to df to create a new DataFrame that merges df with the location output from vec_func, where df[0] is the column holding your IP addresses:

new_df = pd.concat([df, pd.DataFrame(vec_func(df[0]), columns=["response"])["response"].apply(pd.Series)], axis=1)
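If the one-liner is hard to follow, it is roughly equivalent to this step-by-step version (same operations, just with intermediate names):

responses = vec_func(df[0])                        # object array of location dictionaries, one per IP
resp_df = pd.DataFrame(responses, columns=["response"])
expanded = resp_df["response"].apply(pd.Series)    # one column per dictionary key (lat, lon, region, ...)
new_df = pd.concat([df, expanded], axis=1)         # original IP column plus the location columns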
The code above retrieves the API response as a dictionary for each row in your DataFrame, then maps each dictionary to columns in the DataFrame. In the end, your new DataFrame would look like this:
                0      lat      lon     region      city             org  country           query
0      85.56.19.4  37.3824  -5.9761  Andalusia   Seville   Orange Espana       ES      85.56.19.4
1  188.85.165.103  41.6561  -0.8773     Aragon  Zaragoza  Vodafone Spain       ES  188.85.165.103
2   81.61.223.131  40.3272  -3.7635     Madrid   Leganés    Vodafone Ono       ES   81.61.223.131
Hopefully this resolves the AttributeError and gets you a little better performance than iterrows().