I have a df that contains data including geocoordinates and postcodes.
lat | lon | postcode
54.3077 | 12.7 | 18314
51.898 | 9.26 | 32676
I need a new colum with the NUTS2 region, so in this case the resulting df should look something like this:
lat | lon | postcode | NUTS_ID
54.3077 | 12.7 | 18314 | DE80
51.898 | 9.26 | 32676 | DEA4
I found this package: https://github.com/vis4/pyshpgeocode which I managed to run. My first approach are the following two functions:
def get_gc(nuts='geoc\\shapes\\nuts2\\nuts2.shp'):
"""
nuts -> path to nuts file
"""
gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
return gc
def add_nuts_to_df(df):
"""
df must have lon/lat geocoodinates in it
This function will add a column ['NUTS_ID'] with the corresponding
NUTS region
"""
start_time = time.time()
for idx, row in df.iterrows():
df.loc[idx, 'NUTS_ID'] = get_gc().geocode(row.lat,
row.lon,
filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
print('Done with index {}\nTime since start: {}s'.format(idx,
round(time.time() - start_time, 0 )))
return df
And this does work! However, it takes ~ 0.6s for one entry and some of my df have more than a million entries. Since my original dataframes usually contain postcodes I was thinking about aggregating them using a combination of groupby / apply / transform?
Or is there any other (more efficient) way of doing this?
I am very grateful for any help and look forward to receiving replies.
If I understand your code correctly you are re-creating the gc
object for every single request from the same input file. I don't understand why.
One possibility therefore could be to do the following:
def add_nuts_to_df(df):
"""
df must have lon/lat geocoodinates in it
This function will add a column ['NUTS_ID'] with the corresponding
NUTS region
"""
nuts='geoc\\shapes\\nuts2\\nuts2.shp'
gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
start_time = time.time()
for idx, row in df.iterrows():
df.loc[idx, 'NUTS_ID'] = gc.geocode(row.lat,
row.lon,
filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
print('Done with index {}\nTime since start: {}s'.format(idx,
round(time.time() - start_time, 0 )))
return df
Maybe it would speed up the process even more if you try to use the df.apply()
method and pass your geocode logic in a function there.
Something like:
nuts='geoc\\shapes\\nuts2\\nuts2.shp'
gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
def get_nuts_id(row):
return gc.geocode(row.lat, row.lon,
filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
df["NUTS_ID"] = df.apply(get_nuts_id,axis=1)
I didn't try this out though so beware of typos.