pythonpandasgeopandas

Fetching NUTS2 region from geocoordinates and/or postalcode


I have a df that contains data including geocoordinates and postcodes.

lat     | lon  | postcode
54.3077 | 12.7 | 18314   
51.898  | 9.26 | 32676

I need a new colum with the NUTS2 region, so in this case the resulting df should look something like this:

lat     | lon  | postcode | NUTS_ID
54.3077 | 12.7 | 18314    | DE80
51.898  | 9.26 | 32676    | DEA4

I found this package: https://github.com/vis4/pyshpgeocode which I managed to run. My first approach are the following two functions:

def get_gc(nuts='geoc\\shapes\\nuts2\\nuts2.shp'):
    """
    nuts -> path to nuts file
    """
    gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
    return gc
def add_nuts_to_df(df):
    """
    df must have lon/lat geocoodinates in it
    This function will add a column ['NUTS_ID'] with the corresponding
    NUTS region
    """
    start_time = time.time()
    for idx, row in df.iterrows():
        df.loc[idx, 'NUTS_ID'] = get_gc().geocode(row.lat,
                                                  row.lon,
              filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
        print('Done with index {}\nTime since start: {}s'.format(idx,
              round(time.time() - start_time, 0 )))
    return df

And this does work! However, it takes ~ 0.6s for one entry and some of my df have more than a million entries. Since my original dataframes usually contain postcodes I was thinking about aggregating them using a combination of groupby / apply / transform?

Or is there any other (more efficient) way of doing this?

I am very grateful for any help and look forward to receiving replies.


Solution

  • If I understand your code correctly you are re-creating the gc object for every single request from the same input file. I don't understand why.

    One possibility therefore could be to do the following:

    def add_nuts_to_df(df):
        """
        df must have lon/lat geocoodinates in it
        This function will add a column ['NUTS_ID'] with the corresponding
        NUTS region
        """
        nuts='geoc\\shapes\\nuts2\\nuts2.shp'
        gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
    
        start_time = time.time()
        for idx, row in df.iterrows():
            df.loc[idx, 'NUTS_ID'] = gc.geocode(row.lat,
                                                      row.lon,
                  filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
            print('Done with index {}\nTime since start: {}s'.format(idx,
                  round(time.time() - start_time, 0 )))
        return df
    

    Maybe it would speed up the process even more if you try to use the df.apply() method and pass your geocode logic in a function there.

    Something like:

    nuts='geoc\\shapes\\nuts2\\nuts2.shp'
    gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
    
    def get_nuts_id(row):
        return gc.geocode(row.lat, row.lon, 
                          filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
    
    df["NUTS_ID"] = df.apply(get_nuts_id,axis=1)
    
    

    I didn't try this out though so beware of typos.