Tags: python, geocoding, latitude-longitude, geopy, zipcode

Use Python to convert 400,000 latitudes & longitudes to zip codes


I have 400,000 cases with latitudes and longitudes. I want to convert these to zip codes. The code below works...

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent='my-application')

def get_zipcode(row, geolocator, lat_field, lon_field):
    # reverse-geocode a single row and pull the postcode out of the raw response
    location = geolocator.reverse((row[lat_field], row[lon_field]))
    if location is None:
        return None
    return location.raw.get('address', {}).get('postcode')

But it only works on smaller batches, and it takes a while: about 15 minutes for 2,000 cases.

dfbatch1['pickup_zip'] = dfbatch1.apply(get_zipcode, axis=1, geolocator=geolocator, lat_field='pickup_latitude', lon_field='pickup_longitude')

What would be the best way to convert all of my latitudes & longitudes to zip codes?

Thanks!


Solution

  • Warning: not a GIS expert here!

    It seems like this would be pretty easy using geopandas and a source of zip code polygons (noting, of course, that zip codes are not, in fact, polygons):

    For example, if I have a point data source with (lat, lon) pairs in a file points.geojson, I could do something like this:

    import geopandas
    
    points = geopandas.read_file('points.geojson')
    zipcodes = geopandas.read_file('zip_poly.gdb')
    zip_points = points.sjoin(zipcodes, how='left')
    

    The default behavior of sjoin is to perform an intersects query, which is what we want.
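
    If you'd rather spell the relationship out (or use a different one), you can pass the predicate explicitly; a minimal sketch, assuming a reasonably recent geopandas (older releases used the op keyword instead of predicate):

    zip_points = points.sjoin(zipcodes, how='left', predicate='intersects')
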

    That gives me a geodataframe that maps coordinates (in the .geometry attribute) to zip codes (in the .ZIP_CODE attribute). I used this source for zip code data.

    For example, given a point:

    >>> points.query('NAME == "Boston"').geometry
    1436    POINT (-71.05671 42.35959)
    Name: geometry, dtype: geometry
    

    I now know its zip code:

    >>> zip_points.query('NAME=="Boston"').ZIP_CODE
    1436    02109
    Name: ZIP_CODE, dtype: object
    

    I tested this using a data source with about 4,000 points; I don't have anything approaching your 400,000-point data source handy.
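
    Since your coordinates live in a plain DataFrame rather than a GeoJSON file, here is a rough sketch of the same idea adapted to your case. It assumes the pickup_latitude / pickup_longitude column names from your question, the zip_poly.gdb source above with its ZIP_CODE attribute, and a hypothetical pickups.csv input file:

    import geopandas
    import pandas as pd

    df = pd.read_csv('pickups.csv')  # hypothetical file holding your 400,000 rows

    # build point geometries from the lon/lat columns (note the order: x=lon, y=lat)
    points = geopandas.GeoDataFrame(
        df,
        geometry=geopandas.points_from_xy(df['pickup_longitude'], df['pickup_latitude']),
        crs='EPSG:4326',
    )

    # load the zip polygons and reproject them to match the points before joining
    zipcodes = geopandas.read_file('zip_poly.gdb').to_crs(points.crs)

    joined = points.sjoin(zipcodes, how='left')
    # a point sitting exactly on a boundary can match more than one polygon,
    # so keep the first match per original row
    df['pickup_zip'] = joined.groupby(joined.index)['ZIP_CODE'].first()

    Because this is one vectorized spatial join rather than one web request per row, it should get through 400,000 points far faster than reverse geocoding each one.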