python, dataframe, pyspark, user-defined-functions, osmnx

Scaling the OSMnx library's 'nearest_edges' function on a huge Spark dataset


I am trying to scale the distance lookup returned by the 'nearest_edges' function (from the OSMnx library) over a huge Spark dataset, using the lat and long columns as the inputs for building my multidigraph. It takes forever to run and sometimes returns null. Is there any other solution? I created a user-defined function (code below) so I can apply it to the dataset's lat/long columns; a rough usage example follows the UDF.

My code below:

import numpy as np
import osmnx as ox
from pyspark.sql import types as T
from pyspark.sql.functions import udf
from shapely.geometry import Point

@udf(returnType=T.DoubleType())
def get_distance_to_road(lat_dd=None, long_dd=None, dist_bbox=None):
    try:
        location = (lat_dd, long_dd)

        # build a street network around the point
        G = ox.graph_from_point(
            center_point=location,
            dist=dist_bbox,  # meters
            simplify=True,
            retain_all=True,
            truncate_by_edge=True,
            network_type='all',
        )

        # project the graph and the point to the same CRS
        Gp = ox.project_graph(G)
        point_geom_proj, crs = ox.projection.project_geometry(
            Point(long_dd, lat_dd),  # shapely points are (x, y) = (lon, lat)
            to_crs=Gp.graph['crs'],
        )

        # snap the point to its nearest edge and keep the distance
        distance = np.round(
            ox.nearest_edges(Gp, point_geom_proj.x, point_geom_proj.y, return_dist=True)[1],
            2,
        ).item()
    except Exception:
        # any failure (e.g. no network found within dist_bbox) yields a null distance
        distance = None
    return distance  # meters
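
I then apply it to the DataFrame roughly like this (the DataFrame name, column names, and 1000 m search radius here are just illustrative):

from pyspark.sql import functions as F

df = df.withColumn(
    "distance_to_road",
    get_distance_to_road(F.col("lat_dd"), F.col("long_dd"), F.lit(1000)),
)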

Solution

  • The nearest_edges function is fast and scalable. Rather, your problem here is everything else you're doing each time you call nearest_edges. First off, you always want to run it vectorized rather than in a loop. That is, if you have many points to snap to their nearest edges, pass them all at once as numpy arrays to the nearest_edges function for a vectorized, spatially indexed lookup:

    import osmnx as ox
    
    # get projected graph and randomly sample some points to find nearest edges to
    G = ox.graph.graph_from_place("Piedmont, CA, USA", network_type="drive")
    Gp = ox.projection.project_graph(G)
    points = ox.utils_geo.sample_points(ox.convert.to_undirected(Gp), n=1000000)
    
    %%time
    ne, dist = ox.distance.nearest_edges(Gp, X=points.x, Y=points.y, return_dist=True)
    # wall time = 8.3 seconds
    

    Here, the nearest_edges search matched 1 million points to their nearest edges in about 8 seconds. If you instead put this all into a loop (which with each iteration builds a graph, projects the graph and point, then finds the nearest edge to that one point), matching these million points will take approximately forever. This isn't because nearest_edges is slow... it's because everything else in the loop is (relatively) slow.

    Your basic options are:

    1. Vectorize everything as demonstrated above.
    2. If you must build separate graphs (say, because you're modeling completely different cities or countries), reduce the number of graphs you build by batching nearby points and searching them within a single graph.
    3. Use multiprocessing to parallelize (a sketch combining options 2 and 3 on Spark follows this list).
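
    Since your points live in a Spark DataFrame, one way to combine options 2 and 3 is a minimal sketch like the following: bin points into coarse spatial cells, build and project one graph per cell, and run the vectorized nearest_edges search per batch with applyInPandas. This is only a sketch under assumptions: the column names (lat_dd, long_dd), the 0.05-degree cell size, and the 5000 m graph radius are taken from the question or are illustrative, the radius must be sized so each graph actually covers its cell, and df is assumed to contain only the two coordinate columns.

    import geopandas as gpd
    import numpy as np
    import osmnx as ox
    import pandas as pd
    from pyspark.sql import functions as F

    def snap_batch(pdf: pd.DataFrame) -> pd.DataFrame:
        # one graph per spatial cell, centered on the cell's points
        center = (pdf["lat_dd"].mean(), pdf["long_dd"].mean())
        G = ox.graph_from_point(
            center_point=center,
            dist=5000,  # meters; illustrative, must cover the whole cell
            network_type="all",
            simplify=True,
            retain_all=True,
            truncate_by_edge=True,
        )
        Gp = ox.projection.project_graph(G)

        # project all of the cell's points at once
        pts = gpd.GeoSeries(
            gpd.points_from_xy(pdf["long_dd"], pdf["lat_dd"]), crs="EPSG:4326"
        ).to_crs(Gp.graph["crs"])

        # vectorized nearest-edge search for the whole batch
        _, dist = ox.distance.nearest_edges(Gp, X=pts.x, Y=pts.y, return_dist=True)
        pdf["distance_to_road"] = np.round(dist, 2)
        return pdf

    schema = "lat_dd double, long_dd double, cell_x long, cell_y long, distance_to_road double"
    result = (
        df.withColumn("cell_x", F.floor(F.col("long_dd") / 0.05))
          .withColumn("cell_y", F.floor(F.col("lat_dd") / 0.05))
          .groupBy("cell_x", "cell_y")
          .applyInPandas(snap_batch, schema=schema)
    )

    Each Spark task then pays the graph-download and projection cost once per cell instead of once per row, while the nearest_edges call inside the task stays vectorized.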