I am trying to scale the distance value returned from the nearest_edges function (from the OSMnx library) on a huge dataset, using the lat and long columns as inputs to build my MultiDiGraph. It takes forever to run and sometimes returns null. Is there any other solution? I created a user-defined function (code below) so I can apply it to the dataset using its long/lat columns.
My code below:
import numpy as np
import osmnx as ox
from pyspark.sql import types as T
from pyspark.sql.functions import udf
from shapely.geometry import Point

@udf(returnType=T.DoubleType())
def get_distance_to_road(lat_dd=None, long_dd=None, dist_bbox=None):
    try:
        location = (lat_dd, long_dd)
        G = ox.graph_from_point(
            center_point=location,
            dist=dist_bbox,  # meters
            simplify=True,
            retain_all=True,
            truncate_by_edge=True,
            network_type='all',
        )
        Gp = ox.project_graph(G)
        point_geom_proj, crs = ox.projection.project_geometry(
            Point(reversed(location)), to_crs=Gp.graph['crs']
        )
        distance = np.round(
            ox.nearest_edges(Gp, point_geom_proj.x, point_geom_proj.y, return_dist=True)[1],
            2,
        ).item()
    except Exception:
        distance = None
    return distance  # meters
The nearest_edges function is fast and scalable. Rather, your problem here is everything else you're doing each time you call nearest_edges.
First off, you always want to run it vectorized rather than in a loop. That is, if you have many points to snap to their nearest edges, pass them all at once as numpy arrays to the nearest_edges function for a vectorized, spatially indexed lookup:
import osmnx as ox
# get projected graph and randomly sample some points to find nearest edges to
G = ox.graph.graph_from_place("Piedmont, CA, USA", network_type="drive")
Gp = ox.projection.project_graph(G)
points = ox.utils_geo.sample_points(ox.convert.to_undirected(Gp), n=1000000)
%%time
ne, dist = ox.distance.nearest_edges(Gp, X=points.x, Y=points.y, return_dist=True)
# wall time = 8.3 seconds
Here, the nearest_edges search matched 1 million points to their nearest edges in about 8 seconds. If you instead put this all into a loop (which, with each iteration, builds a graph, projects the graph and point, then finds the nearest edge to that one point), matching these million points will take approximately forever. This isn't because nearest_edges is slow... it's because everything else in the loop is (relatively) slow.
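The same vectorization applies to your per-point project_geometry call: project all your lat/long columns in one shot, then hand the resulting arrays to nearest_edges. A minimal sketch using pyproj (an OSMnx dependency), with made-up sample coordinates and UTM zone 10N standing in for whatever Gp.graph['crs'] is for your projected graph:

```python
import numpy as np
from pyproj import Transformer

# hypothetical example data: all your rows' coordinates at once
lats = np.array([37.820, 37.824, 37.827])
lons = np.array([-122.232, -122.230, -122.226])

# one transformer from WGS84 lat/long to the graph's projected CRS
# (EPSG:32610 here is a stand-in for Gp.graph["crs"])
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32610", always_xy=True)

# vectorized projection of every point in a single call
xs, ys = transformer.transform(lons, lats)

# then a single vectorized lookup instead of a loop, e.g.:
# ne, dist = ox.distance.nearest_edges(Gp, X=xs, Y=ys, return_dist=True)
```

This assumes all your points fall within one graph's area, so you build and project the graph once rather than once per row.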
Your basic options are: