python pandas dataframe euclidean-distance pandas-merge

Find the closest location between two dataframes based on distance


I have two dataframes: the first has the longitude and latitude of the meters, and the second has the longitude and latitude of the stations.

I am trying to link them by closest match.

Here is my example:

import pandas as pd

df_id = pd.DataFrame({
    'id': [1, 2],
    'lat': [32, 55],
    'long': [-89, -8]
})

df_station = pd.DataFrame({
    'id': [10, 20],
    'lat': [33, 56],
    'long': [-88.23, -7]
})

and here is the expected output:

[image: expected output]


Solution

  • Pandas' cross merge (how='cross', available since pandas 1.2) pairs every id with every station. Once the two frames are merged, we can compute the Euclidean distance for each lat-long pair and keep the closest station for each id.

    # cross merge the 2 dfs to pair all ids with stations
    df_merged = df_id.merge(df_station.add_suffix('_station'), how='cross')
    # find euclidean distance between all pairs of locations
    df_merged['distance'] = ((df_merged.lat - df_merged.lat_station)**2 + (df_merged.long - df_merged.long_station)**2).pow(0.5)
    # filter the closest station for each id
    df = df_merged.loc[df_merged.groupby('id')['distance'].idxmin(), ['id', 'id_station']]
    

    The final filtering step can also be done with sort_values + drop_duplicates. Once the rows are sorted by id and then by distance, the first row for each id is its closest station, and drop_duplicates('id') keeps only that first row. The whole thing can then be written as a single chained expression.

    df = (
        df_id
        .merge(df_station.add_suffix('_station'), how='cross')
        .eval('distance = ((lat - lat_station)**2 + (long - long_station)**2)**0.5')
        .sort_values(['id', 'distance'])
        .drop_duplicates('id')
        [['id', 'id_station']]
    )
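
As a quick sanity check, the sketch below rebuilds the question's two frames, runs both variants, and confirms that they pick the same station for every id. The names pairs, by_idxmin, by_sort, same and matches are illustrative and not part of the original answer; how='cross' assumes pandas 1.2 or newer.

```python
import pandas as pd

df_id = pd.DataFrame({'id': [1, 2], 'lat': [32, 55], 'long': [-89, -8]})
df_station = pd.DataFrame({'id': [10, 20], 'lat': [33, 56], 'long': [-88.23, -7]})

# pair every id with every station and compute the Euclidean distance (in degrees)
pairs = df_id.merge(df_station.add_suffix('_station'), how='cross')
pairs['distance'] = ((pairs['lat'] - pairs['lat_station'])**2
                     + (pairs['long'] - pairs['long_station'])**2) ** 0.5

# variant 1: look up the row label of the minimum distance within each id
by_idxmin = pairs.loc[pairs.groupby('id')['distance'].idxmin(), ['id', 'id_station']]

# variant 2: sort so the closest station comes first, then keep the first row per id
by_sort = (
    pairs
    .sort_values(['id', 'distance'])
    .drop_duplicates('id')
    [['id', 'id_station']]
)

# compare the selected pairs, ignoring row labels
same = by_idxmin.reset_index(drop=True).equals(by_sort.reset_index(drop=True))

# plain-Python view of the result: {id: id_station}
matches = dict(zip(by_idxmin['id'].tolist(), by_idxmin['id_station'].tolist()))
print(same, matches)
```

With the question's data, each id has exactly one nearest station, so the two variants agree. If two stations were exactly equidistant from an id, both variants would keep a single row, but which of the tied stations survives is not something this sketch guarantees.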
    

    Both of the above produce the expected output, matching id 1 to station 10 and id 2 to station 20:

    [image: expected output]
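
One caveat on the distance itself: Euclidean distance on raw latitude and longitude treats degrees as flat coordinates, which only approximates distance on the Earth's surface, because a degree of longitude covers less ground the farther you are from the equator. If real-world distance matters, one option is to swap the haversine formula into the same cross-merge pattern. The sketch below is a variant under stated assumptions, not part of the original answer: the helper haversine_km, the column name distance_km, and the mean Earth radius of 6371 km are all choices made here for illustration.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees (hypothetical helper)."""
    # convert all inputs from degrees to radians as float arrays
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(x, dtype=float)) for x in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

df_id = pd.DataFrame({'id': [1, 2], 'lat': [32, 55], 'long': [-89, -8]})
df_station = pd.DataFrame({'id': [10, 20], 'lat': [33, 56], 'long': [-88.23, -7]})

# same cross-merge pattern as above, with the distance column swapped out
pairs = df_id.merge(df_station.add_suffix('_station'), how='cross')
pairs['distance_km'] = haversine_km(
    pairs['lat'], pairs['long'], pairs['lat_station'], pairs['long_station']
)

# keep the closest station for each id
closest = pairs.loc[
    pairs.groupby('id')['distance_km'].idxmin(),
    ['id', 'id_station', 'distance_km']
]
print(closest)
```

For the sample data, the choice of metric does not change which station is selected, since each id has one station nearby and one very far away. The difference is more likely to matter when candidate stations are at similar distances or the points sit at high latitudes.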