I have two dataframes: the first has the longitude and latitude of the meters, and the second has the longitude and latitude of the stations.
I am trying to link each meter to its closest station.
Here is my example:
import pandas as pd

df_id = pd.DataFrame({
    'id': [1, 2],
    'lat': [32, 55],
    'long': [-89, -8]
})
df_station = pd.DataFrame({
    'id': [10, 20],
    'lat': [33, 56],
    'long': [-88.23, -7]
})
and here is the expected output:
Pandas' cross merge pairs every id with every station. Once the two frames are merged, we can compute the Euclidean distance for each lat-long pair and keep the closest station for each id.
# cross merge the 2 dfs to pair all ids with stations
df_merged = df_id.merge(df_station.add_suffix('_station'), how='cross')
# find euclidean distance between all pairs of locations
df_merged['distance'] = ((df_merged.lat - df_merged.lat_station)**2 + (df_merged.long - df_merged.long_station)**2).pow(0.5)
# filter the closest station for each id
df = df_merged.loc[df_merged.groupby('id')['distance'].idxmin(), ['id', 'id_station']]
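Euclidean distance on raw degrees treats latitude and longitude as planar coordinates, which is only an approximation and gets worse far from the equator. If that matters for your data, one option is to swap in a great-circle (haversine) distance. The sketch below is a minimal variant of the steps above under that assumption; the helper name haversine_km, the distance_km column, and the mean Earth radius of 6371 km are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees (hypothetical helper)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

df_id = pd.DataFrame({'id': [1, 2], 'lat': [32, 55], 'long': [-89, -8]})
df_station = pd.DataFrame({'id': [10, 20], 'lat': [33, 56], 'long': [-88.23, -7]})

# same cross merge as above, pairing every id with every station
df_merged = df_id.merge(df_station.add_suffix('_station'), how='cross')

# haversine distance instead of Euclidean distance on degrees
df_merged['distance_km'] = haversine_km(
    df_merged['lat'], df_merged['long'],
    df_merged['lat_station'], df_merged['long_station'],
)

# keep the closest station for each id, exactly as before
closest = df_merged.loc[df_merged.groupby('id')['distance_km'].idxmin(), ['id', 'id_station']]
```

For the small example here the pairing is the same either way; the difference only shows up when candidate stations are at similar distances or at high latitudes.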
The final filtering step can also be done with sort_values + drop_duplicates. The idea is that once we sort by id and distance, the first row for each id has the smallest distance, so drop_duplicates('id') keeps exactly the closest station. The whole thing can then be written as a single chained expression.
df = (
    df_id
    .merge(df_station.add_suffix('_station'), how='cross')
    .eval('distance = ((lat - lat_station)**2 + (long - long_station)**2)**0.5')
    .sort_values(['id', 'distance'])
    .drop_duplicates('id')
    [['id', 'id_station']]
)
Both of the above produce the expected output: id 1 is paired with station 10 and id 2 with station 20.
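A quick way to convince yourself that the two filtering strategies agree is to run both on the same merged frame and compare the results. This is only a sanity-check sketch on the sample data; the names pairs, via_idxmin, and via_sort are made up for the illustration, and the index is reset so the comparison looks at values only.

```python
import pandas as pd

df_id = pd.DataFrame({'id': [1, 2], 'lat': [32, 55], 'long': [-89, -8]})
df_station = pd.DataFrame({'id': [10, 20], 'lat': [33, 56], 'long': [-88.23, -7]})

# pair every id with every station and compute the Euclidean distance
pairs = df_id.merge(df_station.add_suffix('_station'), how='cross')
pairs['distance'] = (
    (pairs['lat'] - pairs['lat_station']) ** 2
    + (pairs['long'] - pairs['long_station']) ** 2
).pow(0.5)

# strategy 1: row label of the minimum distance within each id
via_idxmin = (
    pairs.loc[pairs.groupby('id')['distance'].idxmin(), ['id', 'id_station']]
    .reset_index(drop=True)
)

# strategy 2: sort by distance within id, keep the first row per id
via_sort = (
    pairs.sort_values(['id', 'distance'])
    .drop_duplicates('id')[['id', 'id_station']]
    .reset_index(drop=True)
)

print(via_idxmin.equals(via_sort))
print(via_sort)
```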