I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B'],
    'x': [1, 1, 2, 2, 3],
    'y': [1, 2, 2, 3, 3]
})
The output I want is the average pairwise distance between the points in each group. In this example:
group A: (distance((1,1),(1,2))) /1 = 1
group B: (distance((2,2),(2,3)) + distance((2,3),(3,3)) + distance((2,2),(3,3))) /3 = 1.138
I can calculate the distance using np.linalg.norm, but I am confused about how to use it with pandas groupby. Thank you.
Note: one idea I had is to first build a dataframe (this is where I am stuck) that contains the pairs of points whose distance I need. After that, I would only need to calculate the distance for each pair and take the mean per group.
A possible solution, based on numpy broadcasting:
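def calc_avg_distance(group):
    # Coordinates of the group's points as an (n, 2) array
    points = group[['x', 'y']].values
    # Broadcasting (n, 2) against (n, 1, 2) gives all pairwise differences, shape (n, n, 2)
    dist_matrix = np.sqrt(((points - points[:, np.newaxis])**2).sum(axis=2))
    # Mask the zero self-distances so they do not count towards the mean
    np.fill_diagonal(dist_matrix, np.nan)
    # Each pair appears twice in the symmetric matrix, which leaves the mean unchanged
    return np.nanmean(dist_matrix)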
(df.groupby('id')[['x', 'y']].apply(calc_avg_distance)
   .reset_index(name='avg_distance'))
Output:
  id  avg_distance
0  A      1.000000
1  B      1.138071
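The pair-dataframe idea from the question's note can also be made to work. Below is a minimal sketch, assuming the same df as in the question. The helper column n and the suffixes _1 and _2 are names chosen here purely for illustration. It self-merges the frame on id to get every pair of points within a group, keeps each unordered pair once, and then takes the mean distance per group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B'],
    'x': [1, 1, 2, 2, 3],
    'y': [1, 2, 2, 3, 3],
})

# Number the points within each group (illustrative helper column 'n')
pts = df.assign(n=df.groupby('id').cumcount())

# Self-merge on 'id': every ordered pair of points within the same group
pairs = pts.merge(pts, on='id', suffixes=('_1', '_2'))

# Keep each unordered pair once; this also drops the self-pairs
pairs = pairs[pairs['n_1'] < pairs['n_2']].copy()

# Euclidean distance for each pair, then the mean per group
pairs['dist'] = np.hypot(pairs['x_1'] - pairs['x_2'],
                         pairs['y_1'] - pairs['y_2'])
result = pairs.groupby('id')['dist'].mean().reset_index(name='avg_distance')
print(result)
```

One difference from the broadcasting version: a group with a single point produces no pairs, so it is missing from result here, whereas calc_avg_distance returns NaN for it. The self-merge also materialises every ordered pair before filtering, so it may use noticeably more memory on large groups.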