I am working with two DataFrames, defined as follows:
import pandas as pd
df1 = pd.DataFrame([[1, 'a', 0.95], [2, 'b', 0.92], [3, 'c', 0.91]], columns=['id','value','similarity'])
df2 = pd.DataFrame([[3, 'c', 0.93], [4, 'd', 0.92], [5, 'e', 0.99]], columns=['id','value','similarity'])
df1
   id value  similarity
0   1     a        0.95
1   2     b        0.92
2   3     c        0.91

df2
   id value  similarity
0   3     c        0.93
1   4     d        0.92
2   5     e        0.99
I now want to combine both DataFrames on the key column id, while avoiding the creation of NaNs in the value column. For this, I use combine_first, as suggested in this answer, to get this:
df3 = df1.set_index('id').combine_first(df2.set_index('id'))
df3
   value  similarity
id
1      a        0.95
2      b        0.92
3      c        0.91
4      d        0.92
5      e        0.99
However, I have an additional requirement: when a similarity value exists in both sets, I want to take their average as the new value. So, for instance, my desired output would be something like:
id value  similarity
 1     a        0.95
 2     b        0.92
 3     c        0.92   <-- (0.91 + 0.93) / 2
 4     d        0.92
 5     e        0.99
Preferably using pandas or numpy, how can I average the similarities when both DataFrames have a value?
You can concat, then groupby.mean:
out = pd.concat([df1, df2]).groupby(['id', 'value'], as_index=False).mean()
Output:
id value similarity
0 1 a 0.95
1 2 b 0.92
2 3 c 0.92
3 4 d 0.92
4 5 e 0.99
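If you prefer to stay aligned on the keys rather than stacking the frames, here is a merge-based sketch (not part of the answer above; it assumes similarity is the only column that needs averaging) that gives the same unweighted result:

# outer merge keeps every id; mean(axis=1) skips NaN by default,
# so ids present in only one frame keep their single similarity value
m = df1.merge(df2, on=['id', 'value'], how='outer', suffixes=('_1', '_2'))
m['similarity'] = m[['similarity_1', 'similarity_2']].mean(axis=1)
out = m[['id', 'value', 'similarity']]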
If you want the combine_first behavior for some columns and the average for others, you could use groupby.agg with first/mean depending on the column:
# extra column to demonstrate mixing aggregations
df1['x'] = 'one'
df2['x'] = 'two'

out = (pd.concat([df1, df2])
         .groupby(['id', 'value'], as_index=False)
         .agg({'similarity': 'mean', 'x': 'first'})  # average similarity, keep the first x
       )
Output:
id value similarity x
0 1 a 0.95 one
1 2 b 0.92 one
2 3 c 0.92 one
3 4 d 0.92 two
4 5 e 0.99 two
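With many columns, you could build that mapping programmatically instead of writing it out by hand (a sketch, assuming similarity is the only column you want averaged and every other non-key column should keep its first value):

# map each non-key column to an aggregation: 'mean' for similarity, 'first' otherwise
agg_map = {c: 'mean' if c == 'similarity' else 'first'
           for c in df1.columns.difference(['id', 'value'])}
out = (pd.concat([df1, df2])
         .groupby(['id', 'value'], as_index=False)
         .agg(agg_map))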
If you want a weighted average instead (here weighting df1 at 0.7 and df2 at 0.3), assign a weight column to each frame and use numpy.average per group:

import numpy as np
out = (pd.concat([df1.assign(w=0.7), df2.assign(w=0.3)])
.groupby(['id', 'value'])
.apply(lambda x: np.average(x['similarity'], weights=x['w']),
include_groups=False)
.reset_index(name='weighted_avg')
)
Output:
id value weighted_avg
0 1 a 0.950
1 2 b 0.920
2 3 c 0.916
3 4 d 0.920
4 5 e 0.990
Or, without apply:
g = (pd.concat([df1.assign(w=0.7), df2.assign(w=0.3)])
.eval('weighted_avg = similarity * w')
.groupby(['id', 'value'])
)
out = (g.first()
.assign(weighted_avg=g['weighted_avg'].sum()/g['w'].sum())
.reset_index()
)
Output:
id value similarity w weighted_avg
0 1 a 0.95 0.7 0.950
1 2 b 0.92 0.7 0.920
2 3 c 0.91 0.7 0.916
3 4 d 0.92 0.3 0.920
4 5 e 0.99 0.3 0.990
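Note that dividing by g['w'].sum() keeps the formula correct even when the weights do not sum to 1; with 0.7/0.3 it is a no-op. A quick hand check of the weighted value for id 3 (my own arithmetic, not part of the answer above):

# weights 0.7 and 0.3 as assigned above
0.91 * 0.7 + 0.93 * 0.3  # -> 0.916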