I have a dataset that contains 3 columns. These are edge connections between nodes and the strength of the connection. What I am trying to do is find and merge the extra edges that can occur when the direction goes in the opposite direction.
as a short example
data_frame = pd.DataFrame({"A":["aa", "aa", "aa", "bb", "bb", "cc", "dd", "dd"],
"B":["bb", "cc", "dd", "aa", "dd", "aa", "ee", "aa"],
"C":[4,3,4,5,3,4,2, 5]})
the resulting node graph
aa - bb
| \ |
cc dd -- ee
From the nodes, we have overlap as "aa - bb" is the same as "bb - aa" and same with "aa - dd" and "dd - aa"
I thought about merging A and B together both forward and reverse, concatenating the two dataframes and than performing a group_by().sum()
but I end up with extras that need to be removed afterwards.
ideally this is how it would work
A | B | C A | B | C
aa bb 4 aa bb 9
aa cc 3 aa cc 7
aa dd 4 aa dd 9
bb aa 5 bb dd 3
bb dd 3 --> dd ee 2
cc aa 4
dd ee 2
dd aa 5
You can aggregate as frozenset
, then perform a groupby.sum
:
out = (data_frame['C']
.groupby(data_frame[['A', 'B']].agg(frozenset, axis=1))
.sum()
.reset_index()
)
Output:
index C
0 (bb, aa) 9
1 (cc, aa) 7
2 (dd, aa) 9
3 (bb, dd) 3
4 (ee, dd) 2
Variant to get the original columns:
cols = ['A', 'B']
out = (data_frame
.groupby(data_frame[['A', 'B']].agg(frozenset, axis=1),
as_index=False)
.agg(dict.fromkeys(cols, 'first')|{'C': 'sum'})
)
Output:
A B C
0 aa bb 9
1 aa cc 7
2 aa dd 9
3 bb dd 3
4 dd ee 2
Since converting to frozenset
is quite slow, you can also sort
the values in a consistent order using numpy, and groupby.sum
:
import numpy as np
tmp = data_frame.copy() # avoid modifying the original frame
tmp[['A', 'B']] = np.sort(data_frame[['A', 'B']], axis=1)
out = tmp.groupby(['A', 'B'], as_index=False).sum()
Variant of the similar method suggested by @PandaKim, with improved efficiency:
cols = ['A', 'B']
out = (data_frame.groupby([*np.sort(data_frame[cols]).T])['C']
.sum().rename_axis(cols).reset_index()
)
Output:
A B C
0 aa bb 9
1 aa cc 7
2 aa dd 9
3 bb dd 3
4 dd ee 2