Tags: python, pandas

pandas find duplicate pairs between 2 columns of data


I have a dataset that contains 3 columns: edge connections between nodes and the strength of each connection. What I am trying to do is find and merge the duplicate edges that occur when the same connection is listed in the opposite direction.

As a short example:

import pandas as pd

data_frame = pd.DataFrame({"A": ["aa", "aa", "aa", "bb", "bb", "cc", "dd", "dd"],
                           "B": ["bb", "cc", "dd", "aa", "dd", "aa", "ee", "aa"],
                           "C": [4, 3, 4, 5, 3, 4, 2, 5]})

The resulting node graph:

 aa - bb
 |  \  |
 cc   dd -- ee

From the nodes, we have overlap: "aa - bb" is the same edge as "bb - aa", and likewise "aa - dd" and "dd - aa".

I thought about merging A and B together both forward and reversed, concatenating the two dataframes and then performing a groupby().sum(), but I end up with extras that need to be removed afterwards.
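For reference, the concatenate-both-directions idea can be sketched like this (the `swapped`, `both` and `dedup` names are mine, not from the original attempt); it shows the mirrored "extras" that still need a filtering step afterwards:

```python
import pandas as pd

data_frame = pd.DataFrame({"A": ["aa", "aa", "aa", "bb", "bb", "cc", "dd", "dd"],
                           "B": ["bb", "cc", "dd", "aa", "dd", "aa", "ee", "aa"],
                           "C": [4, 3, 4, 5, 3, 4, 2, 5]})

# Concatenate the frame with a copy whose A/B labels are swapped,
# then sum per *directed* pair.
swapped = data_frame.rename(columns={"A": "B", "B": "A"})
both = (pd.concat([data_frame, swapped])
          .groupby(["A", "B"], as_index=False)["C"].sum())

# Every undirected pair now appears twice (once per direction),
# so the mirrored duplicates still have to be dropped, e.g. by
# keeping only the rows where A sorts before B:
dedup = both[both["A"] <= both["B"]].reset_index(drop=True)
```

Here `both` has 10 rows (each of the 5 undirected pairs twice), which is exactly the extra cleanup the approaches below avoid.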

Ideally, this is how it would work:

 A  |  B   |   C                                A  |  B   |   C
aa     bb      4                               aa     bb      9
aa     cc      3                               aa     cc      7
aa     dd      4                               aa     dd      9
bb     aa      5                               bb     dd      3
bb     dd      3                 -->           dd     ee      2
cc     aa      4                               
dd     ee      2
dd     aa      5

Solution

  • You can aggregate as frozenset, then perform a groupby.sum:

    out = (data_frame['C']
           .groupby(data_frame[['A', 'B']].agg(frozenset, axis=1))
           .sum()
           .reset_index()
          )
    

    Output:

          index  C
    0  (bb, aa)  9
    1  (cc, aa)  7
    2  (dd, aa)  9
    3  (bb, dd)  3
    4  (ee, dd)  2
    

    Variant to get the original columns:

    cols = ['A', 'B']
    
    out = (data_frame
           .groupby(data_frame[['A', 'B']].agg(frozenset, axis=1),
                    as_index=False)
           .agg(dict.fromkeys(cols, 'first')|{'C': 'sum'})
          )
    

    Output:

        A   B  C
    0  aa  bb  9
    1  aa  cc  7
    2  aa  dd  9
    3  bb  dd  3
    4  dd  ee  2
    

    Since converting to frozenset is quite slow, you can also sort the values into a consistent order using np.sort, then groupby.sum:

    import numpy as np
    tmp = data_frame.copy() # avoid modifying the original frame
    
    tmp[['A', 'B']] = np.sort(data_frame[['A', 'B']], axis=1)
    
    out = tmp.groupby(['A', 'B'], as_index=False).sum()
    

    Variant of the similar method suggested by @PandaKim, with improved efficiency:

    cols = ['A', 'B']
    out = (data_frame.groupby([*np.sort(data_frame[cols]).T])['C']
                     .sum().rename_axis(cols).reset_index()
          )
    

    Output:

        A   B  C
    0  aa  bb  9
    1  aa  cc  7
    2  aa  dd  9
    3  bb  dd  3
    4  dd  ee  2
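To unpack that dense one-liner: `np.sort(data_frame[cols])` sorts each row's pair into a canonical order, and transposing the result yields the two 1-D key arrays that `groupby` uses. A step-by-step equivalent (the intermediate names `keys`, `lo`, `hi` are mine):

```python
import numpy as np
import pandas as pd

data_frame = pd.DataFrame({"A": ["aa", "aa", "aa", "bb", "bb", "cc", "dd", "dd"],
                           "B": ["bb", "cc", "dd", "aa", "dd", "aa", "ee", "aa"],
                           "C": [4, 3, 4, 5, 3, 4, 2, 5]})
cols = ["A", "B"]

keys = np.sort(data_frame[cols])   # each row sorted: ("bb", "aa") -> ("aa", "bb")
lo, hi = keys.T                    # two 1-D arrays of group keys
out = (data_frame.groupby([lo, hi])["C"]
                 .sum().rename_axis(cols).reset_index())
```

This produces the same five-row result as the variants above, without modifying `data_frame`.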
    

    Timings:

    (benchmark chart not reproduced in this transcript)
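Since the original benchmark chart is unavailable, here is a minimal, hypothetical `timeit` sketch to compare the frozenset and np.sort approaches yourself (the random data, its size, and the repeat count are arbitrary choices of mine, not the original benchmark setup):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark frame: random edges between 100 string-labelled nodes.
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({"A": rng.integers(0, 100, n).astype(str),
                   "B": rng.integers(0, 100, n).astype(str),
                   "C": rng.integers(1, 10, n)})

def with_frozenset():
    # frozenset-key approach from the answer
    return (df["C"]
            .groupby(df[["A", "B"]].agg(frozenset, axis=1))
            .sum().reset_index())

def with_sort():
    # np.sort approach from the answer
    tmp = df.copy()
    tmp[["A", "B"]] = np.sort(df[["A", "B"]], axis=1)
    return tmp.groupby(["A", "B"], as_index=False).sum()

for fn in (with_frozenset, with_sort):
    t = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {t / 3:.3f}s per run")
```

Both functions should produce the same number of undirected groups and preserve the total weight; only their runtimes differ.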