Tags: python, pandas

pandas find duplicate pairs between 2 columns of data


I have a dataset that contains 3 columns: edge connections between nodes and the strength of each connection. What I am trying to do is find and merge the duplicate edges that occur when the same connection is listed in the opposite direction.

As a short example:

import pandas as pd

data_frame = pd.DataFrame({"A": ["aa", "aa", "aa", "bb", "bb", "cc", "dd", "dd"],
                           "B": ["bb", "cc", "dd", "aa", "dd", "aa", "ee", "aa"],
                           "C": [4, 3, 4, 5, 3, 4, 2, 5]})

The resulting node graph:

 aa - bb
 |  \  |
 cc   dd -- ee

From the nodes, we have overlap: "aa - bb" is the same edge as "bb - aa", and likewise "aa - dd" and "dd - aa".

I thought about merging A and B together both forward and reversed, concatenating the two dataframes and then performing a groupby().sum(), but I end up with extras that need to be removed afterwards.
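For reference, the concatenate-both-directions idea can be sketched like this (the `swapped`, `both` and `dedup` names are mine, not from the original attempt); it shows the mirrored "extras" that still need a filtering step afterwards:

```python
import pandas as pd

data_frame = pd.DataFrame({"A": ["aa", "aa", "aa", "bb", "bb", "cc", "dd", "dd"],
                           "B": ["bb", "cc", "dd", "aa", "dd", "aa", "ee", "aa"],
                           "C": [4, 3, 4, 5, 3, 4, 2, 5]})

# Concatenate the frame with a copy whose A/B labels are swapped,
# then sum per *directed* pair.
swapped = data_frame.rename(columns={"A": "B", "B": "A"})
both = (pd.concat([data_frame, swapped])
          .groupby(["A", "B"], as_index=False)["C"].sum())

# Every undirected pair now appears twice (once per direction),
# so the mirrored duplicates still have to be dropped, e.g. by
# keeping only the rows where A sorts before B:
dedup = both[both["A"] <= both["B"]].reset_index(drop=True)
```

Here `both` has 10 rows (each of the 5 undirected pairs twice), which is exactly the extra cleanup the approaches below avoid.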

Ideally, this is how it would work:

 A  |  B   |   C                                A  |  B   |   C
aa     bb      4                               aa     bb      9
aa     cc      3                               aa     cc      7
aa     dd      4                               aa     dd      9
bb     aa      5                               bb     dd      3
bb     dd      3                 -->           dd     ee      2
cc     aa      4                               
dd     ee      2
dd     aa      5

Solution

  • You can aggregate as frozenset, then perform a groupby.sum:

    out = (data_frame['C']
           .groupby(data_frame[['A', 'B']].agg(frozenset, axis=1))
           .sum()
           .reset_index()
          )
    

    Output:

          index  C
    0  (bb, aa)  9
    1  (cc, aa)  7
    2  (dd, aa)  9
    3  (bb, dd)  3
    4  (ee, dd)  2
    

    Variant to get the original columns:

    cols = ['A', 'B']
    
    out = (data_frame
           .groupby(data_frame[['A', 'B']].agg(frozenset, axis=1),
                    as_index=False)
           .agg(dict.fromkeys(cols, 'first')|{'C': 'sum'})
          )
    

    Output:

        A   B  C
    0  aa  bb  9
    1  aa  cc  7
    2  aa  dd  9
    3  bb  dd  3
    4  dd  ee  2
    

    Since converting to frozenset is quite slow, you can also sort the values into a consistent order using np.sort, then groupby.sum:

    import numpy as np
    tmp = data_frame.copy() # avoid modifying the original frame
    
    tmp[['A', 'B']] = np.sort(data_frame[['A', 'B']], axis=1)
    
    out = tmp.groupby(['A', 'B'], as_index=False).sum()
    

    Variant of the similar method suggested by @PandaKim, with improved efficiency:

    cols = ['A', 'B']
    out = (data_frame.groupby([*np.sort(data_frame[cols]).T])['C']
                     .sum().rename_axis(cols).reset_index()
          )
    

    Output:

        A   B  C
    0  aa  bb  9
    1  aa  cc  7
    2  aa  dd  9
    3  bb  dd  3
    4  dd  ee  2
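To unpack that dense one-liner: `np.sort(data_frame[cols])` sorts each row's pair into a canonical order, and transposing the result yields the two 1-D key arrays that `groupby` uses. A step-by-step equivalent (the intermediate names `keys`, `lo`, `hi` are mine):

```python
import numpy as np
import pandas as pd

data_frame = pd.DataFrame({"A": ["aa", "aa", "aa", "bb", "bb", "cc", "dd", "dd"],
                           "B": ["bb", "cc", "dd", "aa", "dd", "aa", "ee", "aa"],
                           "C": [4, 3, 4, 5, 3, 4, 2, 5]})
cols = ["A", "B"]

keys = np.sort(data_frame[cols])   # each row sorted: ("bb", "aa") -> ("aa", "bb")
lo, hi = keys.T                    # two 1-D arrays of group keys
out = (data_frame.groupby([lo, hi])["C"]
                 .sum().rename_axis(cols).reset_index())
```

This produces the same five-row result as the variants above, without modifying `data_frame`.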
    

    Timings:

    (benchmark chart not reproduced in this transcript)
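Since the original benchmark chart is unavailable, here is a minimal, hypothetical `timeit` sketch to compare the frozenset and np.sort approaches yourself (the random data, its size, and the repeat count are arbitrary choices of mine, not the original benchmark setup):

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark frame: random edges between 100 string-labelled nodes.
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({"A": rng.integers(0, 100, n).astype(str),
                   "B": rng.integers(0, 100, n).astype(str),
                   "C": rng.integers(1, 10, n)})

def with_frozenset():
    # frozenset-key approach from the answer
    return (df["C"]
            .groupby(df[["A", "B"]].agg(frozenset, axis=1))
            .sum().reset_index())

def with_sort():
    # np.sort approach from the answer
    tmp = df.copy()
    tmp[["A", "B"]] = np.sort(df[["A", "B"]], axis=1)
    return tmp.groupby(["A", "B"], as_index=False).sum()

for fn in (with_frozenset, with_sort):
    t = timeit.timeit(fn, number=3)
    print(f"{fn.__name__}: {t / 3:.3f}s per run")
```

Both functions should produce the same number of undirected groups and preserve the total weight; only their runtimes differ.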