pythonpandasdrop

Remove duplicated values appear in two columns in dataframe


I have table similar to this one:

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy     Anne      path/to/Anne
1       Anne      path/to/Anne     Roy      path/to/Roy 
2       Hari      path/to/Hari    Wili      path/to/Wili
3       Wili      path/to/Wili    Hari      path/to/Hari
4       Miko      path/to/miko     Lin      path/to/lin
5       Miko      path/to/miko     Dan      path/to/dan
6       Lin       path/to/lin     Miko      path/to/miko
7       Lin       path/to/lin     Dan       path/to/dan
8       Dan       path/to/dan     Miko      path/to/miko
9       Dan       path/to/dan     Lin       path/to/lin
...

As you can see, the table kind of showing relationship between entities -
Roi is with Anne,
Wili with Hari,
Lin with Dan and with Miko.

The table is actually showing overlap data , meaning, Hari and wili for example, have the same document, and I would like to remove one of them not to have duplicated files. In order to do this, I would like to create new table that has only one name in relationship, so I can later create list of paths to remove.

The result table will look like this :

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy      Anne      path/to/Anne
1       Hari      path/to/Hari     Wili      path/to/Wili
2       Miko      path/to/miko     Lin       path/to/lin
3       Miko      path/to/miko     Dan       path/to/dan

The idea is that I'll use the values of "path2" to remove files with this path, and will still have the files in path1. for that reason, this line:

4       Lin       path/to/lin    Dan       path/to/dan

is missing, as it will be removed using Miko... any ideas how to do this ? :)

Edit:

I have tried this based on this answer:

df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]

And it's true that I get less rows in my dataframe (it has 695 and I got now 402) , but, I still have the first lines like this:

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy     Anne      path/to/Anne
1       Anne      path/to/Anne     Roy      path/to/Roy 
...

meaning I still get the same issue


Solution

  • You can use frozenset to detect duplicates:

    out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
               .loc[lambda x: ~x['path2'].isin(x['path1'])])
    
    # OR
    
    out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
               .query('~path1.isin(path2)'))
    

    Output:

    >>> out
      name_1         path1 name_2         path2
    0    Roy   path/to/Roy   Anne  path/to/Anne
    2   Hari  path/to/Hari   Wili  path/to/Wili
    5   Miko  path/to/miko    Dan   path/to/dan
    7    Lin   path/to/lin    Dan   path/to/dan