pythonpandasexcelduplicatesdrop-duplicates

Finding duplicates across multiple sheets in an excel corresponding to the first column using python


I have an excel file with multiple sheets containing when and where employees went for a sale. The columns are different in all sheets. Example

Sheet 1

 Date/Place      PlaceA  PlaceB   PlaceD   PlaceE   PlaceF  PlaceG
2019-03-01        A        B        C        D        E        F
2019-03-02        A        B        C        D        E        F
2019-03-03        A        B        C        D        E        F
2019-03-08        A        B        C        D        E        F
2019-03-05        A        B        C        T        E        F
2019-03-06        G        B        K        D        N        Y

Sheet 2

 Date/Place         PlaceE   PlaceF  PlaceG PlaceH PlaceI
2019-03-05           T        E        U      A     B
2019-03-06           X        P        Y      N     H

I need to find the duplicates, corresponding to the Date and Place and replace it with an @ symbol using python

The output will be an excel will the same sheets

Sheet 1

 Date/Place      PlaceA  PlaceB   PlaceD   PlaceE   PlaceF  PlaceG
2019-03-01        A        B        C        D        E        F
2019-03-02        A        B        C        D        E        F
2019-03-03        A        B        C        D        E        F
2019-03-08        A        B        C        D        E        F
2019-03-05        A        B        C        T        E        F
2019-03-06        G        B        K        D        N        Y

Sheet 2

 Date/Place         PlaceE   PlaceF  PlaceG PlaceH PlaceI
2019-03-05           @        @        U      A     B
2019-03-06           X        P        @      N     H

The data - T was in PlaceE on 2019-03-05 is seen in both Sheet1 and Sheet2, hence in Sheet2 it is replaced with @. There are multiple sheets in the file and the duplicates needs to checked across all sheets. Thanks in advance for the answer!

I tried using the df.duplicated function from pandas, but it checks only the same sheet and also, the since this is not direct row duplication, the duplication is based on the first column, it did not work out


Solution

  • Assuming df1/df2, you can align df1 on df2 and mask the cells that are different, except in Date/Place:

    tmp = df2.set_index('Date/Place')
    
    out = tmp.mask(tmp.eq(df1.set_index('Date/Place')), '@').reset_index()
    

    Or:

    out = df2.mask(
     df2[['Date/Place']].merge(df1, how='left')
     .set_axis(df2.index)
     .eq(df2).assign(**{'Date/Place': False}),
     '@'
     )
    

    Output:

       Date/Place PlaceE PlaceF PlaceG PlaceH PlaceI
    0  2019-03-05      @      @      U      A      B
    1  2019-03-06      X      P      @      N      H