group-byformatdrop-duplicates

Pandas df.drop_duplicates() has no effect on multiple identical-like rows


When I run Pandas df.drop_duplicates() in python the duplicated lines don't disappear. It behaves just as if the three 3542.87 were different values (and indeed they are, read the end of my post).

a slice of my data:

print(df.iloc[234:237,:])
    Date        Libellé                     Montant     solde
291 04/11/2009  CHEQUE N°9667007                -65.00      5030.93
2   05/11/2009  DEBIT CARTE BANCAIRE DIFFERE            -1488.06    3542.87
131 05/11/2009  DEBIT CARTE BANCAIRE DIFFERE            -1488.06    3542.87
290 05/11/2009  DEBIT CARTE BANCAIRE DIFFERE            -1488.06    3542.87
1   05/11/2009  VIREMENT SUR PEA 3E350018328051 VOTRE OPERATI...-25.00      3517.87

My dataframe has unusual dtypes:

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9404 entries, 111 to 8814
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Date     9404 non-null   object 
 1   Libellé  9404 non-null   object 
 2   Montant  9404 non-null   float64
 3   solde    9404 non-null   float64
dtypes: float64(2), object(2)
memory usage: 367.3+ KB

As "Date" are datetime.date objects and "Libellé" are strings.

After if I try df.groupby(['Date', 'Libellé', 'Montant'])['solde'].nunique()

On aggregated row (05/11/2009, DEBIT CARTE BANCAIRE DIFFERE, -1488.06) it returns 3 instead of 1.

Do you have any idea on how this is possible and how to clean my solde values.

Thanks.

I tried to force float format using .astype(float) And also: all4.iloc[233:238,:].groupby(['Date', 'Libellé', 'Montant'])['solde'].apply(list) Out:

Date        Libellé                                                              Montant 
2009-11-04  CHEQUE N°9667007                                                     -65.00                                             [5030.93]
2009-11-05  DEBIT CARTE BANCAIRE DIFFERE                                         -1488.06    [3542.87, 3542.8700000000003, 3542.870000000001]
            VIREMENT SUR PEA  3E350018328051 VOTRE OPERATION DU    04/11/2009    -25.00                                             [3517.87]
Name: solde, dtype: object

Now the question is how to clean these data and avoid this mistake in the future.


Solution

  • df['solde'] = df['solde'].round(6)
    

    Solved my issue, however the data were extracted from .csv, .xls and .xlsx files maybe it could explain the bad import : [3542.87, 3542.8700000000003, 3542.870000000001] In hope this would help somebody facing this tricky problem.

    After checking it doesn't come from import but from calculation inside my python code. I further investigate and let you know.

    The strange thing is that editor doesn't show more than 2 decimals vs code variables editor

    Poking around in the code, I came across this:

    In:

    10166.89 + 0.98
    

    Out:

    10167.869999999999
    

    Typical floating point numbers issue ! Fixed using round()