When I run Pandas df.drop_duplicates() in python the duplicated lines don't disappear. It behaves just as if the three 3542.87 were different values (and indeed they are, read the end of my post).
a slice of my data:
print(df.iloc[234:237,:])
Date Libellé Montant solde
291 04/11/2009 CHEQUE N°9667007 -65.00 5030.93
2 05/11/2009 DEBIT CARTE BANCAIRE DIFFERE -1488.06 3542.87
131 05/11/2009 DEBIT CARTE BANCAIRE DIFFERE -1488.06 3542.87
290 05/11/2009 DEBIT CARTE BANCAIRE DIFFERE -1488.06 3542.87
1 05/11/2009 VIREMENT SUR PEA 3E350018328051 VOTRE OPERATI...-25.00 3517.87
My dataframe has unusual dtypes:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9404 entries, 111 to 8814
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 9404 non-null object
1 Libellé 9404 non-null object
2 Montant 9404 non-null float64
3 solde 9404 non-null float64
dtypes: float64(2), object(2)
memory usage: 367.3+ KB
As "Date" are datetime.date objects and "Libellé" are strings.
After if I try df.groupby(['Date', 'Libellé', 'Montant'])['solde'].nunique()
On aggregated row (05/11/2009, DEBIT CARTE BANCAIRE DIFFERE, -1488.06)
it returns 3 instead of 1.
Do you have any idea on how this is possible and how to clean my solde values.
Thanks.
I tried to force float format using .astype(float)
And also:
all4.iloc[233:238,:].groupby(['Date', 'Libellé', 'Montant'])['solde'].apply(list)
Out:
Date Libellé Montant
2009-11-04 CHEQUE N°9667007 -65.00 [5030.93]
2009-11-05 DEBIT CARTE BANCAIRE DIFFERE -1488.06 [3542.87, 3542.8700000000003, 3542.870000000001]
VIREMENT SUR PEA 3E350018328051 VOTRE OPERATION DU 04/11/2009 -25.00 [3517.87]
Name: solde, dtype: object
Now the question is how to clean these data and avoid this mistake in the future.
df['solde'] = df['solde'].round(6)
Solved my issue, however the data were extracted from .csv, .xls and .xlsx files maybe it could explain the bad import : [3542.87, 3542.8700000000003, 3542.870000000001]
In hope this would help somebody facing this tricky problem.
After checking it doesn't come from import but from calculation inside my python code. I further investigate and let you know.
The strange thing is that editor doesn't show more than 2 decimals vs code variables editor
Poking around in the code, I came across this:
In:
10166.89 + 0.98
Out:
10167.869999999999
Typical floating point numbers issue ! Fixed using round()