I have a pandas DataFrame with a column containing strings representing numbers. These strings have mixed formats. Some times numbers use comma as a decimal separator and sometimes a dot. When a dot is used as a decimal separator, that number can contain comma as a thousand separator.
For example:
import pandas as pd
data = {
'NumberString': [
'1,234.56',
'789,012.34',
'45,678',
'9,876.54',
'3,210.98',
'1,000,000.01',
'123.45',
'42,000',
'NaN'
]
}
df = pd.DataFrame(data)
I want to convert this column to numeric without losing some of the data due to inconsistent format (commas vs dots). However, using pd.to_numeric with parameter errors='coerce' will drop down some of the number Python
Is there a way to format all the strings to numbers without loosing them due to format?
What I have tried so far:
>>> df['Number'] = pd.to_numeric(df['NumberString'].str.replace(',','.'), errors='coerce')
NumberString Number
1,234.56 NaN
789,012.34 NaN
45,678 45.678
9,876.54 NaN
3,210.98 NaN
1,000,000.01 NaN
123.45 123.450
42,000 42.000
NaN NaN
Desired output:
NumberString Number
1,234.56 1234.56
789,012.34 789012.34
45,678 45.678
9,876.54 9876.54
3,210.98 3210.98
1,000,000.01 1000000.01
123.45 123.450
42,000 42.000
NaN NaN
If you have mixed formats, you could first try to str.replace
the commas by a dot and convert to_numeric
with errors='coerce'
, then fillna
with an attempt converting the commas to empty string:
df['Number'] = (pd.to_numeric(df['NumberString'].str.replace(',', '.'), errors='coerce')
.fillna(pd.to_numeric(df['NumberString'].str.replace(',', ''), errors='coerce'))
)
Output:
NumberString Number
0 1,234.56 1234.560
1 789,012.34 789012.340
2 45,678 45.678
3 9,876.54 9876.540
4 3,210.98 3210.980
5 1,000,000.01 1000000.010
6 123.45 123.450
7 42,000 42.000
8 NaN NaN
1.000.000,01
/1.000,01
?If this is desired, then you can use a custom function. Count the number of .
/,
and decide based on those. If a single .
or ,
and the other has zero or more than 1, then remove the other character and convert. If one of each, use the last one as decimal separator. If more that 1 for each, return NaN (you could also use a try/except to catch all invalid strings).
def to_num(s):
d = s.count('.')
c = s.count(',')
if d<=1 and c != 1:
return float(s.replace(',', ''))
if c<=1 and d != 1:
return float(s.replace('.', '').replace(',', '.'))
if c>1 and d>1:
return float('nan')
s2 = s[::-1]
if s2.index('.') < s2.index(','):
return float(s.replace(',', ''))
else:
return float(s.replace('.', '').replace(',', '.'))
df['Number'] = df['NumberString'].map(to_num)
Output:
NumberString Number
0 1,234.56 1234.560
1 789,012.34 789012.340
2 45,678 45.678
3 9,876.54 9876.540
4 3,210.98 3210.980
5 1,000,000.01 1000000.010
6 123.45 123.450
7 42,000 42.000
8 NaN NaN
9 1.000.000,01 1000000.010
10 1,000,000 1000000.000
11 1.000.000 1000000.000
12 1,000.01 1000.010
13 1.000,01 1000.010
14 1.2.3,4,5 NaN