Convert to float, pandas string column with mixed thousand and decimal separators

I have a pandas DataFrame with a column of strings that represent numbers in mixed formats. Sometimes a comma is used as the decimal separator and sometimes a dot. When a dot is the decimal separator, the number can also contain commas as thousands separators.

For example:

import pandas as pd
data = {
    'NumberString': [
        '1,234.56',
        '789,012.34',
        '45,678',
        '9,876.54',
        '3,210.98',
        '1,000,000.01',
        '123.45',
        '42,000',
        'NaN'
    ]
}
df = pd.DataFrame(data)

I want to convert this column to numeric without losing data because of the inconsistent format (commas vs dots). However, using pd.to_numeric with errors='coerce' turns some of the numbers into NaN.

Is there a way to convert all the strings to numbers without losing any of them because of the format?

What I have tried so far:

>>> df['Number'] = pd.to_numeric(df['NumberString'].str.replace(',','.'), errors='coerce')

NumberString  Number
1,234.56      NaN
789,012.34    NaN
45,678        45.678
9,876.54      NaN
3,210.98      NaN
1,000,000.01  NaN
123.45        123.450
42,000        42.000
NaN           NaN

Desired output:

NumberString  Number
1,234.56      1234.56 
789,012.34    789012.34
45,678        45.678
9,876.54      9876.54
3,210.98      3210.98
1,000,000.01  1000000.01
123.45        123.450
42,000        42.000
NaN           NaN

Solution

If you have mixed formats, you can first str.replace the commas with dots and convert with to_numeric and errors='coerce'. Strings that contain both separators become NaN in this first pass, so fillna them with a second attempt that removes the commas instead:

df['Number'] = (pd.to_numeric(df['NumberString'].str.replace(',', '.'), errors='coerce')
                  .fillna(pd.to_numeric(df['NumberString'].str.replace(',', ''), errors='coerce'))
               )

Output:

   NumberString       Number
0      1,234.56     1234.560
1    789,012.34   789012.340
2        45,678       45.678
3      9,876.54     9876.540
4      3,210.98     3210.980
5  1,000,000.01  1000000.010
6        123.45      123.450
7        42,000       42.000
8           NaN          NaN
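
To see why the order of the two attempts matters, here is a minimal per-value sketch of the same idea in plain Python. The name `two_pass` is hypothetical, and `float()` stands in for `pd.to_numeric`, so edge cases the two parsers treat differently are not covered.

```python
def two_pass(s):
    # First attempt: treat every comma as a decimal separator.
    # '1,234.56' becomes '1.234.56', which float() rejects.
    try:
        return float(s.replace(',', '.'))
    except ValueError:
        pass
    # Fallback: treat commas as thousands separators and drop them.
    try:
        return float(s.replace(',', ''))
    except ValueError:
        # Neither reading works: mimic errors='coerce'.
        return float('nan')

for s in ['1,234.56', '45,678', '1,000,000.01', '123.45', 'NaN']:
    print(s, '->', two_pass(s))
```

A string with a single comma and no dot, such as `45,678`, always succeeds on the first attempt, so it is read as 45.678 and never as 45678.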

What about `1.000.000,01`/`1.000,01`?

If these should be handled too, you can use a custom function. Count the number of . and , characters and decide based on the counts. If one of the two characters appears at most once and the other appears zero times or more than once, treat the first as the decimal separator, remove the other character, and convert. If there is exactly one of each, use whichever comes last as the decimal separator. If both appear more than once, return NaN (you could also use a try/except to catch all invalid strings).

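def to_num(s):
    d = s.count('.')  # number of dots
    c = s.count(',')  # number of commas
    # At most one dot and not exactly one comma: commas are thousands separators
    if d <= 1 and c != 1:
        return float(s.replace(',', ''))
    # At most one comma and not exactly one dot: dots are thousands separators
    if c <= 1 and d != 1:
        return float(s.replace('.', '').replace(',', '.'))
    # Several of each: ambiguous, give up
    if c > 1 and d > 1:
        return float('nan')
    # Exactly one of each: the one that comes last is the decimal separator
    s2 = s[::-1]
    if s2.index('.') < s2.index(','):
        return float(s.replace(',', ''))
    else:
        return float(s.replace('.', '').replace(',', '.'))

df['Number'] = df['NumberString'].map(to_num)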

Output (with a few extra test strings appended to the column):

    NumberString       Number
0       1,234.56     1234.560
1     789,012.34   789012.340
2         45,678       45.678
3       9,876.54     9876.540
4       3,210.98     3210.980
5   1,000,000.01  1000000.010
6         123.45      123.450
7         42,000       42.000
8            NaN          NaN
9   1.000.000,01  1000000.010
10     1,000,000  1000000.000
11     1.000.000  1000000.000
12      1,000.01     1000.010
13      1.000,01     1000.010
14     1.2.3,4,5          NaN
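
The try/except idea mentioned above could look roughly like the following sketch. The wrapper name `safe_to_num` is hypothetical, and `to_num` is repeated here only so the block runs on its own. It assumes that anything that is not a parseable string (a missing value, free text) should become NaN.

```python
import math

def to_num(s):
    # Same decision logic as the function above; rfind replaces the
    # reversed-string index comparison and gives the same result.
    d = s.count('.')
    c = s.count(',')
    if d <= 1 and c != 1:
        return float(s.replace(',', ''))
    if c <= 1 and d != 1:
        return float(s.replace('.', '').replace(',', '.'))
    if c > 1 and d > 1:
        return float('nan')
    if s.rfind('.') > s.rfind(','):
        return float(s.replace(',', ''))
    return float(s.replace('.', '').replace(',', '.'))

def safe_to_num(s):
    # Non-string inputs (None, float NaN, ...) have no .count method.
    if not isinstance(s, str):
        return float('nan')
    try:
        return to_num(s.strip())
    except ValueError:
        # float() could not parse the cleaned-up string.
        return float('nan')

print(safe_to_num('1.000,01'))
print(math.isnan(safe_to_num('abc')))
```

It can be applied the same way as before, with `df['NumberString'].map(safe_to_num)`.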
