pythonpandasdataframe

Convert to float, pandas string column with mixed thousand and decimal separators


I have a pandas DataFrame with a column containing strings representing numbers. These strings have mixed formats. Some times numbers use comma as a decimal separator and sometimes a dot. When a dot is used as a decimal separator, that number can contain comma as a thousand separator.

For example:

import pandas as pd
data = {
    'NumberString': [
        '1,234.56',
        '789,012.34',
        '45,678',
        '9,876.54',
        '3,210.98',
        '1,000,000.01',
        '123.45',
        '42,000',
        'NaN'
    ]
}
df = pd.DataFrame(data)

I want to convert this column to numeric without losing some of the data due to inconsistent format (commas vs dots). However, using pd.to_numeric with parameter errors='coerce' will drop down some of the number Python

Is there a way to format all the strings to numbers without loosing them due to format?

What I have tried so far:

>>> df['Number'] = pd.to_numeric(df['NumberString'].str.replace(',','.'), errors='coerce')
NumberString  Number
1,234.56      NaN
789,012.34    NaN
45,678        45.678
9,876.54      NaN
3,210.98      NaN
1,000,000.01  NaN
123.45        123.450
42,000        42.000
NaN           NaN

Desired output:

NumberString  Number
1,234.56      1234.56 
789,012.34    789012.34
45,678        45.678
9,876.54      9876.54
3,210.98      3210.98
1,000,000.01  1000000.01
123.45        123.450
42,000        42.000
NaN           NaN

Solution

  • If you have mixed formats, you could first try to str.replace the commas by a dot and convert to_numeric with errors='coerce', then fillna with an attempt converting the commas to empty string:

    df['Number'] = (pd.to_numeric(df['NumberString'].str.replace(',', '.'), errors='coerce')
                      .fillna(pd.to_numeric(df['NumberString'].str.replace(',', ''), errors='coerce'))
                   )
    

    Output:

       NumberString       Number
    0      1,234.56     1234.560
    1    789,012.34   789012.340
    2        45,678       45.678
    3      9,876.54     9876.540
    4      3,210.98     3210.980
    5  1,000,000.01  1000000.010
    6        123.45      123.450
    7        42,000       42.000
    8           NaN          NaN
    
    what about 1.000.000,01/1.000,01?

    If this is desired, then you can use a custom function. Count the number of ./, and decide based on those. If a single . or , and the other has zero or more than 1, then remove the other character and convert. If one of each, use the last one as decimal separator. If more that 1 for each, return NaN (you could also use a try/except to catch all invalid strings).

    def to_num(s):
        d = s.count('.')
        c = s.count(',')
        if d<=1 and c != 1:
            return float(s.replace(',', ''))
        if c<=1 and d != 1:
            return float(s.replace('.', '').replace(',', '.'))
        if c>1 and d>1:
            return float('nan')
        s2 = s[::-1]
        if s2.index('.') < s2.index(','):
            return float(s.replace(',', ''))
        else:
            return float(s.replace('.', '').replace(',', '.'))
            
    df['Number'] = df['NumberString'].map(to_num)
    

    Output:

        NumberString       Number
    0       1,234.56     1234.560
    1     789,012.34   789012.340
    2         45,678       45.678
    3       9,876.54     9876.540
    4       3,210.98     3210.980
    5   1,000,000.01  1000000.010
    6         123.45      123.450
    7         42,000       42.000
    8            NaN          NaN
    9   1.000.000,01  1000000.010
    10     1,000,000  1000000.000
    11     1.000.000  1000000.000
    12      1,000.01     1000.010
    13      1.000,01     1000.010
    14     1.2.3,4,5          NaN