
Converting values in dataframe from object to integer/float but they're still functioning as objects


I have a set of values in my dataframe that are objects written as "100%", "75%", etc. I'm converting these to integers (100, 75, etc.).

This is the function I have:

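import pandas as pd

def convert_object_to_int(column):
    column = column.astype(str)                      # make sure the .str methods work
    column = column.str.rstrip('%')                  # "75%" -> "75"
    column = pd.to_numeric(column, errors='coerce')  # unparseable values become NaN
    column = column.fillna(column.median())          # fill NaN with the column median
    return column.astype(int)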

After calling the function with this:

a1data.loc[:, 'Total(%)'] = convert_object_to_int(a1data['Total(%)'])

My Total(%) column still shows up as object when I check a1data.dtypes

The numbers HAVE changed, and I can use them in visualisations. HOWEVER, I can't run basic descriptive statistics on the data, as it gives me the categorical summary instead of the numeric one.

I'm very much a beginner so any pointers would be greatly appreciated.

I've tried converting to floats instead as I read there used to be some issues with int64. A lot of the lines in the function kinda feel unnecessary, but the numbers weren't changing properly until all those lines were there. The numbers are now showing what I want them to but they still count as objects for descriptive statistics and other functions.


Solution

  • This happens because a1data.loc[:, 'Total(%)'] = ... assigns into the existing column, which keeps its original object dtype. Instead, replace the column with the new Series:

    a1data['Total(%)'] = convert_object_to_int(a1data['Total(%)'])
    
    print(a1data.dtypes)
    # Total(%)    int64
    # dtype: object
    

    Also note that you do not need to reassign every intermediate in your function; you could simplify it to:

    def convert_object_to_int(column):
        column = pd.to_numeric(column.astype(str).str.rstrip('%'),
                               errors='coerce')
        return column.fillna(column.median()).astype(int)
    

    Or without any intermediate variable:

    def convert_object_to_int(column):
        return (
            pd.to_numeric(column.astype(str).str.rstrip('%'), errors='coerce')
              .pipe(lambda x: x.fillna(x.median()))
              .astype(int)
        )
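
To see the difference between the two assignment styles side by side, here is a minimal sketch on made-up data. The `raw` values and the names `via_loc` and `via_setitem` are invented for illustration and are not from the question. It assumes pandas 2.0 or later, where `.loc[:, col] = ...` tries to write into the existing column in place; older versions may behave differently.

```python
import pandas as pd


def convert_object_to_int(column):
    # Same logic as the function above: strip '%', parse, fill gaps with the median.
    column = pd.to_numeric(column.astype(str).str.rstrip('%'), errors='coerce')
    return column.fillna(column.median()).astype(int)


# Made-up sample data, explicitly stored as object dtype for the demonstration.
raw = pd.Series(['100%', '75%', 'n/a', '50%'], dtype=object)

# 1) Assigning through .loc[:, col] writes into the existing object column.
via_loc = pd.DataFrame({'Total(%)': raw.copy()})
via_loc.loc[:, 'Total(%)'] = convert_object_to_int(via_loc['Total(%)'])

# 2) Plain column assignment replaces the column with the new Series.
via_setitem = pd.DataFrame({'Total(%)': raw.copy()})
via_setitem['Total(%)'] = convert_object_to_int(via_setitem['Total(%)'])

print(via_loc['Total(%)'].dtype)      # expected to stay object (assuming pandas >= 2.0)
print(via_setitem['Total(%)'].dtype)  # an integer dtype
print(via_setitem['Total(%)'].describe())
```

In both frames the values are the same numbers; only the column dtype differs. With the integer dtype, describe() should return numeric statistics (mean, std, quartiles) rather than the count/unique/top/freq summary it gives for object columns.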