Why am I getting a 'float' has no attribute 'fillna' error when using fillna inside a function in Pandas?


Why don't fillna and other functions work inside a function?

I have a DataFrame with 10 columns. I would like to write a function taking each column and creating multiple columns. My final DataFrame would be 50 columns.

def newVars(df,col='my_var'):
    df[col+'_filled'] = df[col].fillna(0)
    df[col+'_rank'] = df[col].fillna(0).rank()
    df[col+'_percentile'] = df[col].fillna(0).rank(pct=True)
    df[col+'_halved'] = df[col]/2
    return df

new_df = df.apply(newVars, axis=1)

I get the error: 'float' has no attribute 'fillna'

I am expecting a DataFrame with 5 times the columns of my initial DataFrame. If I take the line outside of the function it works fine:

df['my_var_filled'] = df['my_var'].fillna(0)


Solution

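  • apply doesn't make sense in your context. With axis=1, apply calls newVars once per row and passes that row as a Series, so df[col] inside the function is a single float rather than a column, and a float has no fillna method.

    Instead, pass the whole DataFrame to the function: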

    import pandas as pd
    
    df = pd.DataFrame({'my_var': [1,3,20]})
    
    def newVars(df,col='my_var'):
        df[col+'_filled'] = df[col].fillna(0)
        df[col+'_rank'] = df[col].fillna(0).rank()
        df[col+'_percentile'] = df[col].fillna(0).rank(pct=True)
        df[col+'_halved'] = df[col]/2
        return df
    
    new_df = newVars(df)
    

    Or use pipe:

    import pandas as pd
    
    df = pd.DataFrame({'my_var': [1,3,20]})
    
    def newVars(df,col='my_var'):
        df[col+'_filled'] = df[col].fillna(0)
        df[col+'_rank'] = df[col].fillna(0).rank()
        df[col+'_percentile'] = df[col].fillna(0).rank(pct=True)
        df[col+'_halved'] = df[col]/2
        return df
    
    new_df = df.pipe(newVars)
    

    Output:

       my_var  my_var_filled  my_var_rank  my_var_percentile  my_var_halved
    0       1              1          1.0           0.333333            0.5
    1       3              3          2.0           0.666667            1.5
    2      20             20          3.0           1.000000           10.0
    

    Note that in both cases your function mutates df in place and also returns it, so new_df and df end up being the same object. I would recommend doing one or the other, not both.
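
To get from 10 columns to 50, one option is a non-mutating variant of the function, called once per original column. The following is only a sketch: the name new_vars, the use of assign, and the two-column sample frame (with NaN values added so that fillna has something to do) are assumptions for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd


def new_vars(df, col):
    """Return a copy of df with four derived columns for `col`; df itself is not modified."""
    filled = df[col].fillna(0)
    return df.assign(**{
        f"{col}_filled": filled,
        f"{col}_rank": filled.rank(),
        f"{col}_percentile": filled.rank(pct=True),
        f"{col}_halved": df[col] / 2,
    })


# Hypothetical two-column frame standing in for the 10-column one in the question.
df = pd.DataFrame({"a": [1.0, np.nan, 20.0], "b": [4.0, 5.0, np.nan]})

# With axis=1, apply hands the function one row at a time, so row["a"] is a scalar.
row_value_is_float = df.apply(lambda row: isinstance(row["a"], float), axis=1).all()

# df is never mutated, so df.columns still lists only the original columns.
new_df = df
for col in df.columns:
    new_df = new_df.pipe(new_vars, col)
```

Because new_vars returns a new DataFrame through assign, df keeps its original columns and new_df holds the original columns plus four derived ones for each.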