I want to control for non-numeric entries in my pandas dataframe. Say I have the following:
>>> df
col_1 col_2 col_3
0 0.01 NaN 0.1
1 NaN 0.9 0.2
2 0.01 NaN 0.3
3 0.01 0.9 0.4
I can take the row-means as follows, while python properly skips over the NaN values:
>>> df.mean(axis=1)
0 0.055000
1 0.550000
2 0.155000
3 0.436667
dtype: float64
Great!. but now suppose one of the values from my imported table is a string
>>> df.iloc[0,1]="str1"
>>> df
col_1 col_2 col_3
0 0.01 str1 0.1
1 NaN 0.9 0.2
2 0.01 NaN 0.3
3 0.01 0.9 0.4
>>> df.mean(axis=1)
0 0.055
1 0.200
2 0.155
3 0.205
dtype: float64
DANGER: the output looks plausible, but is wrong, because once I changed the value in position [0,1]
to a string, the values in position [1,1]
and [3,1]
changed from being the number 0.9
to become the string "0.9"
, and all the strings are omitted from averaging (I guess each column has to be of the same type? there's probably a reason, but boy this is dangerously subtle.)
What I want to do now is force all the entries of the dataframe back into numeric type. Anything that can be sensibly coerced into a number should become that number, and anything that cannot should become nan
(regardless of what string or type it might have been).
Pandas series have a function pandas.to_numeric
where you can set errors='coerce'
, but unfortunately the analogous function for df's (DataFrame.astype()
) doesn't allow this option.
Is there a function for "make every element of the dataFrame that looks like a number numeric, and make everything else nan
"?
I think you can use to_numeric
on a subset of columns with apply
. This answer might help.