I'm trying to find outliers in all columns of a dataframe with python.
Steps:
I'm completely new to machine learning and data science. I only know Python and pandas, and I'm currently expanding my knowledge into machine learning. I don't yet know much of the theory, such as which data types machine learning algorithms can handle or why missing values are a problem.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2768 entries, 14421 to 98025
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 2768 non-null datetime64[ns]
1 location 2768 non-null object
2 new_deaths 2768 non-null float64
3 female_smokers 2768 non-null float64
4 male_smokers 2768 non-null float64
5 population 2768 non-null float64
6 people_vaccinated 2768 non-null float64
7 cardiovasc_death_rate 2768 non-null float64
8 aged_65_older 2768 non-null float64
9 gdp_per_capita 2768 non-null float64
..... #The rest are indicator columns with dummy values that were categorical columns before.
dtypes: datetime64[ns](1), float64(8), object(1)
I wrote a function that computes the IQR and returns the indices and values of the outliers.
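import numpy as np

def find_outliers_tukey(x):
    # Quartiles and interquartile range of the column
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    # Tukey's fences: values outside [floor, ceiling] count as outliers
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x[outlier_indices])
    return outlier_indices, outlier_values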
When I call the function:
tukey_indices, tukey_values = find_outliers_tukey(df.new_deaths)
print(f"Outliers in new deaths are {np.sort(tukey_values)}")
output:
Outliers in new deaths are []
Why does this return no outliers? See the column statistics below.
# Statistics of the new deaths column
Mean = 145.745266
std = 796.284067
min = -1918.000000
25% = 0.000000
50% = 2.000000
75% = 18.000000
max = 18000.000000
Note: Looking at the stats, there is probably something seriously wrong with the data: a count of new deaths should never be negative, yet the minimum is -1918.
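As a sanity check, Tukey's fences can be worked out by hand from the quartiles quoted above (25% = 0, 75% = 18). This is only a sketch: the helper name `tukey_fences` is made up for illustration, and it assumes the statistics describe the same data that was passed to the function.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (floor, ceiling) for Tukey's rule, given the two quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles, minimum and maximum copied from the statistics quoted above.
floor, ceiling = tukey_fences(0.0, 18.0)
print(floor, ceiling)       # -27.0 45.0
print(-1918.0 < floor)      # True: the minimum lies below the floor
print(18000.0 > ceiling)    # True: the maximum lies above the ceiling
```

With these numbers both the minimum and the maximum fall outside the fences, so an empty result suggests the function may have been run on different data from the data these statistics describe (for example, a filtered or re-assigned frame).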
When I loop over all the columns:
for feature in df.columns:
    tukey_indices, tukey_values = find_outliers_tukey(feature)
    print(f"Outliers in {feature} are {tukey_values} \n")
output:
UFuncTypeError Traceback (most recent call last)
<ipython-input-16-b01dad9e55a2> in <module>()
1 for feature in df.columns:
----> 2 tukey_indices, tukey_values = find_outliers_tukey(feature)
3 print(f"Outliers in {feature} are {tukey_values} \n")
4 frames
<__array_function__ internals> in percentile(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py in _quantile_ureduce_func(a, q, axis, out, overwrite_input, interpolation, keepdims)
3965 n = np.isnan(ap[-1:, ...])
3966
-> 3967 x1 = take(ap, indices_below, axis=axis) * weights_below
3968 x2 = take(ap, indices_above, axis=axis) * weights_above
3969
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')
What does this error mean, and why am I getting it?
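The error came from how I passed my argument to the find_outliers_tukey function: looping over df.columns yields column names, so np.percentile received a string rather than the column's values, which is why the message mentions the string dtype '<U32'. Switching to pandas' quantile and passing the dataframe together with the column name worked for me: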
import pandas as pd

def find_outliers_tukey(df: pd.DataFrame, feature: str) -> list:
    """Return the index labels of rows whose `feature` value lies outside Tukey's fences."""
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(df.index[(df[feature] < floor) | (df[feature] > ceiling)])
    # outlier_values = list(df[feature][outlier_indices])
    # print(f"outliers are {outlier_values} at indices {outlier_indices}")
    # return outlier_indices, outlier_values
    return outlier_indices
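To illustrate the point about how the argument was passed: iterating over `df.columns` yields column labels (strings), not the columns themselves. A minimal sketch on a made-up two-column frame follows; the data and the names `toy` and `numeric_columns` are invented for illustration and are not from the dataset above.

```python
import pandas as pd

# Toy frame, invented for illustration only.
toy = pd.DataFrame({"location": ["A", "B", "C"], "new_deaths": [0.0, 2.0, 18.0]})

for feature in toy.columns:
    # `feature` is just the label; `toy[feature]` is the actual column.
    print(type(feature).__name__, type(toy[feature]).__name__)

# Quantiles only make sense for numeric data, so one option is to
# restrict the loop to numeric columns.
numeric_columns = list(toy.select_dtypes(include="number").columns)
print(numeric_columns)
```

Selecting numeric columns this way is one alternative to slicing `df.columns` by position, since it does not depend on the column order.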
I put all the columns I wanted to remove outliers from into a list.
df_columns = list(df.columns[1:56])
The loop itself is unchanged, except that find_outliers_tukey now takes two arguments instead of one. I also stored the indices of the outliers for future use.
index_list = []
for feature in df_columns:
    index_list.extend(find_outliers_tukey(df, feature))
This gave me better statistical results for the columns.
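One possible way to use the stored indices, sketched on invented data: because a row can be an outlier in more than one column, `index_list` may contain duplicates, so it is de-duplicated before any rows are dropped. The frame `toy`, the helper `iqr_outlier_index`, and the choice to drop the rows are all assumptions for illustration, not necessarily what was done above.

```python
import pandas as pd

def iqr_outlier_index(df, feature):
    """Index labels of rows outside Tukey's fences for one column."""
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    mask = (df[feature] < q1 - 1.5 * iqr) | (df[feature] > q3 + 1.5 * iqr)
    return list(df.index[mask])

# Toy frame, invented for illustration: the last row is extreme in both columns.
toy = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 100.0], "b": [10.0, 11.0, 12.0, 13.0, 500.0]}
)

index_list = []
for feature in toy.columns:
    index_list.extend(iqr_outlier_index(toy, feature))

# The same row was flagged once per column, so remove duplicates first.
unique_indices = sorted(set(index_list))
cleaned = toy.drop(index=unique_indices)
print(index_list, unique_indices, len(cleaned))
```

Whether dropping flagged rows is appropriate depends on the data; a value outside the fences is not necessarily an error.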