python pandas data-science outliers iqr

ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')


Context

I'm trying to find outliers in every column of a pandas DataFrame using Python.

Steps:

  1. Created a function to find outliers via the IQR.
  2. Tested the function on one column.
  3. Implemented the function on all columns with a for loop.

My level

I'm completely new to machine learning and data science. I know Python and pandas, and I'm now expanding into machine learning. I don't yet know much of the theory, such as which data types machine learning algorithms can handle or why missing values are a problem.

Overview of the data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2768 entries, 14421 to 98025
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   date                   2768 non-null   datetime64[ns]
 1   location               2768 non-null   object        
 2   new_deaths             2768 non-null   float64       
 3   female_smokers         2768 non-null   float64       
 4   male_smokers           2768 non-null   float64       
 5   population             2768 non-null   float64       
 6   people_vaccinated      2768 non-null   float64       
 7   cardiovasc_death_rate  2768 non-null   float64       
 8   aged_65_older          2768 non-null   float64       
 9   gdp_per_capita         2768 non-null   float64     
..... #The rest are indicator columns with dummy values that were categorical columns before.  
dtypes: datetime64[ns](1), float64(8), object(1)

Code to find outliers in one column

I wrote a function that computes the IQR and returns the indices and values of the outliers.

import numpy as np

def find_outliers_tukey(x):
  # x is expected to be a numeric pandas Series
  q1 = np.percentile(x, 25)
  q3 = np.percentile(x, 75)

  # Tukey fences: 1.5 * IQR below Q1 and above Q3
  iqr = q3 - q1
  floor = q1 - 1.5 * iqr
  ceiling = q3 + 1.5 * iqr

  outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
  outlier_values = list(x[outlier_indices])

  return outlier_indices, outlier_values

When I call the function:

tukey_indices, tukey_values = find_outliers_tukey(df.new_deaths)
print(f"Outliers in new deaths are {np.sort(tukey_values)}")

output:

Outliers in new deaths are []

Question 1

Why does this return no outliers? See the column statistics below.

# Statistics of the new deaths column

Mean = 145.745266
std = 796.284067    
min = -1918.000000
25% = 0.000000
50% = 2.000000
75% = 18.000000
max = 18000.000000

Note: Judging by these statistics (a negative minimum, and a maximum of 18000 against a median of 2), there is probably something seriously wrong with the data.
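As a sanity check, the Tukey fences can be worked out by hand from the quartiles listed above. This is only a sketch using the numbers in the summary (the variable names are mine). It suggests that at least the minimum and the maximum fall outside the fences, so an empty result would point at the data actually passed to the function rather than at the formula.

```python
# Quartiles, minimum and maximum copied from the summary statistics above
q1, q3 = 0.0, 18.0
col_min, col_max = -1918.0, 18000.0

iqr = q3 - q1                # interquartile range
floor = q1 - 1.5 * iqr       # lower Tukey fence
ceiling = q3 + 1.5 * iqr     # upper Tukey fence

print(f"floor={floor}, ceiling={ceiling}")   # floor=-27.0, ceiling=45.0
print("min is an outlier:", col_min < floor)
print("max is an outlier:", col_max > ceiling)
```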

Code to find outliers in all columns (for loop)

for feature in df.columns:
  tukey_indices, tukey_values = find_outliers_tukey(feature)
  print(f"Outliers in {feature} are {tukey_values} \n")

output:

UFuncTypeError                            Traceback (most recent call last)
<ipython-input-16-b01dad9e55a2> in <module>()
      1 for feature in df.columns:
----> 2   tukey_indices, tukey_values = find_outliers_tukey(feature)
      3   print(f"Outliers in {feature} are {tukey_values} \n")

4 frames
<__array_function__ internals> in percentile(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py in _quantile_ureduce_func(a, q, axis, out, overwrite_input, interpolation, keepdims)
   3965             n = np.isnan(ap[-1:, ...])
   3966 
-> 3967         x1 = take(ap, indices_below, axis=axis) * weights_below
   3968         x2 = take(ap, indices_above, axis=axis) * weights_above
   3969 

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')

Question 2

What does this error mean, and why am I getting it?


Solution

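  • The problem was how I passed the argument to find_outliers_tukey. Looping over df.columns yields the column names (strings), so np.percentile received a string instead of a Series, and NumPy cannot multiply strings (that is the '<U32' in the error). These changes worked for me: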

    step 1

    1. Give the function two parameters: one for the DataFrame and one for the name of the feature (column).
    2. Index the DataFrame with the feature name explicitly, i.e. df[feature].
    3. Don't use attribute access (df.feature) to get the column, and use the pandas quantile method instead of np.percentile.
    def find_outliers_tukey(df: "DataFrame", feature: "str") -> "list":
      """Return the index labels of the rows that are Tukey (1.5 * IQR) outliers in column `feature`."""

      q1 = df[feature].quantile(0.25)
      q3 = df[feature].quantile(0.75)

      iqr = q3 - q1
      floor = q1 - 1.5 * iqr
      ceiling = q3 + 1.5 * iqr

      outlier_indices = list(df.index[(df[feature] < floor) | (df[feature] > ceiling)])
      # The outlier values are no longer returned; if needed: list(df.loc[outlier_indices, feature])
      return outlier_indices
    

    step 2

    I put all the columns I wanted to remove outliers from into a list.

    df_columns = list(df.columns[1:56])
    

    step 3

    No change to the loop itself, except that find_outliers_tukey now takes two arguments instead of one. I also stored the outlier indices for future use.

    index_list = []
    
    for feature in df_columns: 
      index_list.extend(find_outliers_tukey(df, feature))
    
    

    This gave me better statistical results for the columns.
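For anyone hitting the same error, here is a minimal sketch of why the original loop failed and one way to loop safely. It runs on a small made-up DataFrame (toy_df and its values are hypothetical, not the data from the question). Restricting the loop to numeric columns with select_dtypes is my own assumption and not part of the solution above; it simply avoids passing columns such as date or location to quantile.

```python
import pandas as pd

def find_outliers_tukey(df, feature):
    """Return index labels of rows outside the 1.5 * IQR fences for one column."""
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    mask = (df[feature] < floor) | (df[feature] > ceiling)
    return list(df.index[mask])

# Hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "location": ["A", "B", "C", "D", "E"],
    "new_deaths": [0.0, 2.0, 3.0, 4.0, 1000.0],
    "population": [10.0, 11.0, 12.0, 13.0, 14.0],
})

# Iterating over df.columns yields column names (strings), not Series,
# which is what the original loop passed to np.percentile.
column_types = [type(name).__name__ for name in toy_df.columns]
print(column_types)

# Assumption: only numeric columns are meaningful for an IQR check
numeric_columns = toy_df.select_dtypes(include="number").columns

# Map each numeric column name to the index labels of its outliers
outliers = {name: find_outliers_tukey(toy_df, name) for name in numeric_columns}
print(outliers)
```

Keeping the result as a dictionary keyed by column name, rather than one flat list, makes it possible to see which column flagged which row.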