pythonpandasoutliers

Python perform an column operation by group


I am trying to remove outliers columns (Property1, Property2 so forth) on a group level and need your help.

Here is my example dataframe with 2 groups and 4 properties. Edited below

Following is the code I used when I had only one group. I am using GESD method to remove outliers and using generalizedESD function of PyAstronomy library.

for column in dataset.iloc[:, 1:]:        #iterate through each column that needs outlier removed
    column1 = dataset[column].dropna()      #drop na
    column1.reset_index(drop=True, inplace=True)  # reset index to account for removed NAs
    length = len(column1)                        #calculate number of rows
    max_outlier = int(length*0.25)               #set max number of outliers
    if length>2:
        r = pyasl.generalizedESD(column1, max_outlier, 0.05, fullOutput=False)   
        i = r[1]                                                   # array of indices of the outliers  
        a = (column1.loc[i])                                        # array of outlier values
        for x in a:
            dataset.loc[dataset[column]==x, column] = np.nan   #replace outlier with nan

I would like to do this on a group level, where this operation runs 2 times for each of two groups and the result is the single df with outliers replaced with NAN. Can anyone suggest a way of doing this? I tried wrapping above code in a function and use groupby and apply. But I did not get desired result.

Edit 1: To answer question by JustLearning.

As per my understanding:dropna() removes NA rows from the column but does not reset the indices, which causes issue when pyasl function is applied. Because pyasl will identify indices of outliers based on new indices.

Using your example dataset below, I added some NAs. When you run this code, you will notice some of the outliers are not replaced. Because index 10 identified by pyasl as outlier in column (after dropna) is not same index in original column.

import numpy as np                                                                                      
import pandas as pd
import random                                                                                     
                                                                                                        
from PyAstronomy import pyasl                                                                           
                                                                                                        
# Create synthetic data of three groups with random outliers (values ~20)                                               
n = 50                                                                                                  
dataset = pd.DataFrame({                                                                                
    "Group":                                                                                            
    np.random.choice(["A", "B", "C"], size=n),                                                          
    "P1":                                                                                               
    np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n),                              
    "P2":                                                                                               
    np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n),                              
    "P3":                                                                                               
    np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n),                              
    "P4":                                                                                               
    np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n)                               
})  

#add NAs to property columns
p = 0.2
mask = np.random.choice([True, False], size=[50,4], p =[p,1-p])
dataset.iloc[:,1:] =  dataset.iloc[:,1:].mask(mask)  

dataset = dataset.set_index("Group")   

#function to replace outlier by column
def replace_outliers_columnwise(col):                                                                   
    max_outlier = len(col) // 4  # Your choice, be careful if col has only a few rows.                                                                         
    _, idxs = pyasl.generalizedESD(col, max_outlier)                                                        
    col.iloc[idxs] = np.nan                                                                             
    return col  
def replace_outliers_groupwise(group):                                                                  
    return group.apply(replace_outliers_columnwise)                                                     
                                                                                                                                                                                                              
dataset = dataset.groupby("Group").apply(replace_outliers_groupwise)                                    
print(dataset)                                                                                        
                                                                                                                                                   

Solution

  • First off, your example is prone to failing because if the number of columns per group is too low, you end up with max_outlier = 0, which generalizedESD doesn't like. Therefore, let's start with a better example dataset:

    import numpy as np                                                                                      
    import pandas as pd                                                                                     
                                                                                                            
    from PyAstronomy import pyasl                                                                           
                                                                                                            
    # Create synthetic data of three groups with random outliers (values ~20)                                               
    n = 50                                                                                                  
    dataset = pd.DataFrame({                                                                                
        "Group":                                                                                            
        np.random.choice(["A", "B", "C"], size=n),                                                          
        "P1":                                                                                               
        np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n),                              
        "P2":                                                                                               
        np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n),                              
        "P3":                                                                                               
        np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n),                              
        "P4":                                                                                               
        np.random.rand(n) + np.random.choice([0., 20.], p=[0.9, 0.1], size=n)                               
    })                                                                                                      
                                                                                                            
    dataset = dataset.set_index("Group")
    

    Next, you can replace values on the whole dataset at once column-by-column like this:

    def replace_outliers_columnwise(col):                                                                   
        max_outlier = len(col) // 4  # Your choice, be careful if col has only a few rows.                                                                         
        _, idxs = pyasl.generalizedESD(col, max_outlier)                                                        
        col.iloc[idxs] = np.nan                                                                             
        return col                                                                                          
                                                                                                            
    dataset = dataset.apply(replace_outliers_columnwise, axis=0)                                            
    print(dataset)                                                                  
    

    This will print the exact same dataset you start with but with NaNs replacing the outliers.

    Note in the above that generalizedESD doesn't just return something but it returns a tuple of things, the relevant one being what I called idxs, an array containing the indexes of the elements detected as outliers.

    Instead, if for some reason you want to perform this operation group-wise, you can group-then-apply to each group, like you mention:

    def replace_outliers_groupwise(group):                                                                  
        return group.apply(replace_outliers_columnwise)                                                     
                                                                                                            
                                                                                                            
    dataset = dataset.groupby("Group").apply(replace_outliers_groupwise)                                    
    print(dataset)
    

    which will print the same dataframe but now with slightly different indexes denoting the different groups into which the dataframe was split.

    All this said, unless you're interested in applying different parameters to generalizedESD to each group, it is not clear why you wouldn't just apply to each column directly instead of grouping first.