python matplotlib regression data-science outliers

Python: finding outliers from a trend of data


Note that this post is not a duplicate of any of the following related posts on SO:

I was given data in an experiment:


    import matplotlib.pyplot as plt
    
    x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
    y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
    y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
    
    plt.plot(x, y_NaOH)
    plt.plot(x, y_NaHCO3)
    plt.plot(x, y_BaOH2)
    plt.show()


However, I had trouble marking the outliers. Here is what I have tried:


    import matplotlib.pyplot as plt
    import statistics
    
    x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
    y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
    y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
    
    # plt.plot(x, y_NaOH)
    # plt.plot(x, y_NaHCO3)
    # plt.plot(x, y_BaOH2)
    # plt.show()
    
    
    def detect_outlier(data_1):
        threshold = 1
        mean_1 = statistics.mean(data_1)
        std_1 = statistics.stdev(data_1)
        result_dataset = [y  for y in data_1 if abs((y - mean_1)/std_1)<=threshold ]
    
        return result_dataset
    
    
    if __name__=="__main__":
        dataset = y_NaHCO3
        result_dataset = detect_outlier(dataset)
        print(result_dataset)
        # [374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0]

This method does not do what I need: it always filters out the edge values of my data, whereas I want to remove the points that don't fit the curve.
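
To make the edge effect concrete, here is a small check on the NaHCO3 series, reusing the same mean/stdev rule as `detect_outlier`. It is only an illustration of the problem, not a fix: because the values rise steadily, the z-scores of the raw values are largest at both ends, so those are the points that get rejected.

```python
import statistics

y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0,
            704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]

mean = statistics.mean(y_NaHCO3)
stdev = statistics.stdev(y_NaHCO3)

# z-score of each raw value relative to the whole series
z_scores = [(y - mean) / stdev for y in y_NaHCO3]

# Indices rejected by the |z| <= 1 rule used in detect_outlier
rejected = [i for i, z in enumerate(z_scores) if abs(z) > 1]
print(rejected)  # [0, 1, 2, 12, 13, 14]
```

The rejected indices are the first three and last three points, which says nothing about how well each point follows the trend.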


I could also inspect the shape of the curve and mark the outliers manually, but that takes a lot of time. I would be very grateful for your help.


Expected output

I want to plot the data as a line and mark the outliers as dots, for example:


    from matplotlib import pyplot as plt
    
    x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
    y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
    y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
    
    o_NaOH = [542.2]
    o_NaHCO3 = [308.0]
    o_BaOH2 = [493.1]
    
    
    def sketch_rejected(xv, yv, y_out):
        nx = []
        ny = []
        x_out = []
        for ii, dd in enumerate(yv):
            if dd not in y_out:
                nx.append(xv[ii])
                ny.append(dd)
            else:
                x_out.append(xv[ii])
        plt.plot(nx, ny)
        plt.scatter(x_out, y_out)
    
    
    sketch_rejected(x, y_NaOH, o_NaOH)
    sketch_rejected(x, y_NaHCO3, o_NaHCO3)
    sketch_rejected(x, y_BaOH2, o_BaOH2)
    
    plt.show()


The outliers are the spiky parts of the curve, where a point doesn't fit the gradient.

Instead of manually sketching each graph and identifying the outliers, can I use a module to fit a regression to the data first and then calculate the outliers?

In real life, I have tons of test results and I don't know the general equation for each of them.

I appreciate your help.


Solution

  • There are quite a few GitHub repos for data science; all you need is a working git installation to get them.

    For example, using outliers.py:

    
        from outliers.variance import graph
        
        x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
        y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
        y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
        y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
        
        graph(
            xs=x,
            ys=[y_NaOH, y_NaHCO3, y_BaOH2],
            title='title',
            legends=[f'legend {i + 1}' for i in range(len(x))],
            xlabel='xlabel',
            ylabel='ylabel',
        )
        
    
    

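
If relying on a third-party repo is not an option, the "regress first, then look at what is left over" idea from the question can be sketched with NumPy alone. This is a minimal sketch and not a description of what outliers.py does internally: the function name `residual_outliers`, the polynomial degree of 2, and the 3.5 cut-off on a score based on the median absolute deviation (MAD) are all assumptions chosen for illustration.

```python
import numpy as np


def residual_outliers(x, y, degree=2, threshold=3.5):
    """Return indices of points that sit far from a fitted polynomial trend.

    degree and threshold are illustrative defaults, not tuned values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Fit a low-order polynomial as a stand-in for the unknown trend.
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)

    # Robust spread: median absolute deviation (MAD) of the residuals,
    # scaled by 1.4826 so it is comparable to a standard deviation.
    centre = np.median(residuals)
    mad = np.median(np.abs(residuals - centre))
    if mad == 0:
        # At least half the residuals equal the median; no usable scale.
        return []
    robust_z = np.abs(residuals - centre) / (1.4826 * mad)

    return [int(i) for i in np.flatnonzero(robust_z > threshold)]


# Synthetic check: an exact quadratic with one point pushed off the curve.
xs = list(range(15))
ys = [v ** 2 for v in xs]
ys[7] += 100
flagged = residual_outliers(xs, ys)
print(flagged)
```

Scoring the residuals rather than the raw values means a point is judged by its distance from the trend, not by its position along it. On the experimental series, whether this reproduces the hand-picked points depends on the chosen degree and threshold, so both are worth checking against a few curves marked by hand. The returned indices can be mapped back to y values, for example `[y[i] for i in idx]`, and passed to `sketch_rejected` to draw the dots.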