python matplotlib regression data-science outliers

Python: finding outliers from a trend of data

Note that this post is not a duplicate of any of the following related posts on SO:

I was given the following data from an experiment:


    import matplotlib.pyplot as plt
    
    x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
    y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
    y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
    
    plt.plot(x, y_NaOH)
    plt.plot(x, y_NaHCO3)
    plt.plot(x, y_BaOH2)
    plt.show()

However, I had trouble marking the outliers. Here is what I have tried:


    import matplotlib.pyplot as plt
    import statistics
    
    x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
    y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
    y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
    
    # plt.plot(x, y_NaOH)
    # plt.plot(x, y_NaHCO3)
    # plt.plot(x, y_BaOH2)
    # plt.show()
    
    
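    def detect_outlier(data_1):
        # Keep only the values whose z-score (distance from the mean,
        # measured in standard deviations) is within the threshold.
        threshold = 1
        mean_1 = statistics.mean(data_1)
        std_1 = statistics.stdev(data_1)
        result_dataset = [y for y in data_1 if abs((y - mean_1) / std_1) <= threshold]
    
        return result_dataset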
    
    
    if __name__=="__main__":
        dataset = y_NaHCO3
        result_dataset = detect_outlier(dataset)
        print(result_dataset)
        # [374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0]

However, this method always filters out the edge values of my data, whereas what I actually want is to remove the points that don't fit the curve.

I could inspect the shape of each curve and mark the outliers manually, but that takes a lot of time. I would be very grateful for your help.

Expected output

I want to draw the data as a line and mark the outliers as dots, for example:


    from matplotlib import pyplot as plt
    
    x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
    y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
    y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
    
    o_NaOH = [542.2]
    o_NaHCO3 = [308.0]
    o_BaOH2 = [493.1]
    
    
    def sketch_rejected(xv, yv, y_out):
        # Split the series into kept points (drawn as a line)
        # and rejected points (drawn as dots).
        nx = []
        ny = []
        x_out = []
        for ii, dd in enumerate(yv):
            if dd not in y_out:
                nx.append(xv[ii])
                ny.append(dd)
            else:
                x_out.append(xv[ii])
        plt.plot(nx, ny)
        plt.scatter(x_out, y_out)
    
    
    sketch_rejected(x, y_NaOH, o_NaOH)
    sketch_rejected(x, y_NaHCO3, o_NaHCO3)
    sketch_rejected(x, y_BaOH2, o_BaOH2)
    
    plt.show()

The outliers are the spiky parts of the curve, where a point doesn't fit the gradient of the curve.

Instead of manually sketching each graph and identifying the outliers, can I use a module to fit a regression to the data first and then calculate the outliers?

In real life, I have tons of test results and I don't know the general equation for each.

I appreciate your help.

Solution

There are quite a few data-science repos on GitHub; all you need is a working git installation.

For example, using outliers.py:


    from outliers.variance import graph
    
    x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]
    y_NaHCO3 = [232.0, 308.0, 322.0, 374.0, 436.0, 494.0, 592.0, 660.0, 704.0, 824.0, 900.0, 958.0, 1048.0, 1138.0, 1232.0]
    y_BaOH2 = [493.1, 533.1, 549.1, 607.1, 665.1, 731.1, 797.1, 867.1, 971.1, 1007.1, 1091.1, 1221.1, 1311.1, 1371.1, 1497.1, ]
    
    graph(
        xs=x,
        ys=[y_NaOH, y_NaHCO3, y_BaOH2],
        title='title',
        legends=[f'legend {i + 1}' for i in range(len(x))],
        xlabel='xlabel',
        ylabel='ylabel',
    )
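
If you would rather not depend on a third-party repo, the "regress first, then calculate the outliers" idea from the question can be sketched with NumPy alone. This is only a rough sketch under stated assumptions, not what outliers.py does internally: the function name `find_trend_outliers` and its `degree` and `threshold` parameters are made up for illustration, the trend is assumed to be roughly quadratic, and the residuals are scored with the commonly used modified z-score (median and MAD, cutoff 3.5), so that points are judged by their distance from the fitted curve rather than from the mean of the raw values.

```python
import numpy as np


def find_trend_outliers(x, y, degree=2, threshold=3.5):
    """Return a boolean mask that is True where y deviates from a fitted trend.

    Hypothetical helper: `degree` and `threshold` are assumptions to tune.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Least-squares polynomial fit of the assumed degree.
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)

    # Robust spread of the residuals: median absolute deviation (MAD).
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))
    if mad == 0:
        # No measurable spread, so nothing can be scored.
        return np.zeros(len(y), dtype=bool)

    # Modified z-score; 0.6745 makes it comparable to an ordinary z-score
    # when the residuals are normally distributed.
    robust_z = 0.6745 * (residuals - med) / mad
    return np.abs(robust_z) > threshold


x = [22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50]
y_NaOH = [94.2, 146.2, 222.2, 276.2, 336.2, 372.2, 428.2, 542.2, 576.2, 684.2, 766.2, 848.2, 904.2, 1042.2, 1136.2]

mask = find_trend_outliers(x, y_NaOH)
o_NaOH = [v for v, flagged in zip(y_NaOH, mask) if flagged]
print(o_NaOH)
```

The resulting list can be passed as `y_out` to the `sketch_rejected` function from the question. Which points get flagged depends on `degree` and `threshold`, so it is worth comparing the result against a few curves you have marked by hand before trusting it on the full set of test results.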