
Gnuplot - How to ignore outliers for the fit?


I have started working with Gnuplot and tried out a few things. Now I am wondering how to automatically exclude outliers from a fit. An example is shown in the figure: the data point at (4, 50) in the second data set is an outlier and distorts the fit. The data set is included in the code below.

I've found a similar question here, but I couldn't make it work for my example. There may be many different approaches, and I'm not very experienced with Gnuplot or similar software, so I would be glad for suggestions on a possible way to define and exclude outliers.

I'm using the gnuplottex package in LaTeX (texlive) on Windows 10. The gnuplot code:

\begin{gnuplot}[terminal=tikz, terminaloptions={color size 7cm,5cm}]
reset session

$Data <<EOD
#data
x   y1  y2  y3  y4
1   1   6   4   2   
2   4   10  1   1   
3   9   15  0   0.5 
4   16  50  1   2   
5   25  31  4   5   
6   36  42  9   12  
7   49  55  30  23
EOD

datafile = 'data.dat'
set print 'parameters.dat'

#_____________Set the label for data points________________________
set key top left                            # set position of legend
set key Left                                # set raggedleft
set key samplen 2 spacing 1.2 font ",8" # set fontsize and spacing
set key noautotitle 

###1__________Define function and number of columns_________________________
f(x,a,b,c) = a*(x-b)**2 + c
colMin = 2
colMax = 5
set fit quiet nolog
array A[colMax]
array B[colMax]
array C[colMax]

do for [col=colMin:colMax] {
    a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
    fit f(x,a,b,c) datafile u 1:col via a,b,c
    A[col] = a;  B[col] = b;  C[col] = c
    
    print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}

plot for [col=colMin:colMax] datafile u 1:col ls col, \
     for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col, \
     for [col=colMin:colMax] keyentry w lp ls col \
     title sprintf("$y%d$",col-1)
\end{gnuplot}

Solution

  • As mentioned in the comments, you first have to define what you consider an outlier. There are certainly several ways to do that. I'm not claiming that this is the best way; just consider it a starting point.

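    Some comments:

    The script first fits all data, including the outliers. Every point whose absolute distance from this first fit is at least OutlierDist is then treated as an outlier and written as NaN to the table $NOOUTLIERS, so that the second fit ignores it. This can certainly be optimized.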
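
    One way to avoid a hand-picked outlier distance, shown here only as a sketch: derive it from the residuals of the first fit, for example as a multiple of the median absolute residual. The factor k = 3 is an assumption for illustration, not a recommendation, and the datablock holds only the x and y2 columns from the question. The stats command and its STATS_median variable are standard gnuplot.

    ```gnuplot
    ### sketch: derive the outlier distance from the residuals of the first fit
    reset session

    # x and y2 columns from the question
    $Data <<EOD
    1   6
    2  10
    3  15
    4  50
    5  31
    6  42
    7  55
    EOD

    f(x,a,b,c) = a*(x-b)**2 + c
    set fit quiet nolog

    # first fit, outlier still included
    a=1; b=1; c=4
    fit f(x,a,b,c) $Data u 1:2 via a,b,c

    # median of the absolute residuals as a robust measure of the typical scatter
    stats $Data u (abs($2 - f($1,a,b,c))) nooutput
    k = 3                                  # hypothetical factor, tune for your data
    OutlierDist = k * STATS_median
    print sprintf("median |residual| = %.3f  -->  OutlierDist = %.3f", STATS_median, OutlierDist)
    ### end of sketch
    ```

    The median is used instead of the rms of the residuals because a single large outlier inflates the rms, which can hide the outlier in a data set with only a few points.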

    Data: "SO77774328.dat"

    x   y1  y2  y3  y4
    1    1    6   4    2
    2    4   10   1    1
    3    9   15   0    0.5
    4   16   50   1    2
    5   25   31   4    5
    6   36   42   9   12
    7   49   55  30   23
    

    Script:

    ### remove outliers for fitting
    reset session
    
    FILE     = "SO77774328.dat"
    PARAMS_1 = "SO77774328_1.par"
    PARAMS_2 = "SO77774328_2.par"
    
    f(x,a,b,c) = a*(x-b)**2 + c
    colMin = 2
    colMax = 5
    set fit quiet nolog
    array A[colMax]
    array B[colMax]
    array C[colMax]
    
    set print PARAMS_1
    do for [col=colMin:colMax] {
        a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
        fit f(x,a,b,c) FILE u 1:col via a,b,c
        A[col] = a;  B[col] = b;  C[col] = c
        print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
    }
    unset print
    
    # write data to table with outliers --> NaN
    OutlierDist = 10   # outlier distance
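    # dev() returns NaN if a point is at least OutlierDist away from the first fit, otherwise its y-value
    dev(colX,colY) = abs(column(colY)-f(column(colX),A[colY],B[colY],C[colY])) >= OutlierDist ? NaN :  column(colY)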
    set table $NOOUTLIERS
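        # one data set (index) per column: x, y (NaN for outliers), original y of the outliers (NaN otherwise)
        do for [colY=colMin:colMax] {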
            plot FILE u 1:(v0=dev(1,colY)):(v0!=v0?column(colY):NaN) lc var
        }
    unset table
    
    # fit again
    set print PARAMS_2
    do for [col=colMin:colMax] {
        i = col-colMin   # datablock index
        a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
        fit f(x,a,b,c) $NOOUTLIERS index i u 1:2 via a,b,c
        A[col] = a;  B[col] = b;  C[col] = c
        print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
    }
    unset print
    
    set key noautotitle left top
    
    plot for [col=colMin:colMax] FILE u 1:col ls col-1, \
         for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col-1, \
         for [col=colMin:colMax] keyentry w lp ls col-1 title sprintf("y%d",col-1), \
         $NOOUTLIERS u 1:(valid(2) ? NaN : column(3)) w p pt 6 ps 2 lc "red" ti "Outlier"
    ### end of script
    

    Result:

    (Figure: the data sets with their refitted curves; the point detected as an outlier is marked with a red circle.)
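
    As a possible shortcut when the outliers do not need to be plotted separately, the table can be skipped and the outliers masked directly in the using expression of the second fit. This is a minimal sketch for the x and y2 columns only; the helper clean() and the names a0, b0, c0 are made up for illustration, and it relies on fit skipping points whose y-value evaluates to NaN.

    ```gnuplot
    ### sketch: second fit with outliers masked directly in the using expression
    reset session

    # x and y2 columns from the question
    $Data <<EOD
    1   6
    2  10
    3  15
    4  50
    5  31
    6  42
    7  55
    EOD

    f(x,a,b,c) = a*(x-b)**2 + c
    set fit quiet nolog

    # first pass, outlier still included
    a=1; b=1; c=4
    fit f(x,a,b,c) $Data u 1:2 via a,b,c
    a0=a; b0=b; c0=c                       # freeze the first-pass parameters

    OutlierDist = 10
    # y if the point lies within OutlierDist of the first-pass curve, otherwise NaN
    clean(x,y) = abs(y - f(x,a0,b0,c0)) >= OutlierDist ? NaN : y

    # second pass, points masked as NaN are not used by fit
    a=1; b=1; c=4
    fit f(x,a,b,c) $Data u 1:(clean($1,$2)) via a,b,c

    print sprintf("first pass:  a=%.4f b=%.4f c=%.4f", a0, b0, c0)
    print sprintf("second pass: a=%.4f b=%.4f c=%.4f", a, b, c)

    plot $Data u 1:2 w p pt 7 ti "data", \
         f(x,a0,b0,c0) dt 2 ti "fit with outlier", \
         f(x,a,b,c) ti "fit without outlier"
    ### end of sketch
    ```

    Freezing the first-pass parameters in a0, b0, c0 keeps the outlier mask independent of the values that the second fit is adjusting. The table-based script above remains the better choice if the outliers should be shown in the plot.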