I am currently trying to classify a number of rivers with regard to their behavior. Many of the rivers behave very similarly to a second-degree polynomial.
However, some of the rivers have some areas where they diverge from this pattern.
I want to classify this by calculating how far all points are from the simple polynomial. So it would basically look something like this:
But to be able to do this, I have to fit the polynomial to only those points that show "normal behavior". Otherwise my polynomial is pulled toward the diverging behavior and I cannot calculate the distances correctly.
Here is some example data.
x_test = [-150,-140,-130,-120,-110,-100,-90,-80,-70,-60,-50,-40,-30,-20,-10,0,10,20,30,40,50,60,70,70,80,80,90,90,100,100]
y_test = [0.1,0.11,0.2,0.25,0.25,0.4,0.5,0.4,0.45,0.6,0.5,0.5,0.6,0.6,0.7, 0.7,0.65,0.8,0.85,0.8,1,1,1.2,0.8,1.4,0.75,1.4,0.7,2,0.5]
I can fit a polynomial to it with NumPy.

import numpy as np

fit = np.polyfit(x_test, y_test, deg=2, full=True)
polynom = np.poly1d(fit[0])
x = np.linspace(min(x_test), max(x_test), 200)
simulated_data = polynom(x)
When I plot it, I get the following:

import matplotlib.pyplot as plt

ax = plt.gca()
ax.scatter(x_test, y_test)
ax.plot(x, simulated_data)
As you can see, the polynomial is shifted slightly downward, which is caused by the points marked black here:
Is there a straightforward way to find the points that do not follow the main trend and exclude them when fitting the polynomial?
This looks like an AI problem more than a plain fitting problem: how do you personally decide what doesn't fit? In particular, in your second diverging graph, the short first upward curve looks polynomial if you ignore the larger curve.
You only need 3 points to determine a second-degree polynomial: how about computing curves for many samplings of 3 well-horizontally-spaced points (you can't necessarily trust the first or last point) and seeing which one produces the fewest outliers, i.e. points that are further away than 90% of the others?
You can then compute the curve from the remaining non-outlier points and check that it fits your trivially computed curve.
Edit: 'well spaced' was intended to mean one point from each horizontal third of the data; there's no point in using three points all jammed up together to try to extrapolate to the others. Also, from the looks of your supplied data, you want a curve starting around the origin and going up, so you could filter out some of the randomly generated curves on that basis anyway.
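A minimal sketch of this random-sampling idea in Python, using the question's data. The 500-iteration count, the median-residual scoring (a least-median-of-squares-style criterion standing in for the 90% outlier rule), and the 2.5 cut-off factor are my choices, not something from the original post:

```python
import numpy as np

# Data from the question.
x = np.array([-150, -140, -130, -120, -110, -100, -90, -80, -70, -60,
              -50, -40, -30, -20, -10, 0, 10, 20, 30, 40,
              50, 60, 70, 70, 80, 80, 90, 90, 100, 100], dtype=float)
y = np.array([0.1, 0.11, 0.2, 0.25, 0.25, 0.4, 0.5, 0.4, 0.45, 0.6,
              0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.65, 0.8, 0.85, 0.8,
              1, 1, 1.2, 0.8, 1.4, 0.75, 1.4, 0.7, 2, 0.5])

rng = np.random.default_rng(0)
# One index from each horizontal third, so every sample is well spaced.
thirds = np.array_split(np.argsort(x), 3)

best_score = np.inf
best_poly = None
for _ in range(500):
    idx = [rng.choice(t) for t in thirds]
    coeffs = np.polyfit(x[idx], y[idx], deg=2)  # exact quadratic through 3 points
    score = np.median(np.abs(y - np.polyval(coeffs, x)))  # robust to outliers
    if score < best_score:
        best_score = score
        best_poly = np.poly1d(coeffs)

# Keep points whose distance from the best candidate is modest compared
# with the typical distance, then refit on the inliers only.
# The factor 2.5 is a tuning guess, not a magic constant.
resid = np.abs(y - best_poly(x))
inliers = resid < 2.5 * best_score
final = np.poly1d(np.polyfit(x[inliers], y[inliers], deg=2))
print(f"kept {inliers.sum()} of {len(x)} points")
```

This is essentially a hand-rolled RANSAC; `sklearn.linear_model.RANSACRegressor` on polynomial features would be a packaged alternative.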
Edit: the outlier suggestion was sloppy. If your data gets wider toward the end, like a trumpet, you have a number of plausible fits, so it's only where the data forms obvious spurs that you have a clear marker for outliers. If you compute a histogram of points versus distance from each random curve, you could scan for shoulders and asymmetries that take the histogram away from a bell curve, and slice off the outliers at that point.
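A crude version of that histogram check, on synthetic data with a deliberate spur. Cutting at the first empty bin is my simplification of the shoulder-scanning idea; all the constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-150, 100, 30)
y = 1e-4 * (x + 150) ** 2 + rng.normal(0, 0.03, x.size)  # clean quadratic + noise
y[-4:] += np.array([0.4, -0.5, 0.8, -0.6])               # a deliberate spur

poly = np.poly1d(np.polyfit(x, y, deg=2))
resid = np.abs(y - poly(x))

# For well-behaved data the absolute residuals pile up near zero like
# half a bell curve; spur points form a detached clump further out.
counts, edges = np.histogram(resid, bins=10)
first_empty = int(np.argmax(counts == 0))  # first gap in the histogram
cut = edges[first_empty]
outliers = resid > cut
```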
Fundamentally, I think the data is potentially too complex for anything more than computer-aided analysis, unless you break out computer-vision techniques: have the computer do the best it can, then visually inspect the annotated graphs to see whether you agree with it.
It might also help to plot the vertical axis on a log scale: if the growth is closer to exponential than quadratic, you'd be dealing with straight lines.
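In matplotlib that is a one-liner (a few sample points from the question's data, for illustration; note that a true quadratic will still curve on a log axis):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

x_test = [-150, -140, -130, 0, 50, 100]
y_test = [0.1, 0.11, 0.2, 0.7, 1.0, 2.0]

fig, ax = plt.subplots()
ax.scatter(x_test, y_test)
ax.set_yscale("log")  # exponential trends become straight lines here
```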