python pandas equation coefficients trendline

How to get a trend line's equation (polynomial, order 2)


I have a simple dataframe that I want to plot together with its trend line (polynomial, order 2). However, the equation I get is obviously wrong:

y = 1.4x**2 + 6.6x + 0.9

It should be:

y = 0.22x**2 - 1.45x + 11.867

How can I get the correct equation?

import matplotlib.pyplot as plot
from scipy import stats
import numpy as np

data = [["2020-03-03",9.727273],
["2020-03-04",9.800000],
["2020-03-05",9.727273],
["2020-03-06",10.818182],
["2020-03-07",9.500000],
["2020-03-08",10.909091],
["2020-03-09",15.000000],
["2020-03-10",14.333333],
["2020-03-11",15.333333],
["2020-03-12",16.000000],
["2020-03-13",21.000000],
["2020-03-14",28.833333]]

fig, ax = plot.subplots()

dates = [x[0] for x in data]
usage = [x[1] for x in data]

bestfit = stats.linregress(range(len(usage)),usage)

equation = str(round(bestfit[0],1)) + "x**2 + " + str(round(bestfit[1],1)) + "x + " + str(round(bestfit[2],1)) 

ax.plot(range(len(usage)), usage)
ax.plot(range(len(usage)), np.poly1d(np.polyfit(range(len(usage)), usage, 2))(range(len(usage))), '--',label=equation)

plot.show()

print (equation)



Solution

  • The question needs a more precise definition; let me explain.

    You are trying to fit a second-degree polynomial (a quadratic function), using a series of dates as input and a series of values as output. The catch is that you have to define what "zero" is, that is, the reference point for the date values. Your code handles this by using the index of each date, starting from 0. That is reasonable, but you should check that it fits the problem you are trying to solve.
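One way to check whether the index is a fair stand-in for the dates is to convert the dates to "days since the first date" and compare. The sketch below is illustrative only; the variable names are my own, and it assumes the dates are ISO-formatted strings as in the question.

```python
import numpy as np

# Dates copied from the question
dates = ["2020-03-03", "2020-03-04", "2020-03-05", "2020-03-06",
         "2020-03-07", "2020-03-08", "2020-03-09", "2020-03-10",
         "2020-03-11", "2020-03-12", "2020-03-13", "2020-03-14"]

# Parse to day-resolution datetimes and measure each offset from the first date
parsed = np.array(dates, dtype="datetime64[D]")
days_since_start = ((parsed - parsed[0]) / np.timedelta64(1, "D")).astype(int)

# For consecutive daily data the offsets equal the plain index 0, 1, 2, ...
same_as_index = days_since_start.tolist() == list(range(len(dates)))
print(days_since_start.tolist(), same_as_index)
```

Here the dates are consecutive, so the offsets and the index coincide. If some days were missing, the offsets would differ from the index and would probably be the better choice for x.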

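    The equation in your code is wrong because bestfit comes from stats.linregress, which fits a straight line: bestfit[0], bestfit[1] and bestfit[2] are its slope, intercept and correlation coefficient, not quadratic coefficients. When I compute bestfit with np.polyfit instead, the same function you already use to draw the graph, I get results close to the ones you expected: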

    Polynomial Equation: 0.22x^2 + -1.02x + 10.63

    Two things may explain the difference between my results and the ones you expected:

    1. The optional rcond parameter of the calculation (see the numpy.polyfit documentation).
    2. The y values you used may have been rounded; the original data may have had more decimal places.
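A third possibility ties back to the reference point discussed above: the expected equation may have been computed with x counted from 1 instead of 0. This is an assumption, since the question does not say how the expected coefficients were obtained. A minimal sketch to compare the two choices:

```python
import numpy as np

# y values copied from the question
usage = [9.727273, 9.8, 9.727273, 10.818182, 9.5, 10.909091,
         15.0, 14.333333, 15.333333, 16.0, 21.0, 28.833333]

x_from_0 = np.arange(len(usage))   # 0, 1, ..., 11, as in the question's code
x_from_1 = x_from_0 + 1            # 1, 2, ..., 12, the assumed alternative

# np.polyfit returns coefficients with the highest power first: [a, b, c]
coeffs_from_0 = np.polyfit(x_from_0, usage, 2)
coeffs_from_1 = np.polyfit(x_from_1, usage, 2)

print("x starts at 0:", np.round(coeffs_from_0, 2))
print("x starts at 1:", np.round(coeffs_from_1, 2))
```

Shifting the origin leaves the x**2 coefficient unchanged but alters the other two. With x starting at 1, the fit should come out at roughly 0.22, -1.45 and 11.87, which is close to the equation expected in the question.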

    Here is the updated code:

    import matplotlib.pyplot as plot
    import numpy as np
    
    data = [["2020-03-03",9.727273],
    ["2020-03-04",9.800000],
    ["2020-03-05",9.727273],
    ["2020-03-06",10.818182],
    ["2020-03-07",9.500000],
    ["2020-03-08",10.909091],
    ["2020-03-09",15.000000],
    ["2020-03-10",14.333333],
    ["2020-03-11",15.333333],
    ["2020-03-12",16.000000],
    ["2020-03-13",21.000000],
    ["2020-03-14",28.833333]]
    
    fig, ax = plot.subplots()
    
    dates = [x[0] for x in data]
    usage = [x[1] for x in data]
    x_index = range(len(usage))  # x values: position of each date, starting at 0
    
    # Fit a second-degree polynomial; coefficients come highest power first
    bestfit = np.polyfit(x_index, usage, 2)
    
    equation = str(round(bestfit[0],2)) + "x**2 + " + str(round(bestfit[1],2)) + "x + " + str(round(bestfit[2],2))
    
    ax.plot(x_index, usage)
    # Reuse the fitted coefficients for the dashed trend line
    ax.plot(x_index, np.poly1d(bestfit)(x_index), '--', label=equation)
    ax.legend()  # without this call the equation label is never drawn
    
    plot.show()
    
    print(equation)
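The string built by concatenation in the code above prints a negative coefficient as "+ -1.02x". If a cleaner label is wanted, a small formatting helper can handle the signs. The function name format_quadratic is hypothetical, not part of NumPy or matplotlib; this is only a sketch of one way to do it.

```python
import numpy as np

def format_quadratic(coeffs, digits=2):
    """Format [a, b, c] (highest power first) as 'y = ax**2 + bx + c'.

    Hypothetical helper: puts the sign of b and c between the terms
    instead of printing '+ -1.02x'.
    """
    a, b, c = coeffs
    sign_b = "+" if b >= 0 else "-"
    sign_c = "+" if c >= 0 else "-"
    return (f"y = {a:.{digits}f}x**2 {sign_b} {abs(b):.{digits}f}x "
            f"{sign_c} {abs(c):.{digits}f}")

# y values copied from the question, fitted against the index 0..11
usage = [9.727273, 9.8, 9.727273, 10.818182, 9.5, 10.909091,
         15.0, 14.333333, 15.333333, 16.0, 21.0, 28.833333]
coeffs = np.polyfit(range(len(usage)), usage, 2)

label = format_quadratic(coeffs)
print(label)
```

The resulting string can be passed as label= to ax.plot in place of the concatenated one.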