pythonpandasnumpyscipypower-law

Problem with r2 value of power law fit: Is this fit really worse than a straight line?


I have a dataframe of x and y data. The data exhibit a general power law form so I would like to fit a curve that's a power law. However when I do so I get a r2 value of -2.63 which tells me that the fit is worse than a straight line would be. Is this really the case?

I calculated r2 using two methods (r_squared and r_squared_2). I'm not sure if the issue is in how I fitted the power law curve or in my r2 calculation.

import numpy as np
import matplotlib.pyplot as plt
from math import exp
import pandas as pd
from scipy.optimize import curve_fit

new_dat = pd.DataFrame( {'x': {0: 133.39072904912717, 1: 138.24394099399626, 2: 123.68098616912548, 3: 145.53622634264102, 4: 115.66589965763, 5: 102.3954554988245, 6: 188.36808402142134, 7: 159.973167750876, 8: 109.27288573117258, 9: 228.48970913145482, 10: 59.72149772767079, 11: 107.86289489093632, 12: 139.54636990829943, 13: 69.16804084300782, 14: 128.51492415467243, 15: 123.89886969748194, 16: 96.69524458890069, 17: 179.4218007796204, 18: 81.94248858920511, 19: 116.57431987139303, 20: 86.72287597716091, 21: 104.26504167186982, 22: 96.21176975617014, 23: 113.05563002252855, 24: 95.13881793216328, 25: 90.24566440833108, 26: 120.21979837370618, 27: 148.02989788213065, 28: 131.5536333505709, 29: 43.98432257846345, 30: 151.20808505875556, 31: 106.90408749635041, 32: 208.84439653547977, 33: 141.93620845530992, 34: 66.06470015823503, 35: 144.26451450341665, 36: 268.44231416009114, 37: 104.21592558477657, 38: 87.4647314243362, 39: 21.62506288172477, 40: 211.8288449343543, 41: 137.1783782430448, 42: 152.68656578316114, 43: 71.40444647539057, 44: 138.26429570303063, 45: 195.8195134445166, 46: 65.0580543537033, 47: 91.53609270183331, 48: 133.93031838426649, 49: 130.18323679275105}, 'y': {0: 3385.9107941013963, 1: 3767.4773129933837, 2: 3393.972804533385, 3: 3207.540799419189, 4: 3503.971988612639, 5: 2699.582157811891, 6: 3472.2197615815303, 7: 3734.8682981154525, 8: 3015.1595391710443, 9: 3833.7005103694264, 10: 2180.8813084622725, 11: 3057.1175212715566, 12: 3322.2694622283707, 13: 2625.095843511092, 14: 3428.9305902296073, 15: 3665.3597140080483, 16: 3348.2359174389712, 17: 3203.650823344419, 18: 2314.649577797234, 19: 3445.142411753858, 20: 2854.0698989716257, 21: 3224.426663700497, 22: 2975.4529990214733, 23: 2830.8849349346683, 24: 2757.178895276296, 25: 2804.842233145504, 26: 2580.295378480375, 27: 3451.028240314123, 28: 3559.855598644374, 29: 1682.632983470442, 30: 3573.6640120241777, 31: 2612.5922620115434, 32: 3047.6869797329296, 33: 3500.1611748529945, 34: 2976.7358839883286, 35: 3270.15016432246, 36: 3702.39276797799, 37: 3174.024034000559, 38: 3116.0557991571313, 39: 2277.6763663475185, 40: 2907.349510347204, 41: 2959.5225286559644, 42: 3523.744356032963, 43: 2793.4503330781213, 44: 3688.929929188237, 45: 3654.619681532315, 46: 2077.6749192493166, 47: 2692.8596392079253, 48: 3365.5547117446195, 49: 3357.554384166426}} )
x,y = new_dat.x, new_dat.y

def power_law(x, a, b):
    return a*np.power(x, b)

# find best fit
popt, pcov = curve_fit(power_law, x, y)

# plot data and best fit curve.
plt.plot(x, y,'ok')
x = np.linspace(5, 300, len(x))
plt.plot(x, power_law(x, *popt),'r-',markersize=3,  linewidth=2.5)

# get r2 via method 1
residuals = y - power_law(x, *popt)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y-np.mean(y))**2)
r_squared = 1 - (ss_res / ss_tot)

# get r2 via method 2
from sklearn.metrics import r2_score
r_squared_2 = r2_score(y, power_law(x, *popt), multioutput='variance_weighted')

# plot labels
x_lab, y_lab, title = 'X data', 'Y data', 'Power law fit'
plt.xlabel(x_lab)
plt.ylabel(y_lab)
plt.title(title, fontweight="bold")

#ss_res / ss_tot
#print(ss_res, ss_tot)
#print(popt[0],popt[1])

# print both r2 methods
print('r2 from method 1 is:',r_squared)
print('r2 from methood 2 is:',r_squared_2)

power law fit results

p.s. I know that the data look linear, but I want to be able to fit a power law distribution since the number of actual data points is much larger than len(x).


Solution

  • Look at these lines:

    popt, pcov = curve_fit(power_law, x, y)
    
    # plot data and best fit curve.
    plt.plot(x, y,'ok')
    x = np.linspace(5, 300, len(x))
    plt.plot(x, power_law(x, *popt),'r-',markersize=3,  linewidth=2.5)
    

    First you fit your points with the original x data. After that though, x is overwritten by np.linspace. Check the documentation: the method changes your original values!

    Try to change the lines to those:

    popt, pcov = curve_fit(power_law, x, y)
    
    # plot data and best fit curve.
    plt.plot(x, y,'ok')
    plt.plot(np.sort(x), power_law(np.sort(x), *popt),'r-',markersize=3,  linewidth=2.5)
    

    You will see that r2 increased to 0.54. If you think it is still too low, you might want to try with more data points if you have them or, as already suggested, to change the curve to fit.

    P.S: I used np.sort() to correctly plot the fitted curve. If used without, the curve messes the plot (try it yourself :D).