I’m trying to fit a logistic function to GDP data (a China GDP dataset) using scipy.optimize.curve_fit.
The fit looks very good (R² ≈ 0.99, and the curve matches the data), but when I try to predict a single value for a given year, the result is far too large: for 1970 I get ≈ 1.9e11 while the real value is only ≈ 9.2e10.
Here is a simplified version of my code:
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.special import expit
df = pd.read_csv("china_gdp.csv")
msk = np.random.rand(len(df)) < 0.8  # 80/20 train/test mask
x = df['Year'].values.astype(float)
y = df['Value'].values.astype(float)
# normalize x, y to [0,1]
x_s = (x - x.min()) / (x.max() - x.min())
y_s = (y - y.min()) / (y.max() - y.min())
def logistic(x, B1, B2, B3, B4):
    return B3 + B4 * expit(B1 * (x - B2))
# initial guess and bounds
p0 = (5.0, 0.5, 0.0, 1.0)
bounds = ([0.0, 0.0, -np.inf, 0.0], [np.inf, 1.0, np.inf, np.inf])
popt, pcov = curve_fit(logistic, x_s, y_s, p0=p0, bounds=bounds)
print(dict(zip(['B1','B2','B3','B4'], popt)))
# check fit
import matplotlib.pyplot as plt
xx = np.linspace(0, 1, 300)
yy = logistic(xx, *popt)
plt.scatter(x_s, y_s, s=15, label='data')
plt.plot(xx, yy, label='fit')
plt.legend(); plt.show()
from sklearn.metrics import r2_score
xtest = x_s[~msk]
ytest = y_s[~msk]
ypred = logistic(xtest, *popt)
print(r2_score(ytest, ypred))
x_in = 1970
rs = logistic((x_in - x.min()) / (x.max() - x.min()), *popt)
r = rs * (y.max() - y.min()) + y.min()
print(r)
Output:
0.9985891574981185
192285528141.3661
But the real GDP value for 1970 is 91506211306.3745.
Am I mis-scaling my input/output, or is there an issue with how I define the logistic function / initial parameters?
I expected the prediction to be close to the real GDP values since the fit looks good and the R² score is high. I suspect I'm doing something wrong with scaling or with the logistic function parameters.
I took a look at the code and the plots, and I think you did everything correctly from a code perspective. The issue you are seeing with the 1970 GDP value is just the natural interplay between the data itself and the function you are trying to fit it to! Your fitting function, 'logistic', is basically an offset plus a scaled sigmoid with an affine input B1*(x - B2), so it has two shape parameters and two scale/offset parameters. The SciPy docs for scipy.special.expit show the shape of the sigmoid; this is what you are fitting to:
You can see you are basically working with a monotonic shape that has mild slopes at the start and end and a steep slope in the middle. Looking at your prediction plot, there is a region where China's GDP takes off; in the data, this happens roughly in the early-to-mid 90s:
With this setup, the curve is fit with a loss computed on values in the range [0, 1], since you normalize both the time and GDP values before fitting. Looking at the 1970 dot (tenth from the left), the fitted curve is only a little above the real value on this absolute y scale. In fact, if you look at the sum of absolute errors on the normalized output, the fit is actually not too bad over the whole 1960-2014 dataset! You could use the sum of squared errors here instead, but since all the values lie between 0 and 1, squaring makes every term smaller and the measurement under-reports the errors.
print(sum(abs(ytest - ypred)))
>>> 0.26858
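To make the point about squaring concrete, here is a tiny sketch with made-up residuals on the normalized [0, 1] scale (the numbers are illustrative, not from the GDP fit):

```python
import numpy as np

# hypothetical residuals on the normalized [0, 1] scale
err = np.array([0.05, 0.10, 0.02, 0.08])

sae = np.sum(np.abs(err))  # sum of absolute errors
sse = np.sum(err ** 2)     # sum of squared errors: every term shrinks below 1

print(sae, sse)
```

Because each residual is below 1, the squared sum comes out an order of magnitude smaller than the absolute sum, which is why it under-reports the error here.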
All of this context leads to my main point: fitting this bumpy, exponential-looking data to a sigmoid will inevitably leave some of the smaller values quite far from their true values, especially on a percentage basis. There are simply not enough degrees of freedom in this function to capture every part of a curve of this shape perfectly. I mentioned that China's GDP takes off in the early-to-mid 90s; if we truncate the data to just 1960-1991, we can avoid the steepest part of the take-off and then evaluate the 1970 point. Here is how I modified the code for that:
_crop = 32  # keep 1960-1991 (the first 32 rows)
msk = np.random.rand(_crop) < 0.8
x = df['Year'].values.astype(float)[:_crop]
y = df['Value'].values.astype(float)[:_crop]
With this change, the fitted value for 1970 is 94736151945.78181, which is much closer to the real value of 91506211306.3745. The resulting plot looks like this:
You can see that the 1970 point (the tenth one) is much closer to its predicted value.
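As a side note, if you ever want to sanity-check the normalize → fit → rescale round trip itself, here is a self-contained sketch (the helper fit_logistic and the synthetic series are my additions, not part of the original code) that fits the same logistic form to data generated from a known logistic and recovers values on the original scale:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit

def logistic(x, B1, B2, B3, B4):
    return B3 + B4 * expit(B1 * (x - B2))

def fit_logistic(years, values):
    """Fit on [0, 1]-normalized data; return a predictor on the original scale."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    x_s = (years - years.min()) / (years.max() - years.min())
    y_s = (values - values.min()) / (values.max() - values.min())
    popt, _ = curve_fit(logistic, x_s, y_s, p0=(5.0, 0.5, 0.0, 1.0),
                        bounds=([0.0, 0.0, -np.inf, 0.0],
                                [np.inf, 1.0, np.inf, np.inf]))
    def predict(year):
        xs = (year - years.min()) / (years.max() - years.min())
        return logistic(xs, *popt) * (values.max() - values.min()) + values.min()
    return predict

# synthetic GDP-like series generated from a known logistic, for the check
years = np.arange(1960, 2015, dtype=float)
true = 1e11 + 9e12 * expit(0.25 * (years - 1995.0))
predict = fit_logistic(years, true)
print(predict(1970.0), true[10])  # the two should be very close
```

Since the synthetic data is exactly a logistic, the round trip recovers the original-scale values closely; with the real GDP data the remaining gap is the model mismatch discussed above, not a scaling bug.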
With all that said, any curve fit is limited by the fitting function you choose: it will excel at fitting data whose behavior resembles that function. A single sigmoid is just not great at catching little bumps or general fluctuation throughout. If you are curious about how to get all of the value predictions tighter, you can try fitting a polynomial function and keep increasing the degrees of freedom until you notice it starts capturing the values too well (i.e., overfitting).
Edit: OP asked some good questions in the comments about this answer. The sections below address them:
OP Q1: But I still have a question: what is the right way to reduce this problem without ending up with overfitting?
OP Q2: So another question is: can we actually trust such a model to predict values as far as 2040? Could it be that the point where growth slows down at the end actually happens earlier or later in reality?
---- Q1 ----
But I still have a question: what is the right way to reduce this problem without ending up with overfitting?
I made a little Python script for you to play with the degrees of freedom and see how things get fitted, underfitted, or overfitted. Instead of a sigmoid, the example script uses a simple polynomial function with a least-squares fit. The degrees are straightforward to understand for a polynomial: each new degree adds a weight. For example, degree 2 has 3 weights "w": w0*(x^0) + w1*(x^1) + w2*(x^2); degree 3 has 4 weights: w0*(x^0) + w1*(x^1) + w2*(x^2) + w3*(x^3). Since x^0 == 1, most examples write just w0 rather than w0*(x^0), but I included it here for completeness. Take a look at the plots below to see how the degree of a polynomial fitting function affects the fit. I randomly generated one "real" y testing vector and kept it the same for each plot, so you can see the impact of the degree on the same "real" y data:
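As a quick check of the weight expansion written above, here is a small sketch (the coefficients are made up) showing that writing the weights out by hand matches NumPy's own polynomial evaluation:

```python
import numpy as np

# exact degree-2 data: y = 2 + 3x + 0.5x^2, so the fit should recover the weights
x = np.arange(10, dtype=float)
y = 2.0 + 3.0 * x + 0.5 * x ** 2

w = np.polyfit(x, y, 2)          # note: highest degree first, i.e. [w2, w1, w0]
w2, w1, w0 = w
manual = w0 * x**0 + w1 * x**1 + w2 * x**2   # the expansion written out

print(np.allclose(manual, np.poly1d(w)(x)))  # True
```

One thing to watch: np.polyfit returns coefficients highest-degree first, the reverse of the w0, w1, w2 ordering used in the explanation above.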
Here is the code I used to generate this plot, feel free to play with the degree step variable or the random seed variable:
import numpy as np
import matplotlib.pyplot as plt
# six different polynomial degrees are used, one per subplot;
# the i-th subplot fits a polynomial of degree degree_step * (i + 1)
# NOTE: EDIT THIS TO SEE HOW DIFFERENT DEGREES WORK!
degree_step = 3
# set your own seed for the random data
random_seed = 1
# ----------------------------------------------------------------------------
# generate degrees
deg = degree_step * (1+np.arange(6))
# creating a random vector with 50 elements in the range of [0, 100]
elements = 50
x = np.arange(elements)
np.random.seed(int(random_seed)); y_real = 1e2*np.random.rand(elements)
# do least squares on polynomial func of i degree, then make predictions
y_preds = [j(x) for j in [np.poly1d(np.polyfit(x, y_real, i)) for i in deg]]
# generate the subplots and plot it altogether
fig, axs = plt.subplots(3, 2)
ax_ind = sorted((i % 3, i % 2) for i in range(6))  # subplot grid coordinates
for i in range(len(deg)):
    axs[ax_ind[i]].set_title(f'Degree = {deg[i]}')
    axs[ax_ind[i]].scatter(x, y_real, s=15, label='data')
    axs[ax_ind[i]].plot(x, y_preds[i], label='fit')
    axs[ax_ind[i]].legend()
plt.show()
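To make "starts capturing the values too well" measurable rather than purely visual, one common approach (my addition, not part of the script above) is to hold out some points and watch the held-out error as the degree grows: training error keeps falling, while held-out error eventually turns back up.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 5.0 * x ** 2 + rng.normal(0.0, 0.2, size=50)  # quadratic trend + noise

train = np.arange(50) % 5 != 0   # hold out every fifth point
degrees = [1, 2, 6, 12]
train_mse, test_mse = [], []
for deg in degrees:
    p = np.poly1d(np.polyfit(x[train], y[train], deg))
    train_mse.append(np.mean((p(x[train]) - y[train]) ** 2))
    test_mse.append(np.mean((p(x[~train]) - y[~train]) ** 2))
    print(deg, train_mse[-1], test_mse[-1])
```

Training error can only go down as you add degrees (the lower-degree fits are nested inside the higher-degree ones), so the held-out error is the number to watch when picking a degree.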
---- Q2 ----
So another question is: can we actually trust such a model to predict values as far as 2040? Could it be that the point where growth slows down at the end actually happens earlier or later in reality?
This is more of a dataset-specific question, but luckily I know a little about financial data :) In the case of transitioning economies (https://en.wikipedia.org/wiki/Transition_economy), a sigmoid function can make sense for a while: the shift to the private sector (especially in China, the second most populated country in the world) essentially uncaps market opportunity that was capped before. This is driven by a ton of factors, but some of the larger ones are: existing public industry is privatized and starts contributing directly to GDP in a way that may not have been accurately reported before; markets respond directly to cash flows instead of government requests, which affects both supply/demand dynamics and the freedom of pricing; and if free trade is permissive enough, the number of accessible markets expands from one country to potentially hundreds. These factors are a simple unlock of what is already there, which creates the explosion, but the effect dampens over time as the latent potential is fully actualized in a capitalistic sense. I doubt something like a sigmoid function will fit Chinese GDP over the long run, since their approach to industrial development has basically been 'innovate first, regulate later'. This way of thinking is well demonstrated by the skies of Beijing going from the most polluted in 2013 to significantly cleaner by 2025; here is the regulation: https://english.mee.gov.cn/News_service/infocus/201309/t20130924_260707.shtml
Besides China, economies (as measured by GDP in our example) go up and down all the time, which makes fitting a single sigmoid function to them pretty useless (a horrible fit). Here are some other countries' GDP series I took from FRED; a sigmoid fit will not do well on them. You can download the data as CSV files and run them through your own model to see how it does:
Iran GDP:
https://fred.stlouisfed.org/series/MKTGDPIRA646NWDB
Guyana GRP:
https://fred.stlouisfed.org/series/MKTGDPGYA646NWDB
Equatorial Guinea GDP:
https://fred.stlouisfed.org/series/MKTGDPGQA646NWDB
Japan GDP:
https://fred.stlouisfed.org/series/JPNNGDP
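If you want to feed any of those series into the same pipeline, here is a small loader sketch. The column names and sample values below are assumptions for illustration (check the header of your actual download); FRED CSVs generally have a date column first and the series values second:

```python
import io
import pandas as pd

def load_fred_csv(csv_source):
    """Read a FRED-style CSV: first column is the date, second is the series."""
    df = pd.read_csv(csv_source)
    date_col, value_col = df.columns[:2]
    years = pd.to_datetime(df[date_col]).dt.year.values.astype(float)
    values = df[value_col].values.astype(float)
    return years, values

# stand-in for a downloaded file such as JPNNGDP.csv (made-up sample rows)
sample = io.StringIO("DATE,JPNNGDP\n1994-01-01,495612.2\n1995-01-01,501706.9\n")
years, values = load_fred_csv(sample)
print(years)  # [1994. 1995.]
```

The (years, values) pair can then go straight into the normalize-and-fit code from the question.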
By asking "can we actually trust such a model to predict values as far as 2040?" you are asking a low-key profound question that sits at the heart of all forecasting models. Every mathematical model is predicated on fitting data to a specific function with a specific loss function; in my example code, that was fitting random data to a polynomial with least squares. How much you can trust your model at an out-of-sample point is purely a function of how much you trust that your math captures reality, and of your expectations about future reality!