python scikit-learn gaussian-process online-machine-learning

Gaussian Process Regression incremental learning


I am using the scikit-learn implementation of Gaussian Process Regression, and I want to fit one point at a time instead of fitting the whole set of points at once. The resulting alpha coefficients should remain the same, e.g.

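from sklearn.gaussian_process import GaussianProcessRegressor

gpr2 = GaussianProcessRegressor()
for i in range(x.shape[0]):
    # fit expects a 2-D X, so slice to keep the sample axis
    gpr2.fit(x[i:i + 1], y[i:i + 1])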

should be the same as

gpr = GaussianProcessRegressor().fit(x, y)

But when accessing gpr2.alpha_ and gpr.alpha_, they are not the same. Why is that?

I am working on a project where new data points keep arriving. I don't want to append them to the x, y arrays and fit on the whole dataset again, as that is very time-consuming. Let x be of size n; then I have:

n + (n-1) + (n-2) + ... + 1 ∈ O(n^2) fittings

Considering that the fitting itself is quadratic (correct me if I'm wrong), the run-time complexity should be in O(n^3). It would be far cheaper if I could do a single-point fit for each of the n points:

1 + 1 + ... + 1 = n ∈ O(n)


Solution

  • What you refer to is called online learning or incremental learning; it is a huge sub-field of machine learning in its own right, and it is not available out-of-the-box for all scikit-learn models. Quoting from the relevant documentation:

    Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.

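    Following this excerpt, the documentation gives a complete list of the scikit-learn models that currently support incremental learning, and GaussianProcessRegressor is not one of them. Each call to fit therefore discards whatever was learned before and refits from scratch on the data passed in, so after your loop gpr2 has only seen the last point; that is why gpr2.alpha_ differs from gpr.alpha_.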
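
To make this concrete, here is a minimal sketch, not a scikit-learn feature. It assumes the kernel hyperparameters stay fixed (no re-optimization), in which case the Cholesky factor of the kernel matrix can be extended by one row per new point in O(n^2) instead of being recomputed in O(n^3). The class `IncrementalGP`, its `add_point` method, the RBF kernel, the noise value 1e-2, and the toy data are all assumptions made for illustration. The sketch also shows that repeated `fit` calls keep only the last point.

```python
import numpy as np
from scipy.linalg import cho_solve, solve_triangular
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


class IncrementalGP:
    """Hypothetical helper (not part of scikit-learn): fixed-kernel GP grown point by point."""

    def __init__(self, kernel, noise):
        self.kernel = kernel   # kernel with fixed hyperparameters (assumption)
        self.noise = noise     # added to the diagonal, like sklearn's `alpha` parameter
        self.X = None
        self.y = None
        self.L = None          # lower Cholesky factor of K + noise * I
        self.alpha_ = None

    def add_point(self, x_new, y_new):
        x_new = np.atleast_2d(x_new)
        c = self.kernel(x_new)[0, 0] + self.noise
        if self.X is None:
            self.X = x_new
            self.y = np.array([y_new], dtype=float)
            self.L = np.array([[np.sqrt(c)]])
        else:
            n = self.X.shape[0]
            k = self.kernel(self.X, x_new)[:, 0]         # covariances with the old points
            l = solve_triangular(self.L, k, lower=True)  # new off-diagonal row, O(n^2)
            L_new = np.zeros((n + 1, n + 1))
            L_new[:n, :n] = self.L
            L_new[n, :n] = l
            L_new[n, n] = np.sqrt(c - l @ l)             # new diagonal entry
            self.L = L_new
            self.X = np.vstack([self.X, x_new])
            self.y = np.append(self.y, y_new)
        # Two triangular solves, O(n^2); no new factorization is needed.
        self.alpha_ = cho_solve((self.L, True), self.y)
        return self


# Toy data (assumption, for illustration only).
x = np.linspace(0.0, 5.0, 8).reshape(-1, 1)
y = np.sin(x).ravel()
kernel, noise = RBF(length_scale=1.0), 1e-2

# Reference: one batch fit, optimizer switched off so the kernel stays fixed.
gpr = GaussianProcessRegressor(kernel=kernel, alpha=noise, optimizer=None).fit(x, y)

# Repeated fit() calls: each call starts over, so only the last point survives.
gpr2 = GaussianProcessRegressor(kernel=kernel, alpha=noise, optimizer=None)
for i in range(x.shape[0]):
    gpr2.fit(x[i:i + 1], y[i:i + 1])

# Incremental update: one O(n^2) step per new point.
inc = IncrementalGP(kernel, noise)
for i in range(x.shape[0]):
    inc.add_point(x[i], y[i])

print(gpr2.alpha_.shape, gpr.alpha_.shape)  # one coefficient vs. one per training point
print(np.allclose(inc.alpha_, gpr.alpha_))  # incremental result vs. batch fit
```

If the hyperparameters have to be re-optimized as data arrives, this shortcut no longer applies as written and a full refit (or a sparse/approximate GP method) would be needed; the sketch only covers the fixed-kernel case.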