pythonmachine-learningscikit-learngaussian-processgpy

Gaussian process binary classification: why is the variance with GPy much smaller than with scikit-learn?


I'm learning about binary classification with Gaussian Processes and I am comparing GPy with scikit-learn on a toy 1D problem inspired by Martin Krasser's blog post. Both implementations (GPy and scikit-learn) seem to use a similar setup with a RBF kernel. After optimizing the kernel hyperparameters, the lenghtscales are similar, but the variances differ by a lot. The GPy kernel variance appears to be too small.

How can I modify my GPy implentation and obtain a result similar to those with scikit-learn? I suspect it has to do with the internal implentation of each algorithm, but I can't tell what's causing this drastic difference. I explain further below why I believe my GPy implementation needs to be fixed.

Implementation details: Python 3.9 with GPy 1.13.2 and scikit-learn 1.5.1. Reproducible example:

import numpy as np
from scipy.stats import bernoulli
from scipy.special import expit as sigmoid

##############################
# Part 1: toy dataset creation
##############################

np.random.seed(0)
X = np.arange(0, 5, 0.05).reshape(-1, 1)
X_test = np.arange(-2, 7, 0.1).reshape(-1, 1)

a = np.sin(X * np.pi * 0.5) * 2  # latent function
t = bernoulli.rvs(sigmoid(a))    # Bernoulli training data (0s and 1s)

#####################################
# Part 2: scikit-learn implementation
#####################################

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rbf = ConstantKernel(1.0, constant_value_bounds=(1e-3, 10)) \
        * RBF(length_scale=1.0, length_scale_bounds=(1e-3, 10))
gpc = GaussianProcessClassifier(
    kernel=rbf,
    optimizer='fmin_l_bfgs_b',
    n_restarts_optimizer=10)

gpc.fit(X_scaled, t.ravel())

print(gpc.kernel_)
# 1.5**2 * RBF(length_scale=0.858)

############################
# Part 3: GPy implementation
############################

import GPy

kern = GPy.kern.RBF(
    input_dim=1,
    variance=1.,
    lengthscale=1.)
kern.lengthscale.unconstrain()
kern.variance.unconstrain()
kern.lengthscale.constrain_bounded(1e-3, 10)
kern.variance.constrain_bounded(1e-3, 10)

m = GPy.core.GP(
    X=X,Y=t, kernel=kern, 
    inference_method=GPy.inference.latent_function_inference.laplace.Laplace(),    
    likelihood=GPy.likelihoods.Bernoulli())

m.optimize_restarts(
    num_restarts=10, optimizer='lbfgs',
    verbose=True, robust=True)

print(m.kern)
#  rbf.         |               value  |  constraints  |  priors
#  variance     |  0.8067562453940487  |  0.001,10.0   |        
#  lengthscale  |  0.8365668826459536  |  0.001,10.0   |

The lenghtscale values are roughly similar (0.858 vs 0.836), but the variance values are very different (1.5**2 = 2.25 for scikit-learn and only 0.806 for GPy).

The reason why I believe my GPy implementation needs adjustments is that the true latent function (see "a" in part 1 of code above) does not closely match the predicted one even with +/- 2 standard deviation bounds. On the other hand, the scikit-learn implementation matches it reasonably well (it it possible to retrieve the latent function mean and std. deviation with scikit-learn as shown here).

Left: predicted probabilities are similar with both models (makes sense since they share similar lenghtscale values). Right: the predicted latent function of GPy does not fit the true latent function as well as the scikit-learn model does.

What I've tried so far, without significant change in the results:


Solution

  • It turns out GPy's Bernoulli likelihood class maps the latent predictions to a probability using the standard normal CDF (source code shows the default is the probit link function). The dataset in my example instead uses the sigmoid function, meaning that a logit link function would instead be needed. Even though it's not implemented in the GPy library, we can create our own link function object and pass it to the GPy model when initializing it. Here's an updated version of my code (see changes in Part 3):

    import numpy as np
    from scipy.stats import bernoulli
    from scipy.special import expit as sigmoid
    
    ##############################
    # Part 1: toy dataset creation
    ##############################
    
    np.random.seed(0)
    X = np.arange(0, 5, 0.05).reshape(-1, 1)
    X_test = np.arange(-2, 7, 0.1).reshape(-1, 1)
    
    a = np.sin(X * np.pi * 0.5) * 2  # latent function
    t = bernoulli.rvs(sigmoid(a))    # Bernoulli training data (0s and 1s)
    
    #####################################
    # Part 2: scikit-learn implementation
    #####################################
    
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import ConstantKernel, RBF
    
    rbf = ConstantKernel(1.0, constant_value_bounds=(1e-3, 10)) \
            * RBF(length_scale=1.0, length_scale_bounds=(1e-3, 10))
    gpc = GaussianProcessClassifier(
        kernel=rbf,
        optimizer='fmin_l_bfgs_b',
        n_restarts_optimizer=10)
    
    gpc.fit(X_scaled, t.ravel())
    
    print(gpc.kernel_)
    # 1.95**2 * RBF(length_scale=0.898)
    
    ############################
    # Part 3: GPy implementation
    ############################
    
    import GPy
    from GPy.likelihoods.link_functions import GPTransformation
    
    kern = GPy.kern.RBF(
        input_dim=1,
        variance=1.,
        lengthscale=1.)
    kern.lengthscale.unconstrain()
    kern.variance.unconstrain()
    kern.lengthscale.constrain_bounded(1e-3, 10)
    kern.variance.constrain_bounded(1e-3, 10)
    
    # Create custom link function
    class LogitLink(GPTransformation):
        def transf(self, f):
            """Sigmoid function"""
            return 1 / (1 + np.exp(-f))
    
        def dtransf_df(self, f):
            """First derivative of sigmoid"""
            p = self.transf(f)
            return p * (1 - p)
    
        def d2transf_df2(self, f):
            """Second derivative of sigmoid"""
            p = self.transf(f)
            return p * (1 - p) * (1 - 2 * p)
    
        def d3transf_df3(self, f):
            """Third derivative of sigmoid"""
            p = self.transf(f)
            return p * (1 - p) * (1 - 6 * p + 6 * p**2)
    
    logit_link = LogitLink()
    
    m = GPy.core.GP(
        X=X,Y=t, kernel=kern, 
        inference_method=GPy.inference.latent_function_inference.laplace.Laplace(),    
        likelihood=GPy.likelihoods.Bernoulli(gp_link=logit_link))
    
    m.optimize_restarts(
        num_restarts=10, optimizer='lbfgs',
        verbose=True, robust=True)
    
    print(m.kern)
    #  rbf.         |               value  |  constraints  |  priors
    #  variance     |  3.7922168280178328  |  0.001,10.0   |        
    #  lengthscale  |  0.8977846144774547  |  0.001,10.0   |
    

    The scikit-learn implementation (Part 2 of the code) uses the sigmoid mapping (logit link function), which is why the results with it made more sense. Now, both implementation (scikit-learn and GPy) return the same lengthscale and variance values after optimization.