gpflow

Automatic relevance determination with prior distribution for lengthscale


I'm trying to use GPflow to fit a GP. I would like to use automatic relevance determination and a prior for the lengthscales.

I know how to do both separately:

kernel = gpflow.kernels.SquaredExponential(lengthscales=[10] * X_train.shape[1])  # one lengthscale per input dimension (ARD)

and

# to_default_float comes from gpflow.utilities
kernel.lengthscales.prior = tfp.distributions.Gamma(
    to_default_float(3), to_default_float(0.25)
)

...but I would like to do both (so basically a different Gamma distribution as a prior for each feature).

I tried just using both lines of code and there is no error, but the prior does not seem to add or change anything.

How can I combine both these things?

EDIT: I played with it some more and I thought maybe it would not matter that much, as the lengthscales are adjusted during training. However, the starting point of the lengthscales has a significant impact on the accuracy of my model, and they never change dramatically from the starting point.

For instance, initializing with lengthscales = 10 gives optimized lengthscales between 7 and 13, initializing with 15 gives 12 to 18, and so on. Initializing with smaller lengthscales such as 0.1 or 1 leads to optimized lengthscales closer to 10.

Still, I think it would be very valuable to be able to set a prior for every feature so as to use ARD. I might investigate next whether that is (only?) possible using MCMC methods.


Solution

  • You can have a different prior distribution for each input dimension, too: you just need to define the prior with one component per dimension. For example, for a kernel over five input dimensions,

    # One concentration/rate pair per input dimension (five dimensions here)
    alpha_ard = np.array([3.0, 4.0, 3.0, 3.0, 7.0])
    beta_ard = np.array([0.25, 0.4, 0.4, 0.25, 0.25])
    kernel.lengthscales.prior = tfp.distributions.Gamma(alpha_ard, beta_ard)
    

    Note that the prior now has a batch_shape of [5] instead of [] as before.
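
    You can check this directly on the kernel from the snippet above (a quick check, assuming kernel, alpha_ard, and beta_ard are the ones defined there):

    print(kernel.lengthscales.prior.batch_shape)  # [5]: one Gamma per lengthscale dimension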

    This does change things, as you can verify with the following simple example:

    import gpflow
    import numpy as np
    import tensorflow_probability as tfp
    
    num_data = 1
    input_dim = 5
    output_dim = 1
    
    X_train = np.ones((num_data, input_dim))
    Y_train = np.ones((num_data, output_dim))
    
    # One shared scalar prior vs. batched priors with one Gamma per input dimension
    single_prior = tfp.distributions.Gamma(np.float64(3.0), np.float64(0.25))
    ard_equal_priors = tfp.distributions.Gamma(np.array([3.0] * 5), np.array([0.25] * 5))
    ard_different_priors = tfp.distributions.Gamma(
        np.array([1.0, 2.0, 3.0, 4.0, 5.0]), np.array([0.25, 0.1, 0.5, 0.2, 0.3])
    )
    
    def build_model(prior):
        kernel = gpflow.kernels.SquaredExponential(lengthscales=[1.0] * input_dim)
        kernel.lengthscales.prior = prior
        model = gpflow.models.GPR((X_train, Y_train), kernel, noise_variance=0.01)
        opt = gpflow.optimizers.Scipy()
        opt.minimize(model.training_loss, model.trainable_variables)
        return model  # return the fitted model so its lengthscales can be inspected
    
    m1 = build_model(single_prior)
    m2 = build_model(ard_equal_priors)
    m3 = build_model(ard_different_priors)
    

    m1 and m2 end up with exactly the same lengthscales, whereas m3 is different.
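
    One way to compare them (a small sketch; print_summary lives in gpflow.utilities, and the models are the ones returned by build_model above):

    from gpflow.utilities import print_summary
    
    # Compare the optimized lengthscales of the three models
    for name, m in [("m1", m1), ("m2", m2), ("m3", m3)]:
        print(name, m.kernel.lengthscales.numpy())
    
    # Or inspect all hyperparameters of one model at once
    print_summary(m3)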

    Without using MCMC, the hyperparameters are just point estimates (maximum likelihood [MLE] if you do not set priors, or maximum a posteriori [MAP] if you do set priors, as you did). There are commonly several local optima, so which one you end up in depends on the initialization. See this distill article for more explanation.
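
    To see the initialization dependence concretely, here is a minimal sketch (it reuses input_dim, X_train, Y_train, and ard_different_priors from the example above; build_model_from is just an illustrative helper name):

    def build_model_from(initial_lengthscale, prior):
        # Same model as above, but with a configurable starting point for the lengthscales
        kernel = gpflow.kernels.SquaredExponential(
            lengthscales=[initial_lengthscale] * input_dim
        )
        kernel.lengthscales.prior = prior
        model = gpflow.models.GPR((X_train, Y_train), kernel, noise_variance=0.01)
        gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
        return model
    
    # Different starting points can converge to different (MAP) point estimates
    for init in [0.1, 1.0, 10.0]:
        m = build_model_from(init, ard_different_priors)
        print(init, m.kernel.lengthscales.numpy())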