I'm trying to use GPflow to fit a GP. I would like to use automatic relevance determination and a prior for the lengthscales.
I know how to do both separately:
kernel = gpflow.kernels.SquaredExponential(lengthscales=[10] * X_train.shape[1])
and
kernel.lengthscales.prior = tfp.distributions.Gamma(
to_default_float(3), to_default_float(0.25)
)
...but I would like to do both (so basically a different Gamma distribution as a prior for each feature).
I tried simply using both lines of code together: there is no error, but the prior does not seem to change anything.
How can I combine both these things?
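For reference, this is what my combined attempt looks like (X_train as above, to_default_float from gpflow.utilities):

import gpflow
import tensorflow_probability as tfp
from gpflow.utilities import to_default_float

kernel = gpflow.kernels.SquaredExponential(lengthscales=[10] * X_train.shape[1])
kernel.lengthscales.prior = tfp.distributions.Gamma(
    to_default_float(3), to_default_float(0.25)
)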
EDIT: I played with it some more and I thought maybe it would not matter that much, as the lengthscales are adjusted during training. However, the starting point of the lengthscales has a significant impact on the accuracy of my model, and they never change dramatically from the starting point.
For instance, initializing with lengthscales of 10 gives optimized lengthscales between 7 and 13, initializing with 15 gives 12 to 18, and so on. Initializing with smaller lengthscales such as 0.1 or 1 leads to optimized lengthscales closer to 10.
Still, I think it would be very valuable to be able to set a prior for every feature so as to use ARD. I might investigate next whether that is (only?) possible using MCMC methods.
You can have different prior distributions for each input dimension, too: you just need to define the prior accordingly. For example,
alpha_ard = np.array([3.0, 4.0, 3.0, 3.0, 7.0])
beta_ard = np.array([0.25, 0.4, 0.4, 0.25, 0.25])
kernel.lengthscales.prior = tfp.distributions.Gamma(alpha_ard, beta_ard)
Note that the prior now has a batch_shape of [5] instead of [] as before.
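You can confirm this directly on the kernel; a quick check, assuming the 5-dimensional kernel built as in your question:

print(kernel.lengthscales.prior.batch_shape)  # TensorShape([5])
print(kernel.lengthscales.shape)              # (5,)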
This does change things, as you can verify with the following simple example:
import gpflow
import numpy as np
import tensorflow_probability as tfp
import tensorflow as tf
num_data = 1
input_dim = 5
output_dim = 1
X_train = np.ones((num_data, input_dim))
Y_train = np.ones((num_data, output_dim))
single_prior = tfp.distributions.Gamma(np.float64(3.0), np.float64(0.25))
ard_equal_priors = tfp.distributions.Gamma(np.array([3.0]*5), np.array([0.25]*5))
ard_different_priors = tfp.distributions.Gamma(np.array([1.0, 2.0, 3.0, 4.0, 5.0]), np.array([0.25, 0.1, 0.5, 0.2, 0.3]))
def build_model(prior):
    kernel = gpflow.kernels.SquaredExponential(lengthscales=[1.0] * input_dim)
    kernel.lengthscales.prior = prior
    model = gpflow.models.GPR((X_train, Y_train), kernel, noise_variance=0.01)
    opt = gpflow.optimizers.Scipy()
    opt.minimize(model.training_loss, model.trainable_variables)
    return model  # return the optimized model so the fitted lengthscales can be inspected
m1 = build_model(single_prior)
m2 = build_model(ard_equal_priors)
m3 = build_model(ard_different_priors)
m1 and m2 end up with exactly the same lengthscales, whereas m3 is different.
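You can verify this by printing the optimized lengthscales of each model (gpflow.utilities.print_summary also works if you want a full parameter overview):

for name, model in [("m1", m1), ("m2", m2), ("m3", m3)]:
    print(name, model.kernel.lengthscales.numpy())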
Without using MCMC, the hyperparameters are just point estimates: maximum likelihood (MLE) if you do not set priors, or maximum a posteriori (MAP) if you set priors as you did. There are commonly several local optima, so which one you end up in depends on the initialization. See this Distill article for more explanation.
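If you want full posteriors over the hyperparameters instead of MAP point estimates, you can sample them with GPflow's SamplingHelper together with TensorFlow Probability's MCMC kernels. A minimal sketch, assuming GPflow 2.x and reusing build_model from above (step size and chain lengths are illustrative, not tuned):

import tensorflow as tf

model = build_model(ard_different_priors)  # MAP estimate as the chain's starting point

# SamplingHelper exposes the unconstrained variables and the log posterior to tfp.mcmc.
hmc_helper = gpflow.optimizers.SamplingHelper(
    model.log_posterior_density, model.trainable_parameters
)
hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=hmc_helper.target_log_prob_fn,
    num_leapfrog_steps=10,
    step_size=0.01,
)

@tf.function
def run_chain():
    return tfp.mcmc.sample_chain(
        num_results=300,
        num_burnin_steps=100,
        current_state=hmc_helper.current_state,
        kernel=hmc,
        trace_fn=lambda _, pkr: pkr.is_accepted,
    )

samples, _ = run_chain()
# Map the unconstrained samples back to constrained parameter space,
# e.g. positive lengthscales.
constrained_samples = hmc_helper.convert_to_constrained_values(samples)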