I have successfully trained a multi-output Gaussian process model using the GPy.models.GPCoregionalizedRegression model from the GPy package. The model has ~25 inputs and 6 outputs.
The underlying kernel is a GPy.util.multioutput.ICM kernel consisting of a rational quadratic kernel (GPy.kern.RatQuad) and the GPy.kern.Coregionalize kernel.
I am now interested in the feature importance for each individual output. The RatQuad kernel provides an ARD=True (Automatic Relevance Determination) keyword, which gives the kernel one lengthscale per input dimension and thus allows reading off the feature importance for a single-output model (this is also what the get_most_significant_input_dimensions() method of the GPy model exploits).
However, calling get_most_significant_input_dimensions() on the GPy.models.GPCoregionalizedRegression model gives me a single list of indices, which I assume are the most significant inputs across all outputs.
How can I calculate/obtain the lengthscale values or most significant features for each individual output of the model?
The problem is the model itself. The intrinsic coregionalization model (ICM) is set up such that all outputs are determined by a shared underlying "latent" Gaussian process. Thus, calling get_most_significant_input_dimensions() on a GPy.models.GPCoregionalizedRegression model can only give you one set of input dimensions that is significant for all outputs together.
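For reference, the ICM covariance can be written in its usual textbook form as below. The symbols B, W, κ and k are notation introduced here for illustration, not names taken from the GPy source:

```latex
\operatorname{cov}\bigl(f_d(x),\, f_{d'}(x')\bigr) = B_{dd'}\, k(x, x'),
\qquad B = W W^\top + \operatorname{diag}(\kappa)
```

Since there is only one k, there is only one set of ARD lengthscales; the outputs d and d' differ only through the entries of B.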
The solution is to use a GPy.util.multioutput.LCM kernel, which is defined as a sum of ICM kernels, each built from its own (latent) GP kernel taken from a list. It works as follows:
import GPy

# Your data
# x = ...
# y = ...
# rank = ...  # rank of the coregionalization matrices, passed as W_rank below

# ICM case (for comparison)
# kernel = GPy.util.multioutput.ICM(input_dim=x.shape[1],
#                                   num_outputs=y.shape[1],
#                                   kernel=GPy.kern.RatQuad(input_dim=x.shape[1], ARD=True))

# LCM case: one ARD kernel per latent GP
k_list = [GPy.kern.RatQuad(input_dim=x.shape[1], ARD=True) for _ in range(y.shape[1])]
kernel = GPy.util.multioutput.LCM(input_dim=x.shape[1], num_outputs=y.shape[1],
                                  W_rank=rank, kernels_list=k_list)
The data needs to be reshaped (this is also necessary for the ICM model and thus outside the scope of this question, see here for details):
# Reshaping data to fit GPCoregionalizedRegression
xx = reshape_for_coregionalized_regression(x)
yy = reshape_for_coregionalized_reshaping(y)
m = GPy.models.GPCoregionalizedRegression(xx, yy, kernel=kernel)
m.optimize()
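The two reshaping helpers above are not defined in this answer. As a rough sketch, assuming every output is observed at the same inputs, GPCoregionalizedRegression takes one input array and one single-column output array per output, passed as two lists. The function name to_coregionalized_lists is made up for illustration:

```python
import numpy as np

def to_coregionalized_lists(x, y):
    """Split (n, d) inputs and (n, p) outputs into per-output lists.

    Hypothetical helper; assumes all p outputs share the same inputs x.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    n_outputs = y.shape[1]
    x_list = [x for _ in range(n_outputs)]          # same inputs for each output
    y_list = [y[:, [i]] for i in range(n_outputs)]  # keep each output 2-D: (n, 1)
    return x_list, y_list
```

With this, xx, yy = to_coregionalized_lists(x, y) would produce the two lists handed to the model constructor.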
Once the optimization has converged, one can call get_most_significant_input_dimensions() on an individual latent GP (here output 0):
sig_inputs_0 = m.sum.ICM0.get_most_significant_input_dimensions()
or loop over all kernels:
sig_inputs = []
for part in m.kern.parts:
    sig_inputs.append(part.get_most_significant_input_dimensions())
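If the raw lengthscales are of interest rather than only the ranking, they can be read from each latent RatQuad kernel and ranked by hand. The sketch below covers only the ranking step with NumPy, under the common ARD convention that a shorter lengthscale means a more relevant input. The function name rank_inputs_by_lengthscale and the example values are invented; how the lengthscale array is pulled out of a given kernel part is left open, because the parameter path depends on the kernel names in your model.

```python
import numpy as np

def rank_inputs_by_lengthscale(lengthscales, top_k=None):
    """Return input indices ordered from most to least relevant.

    Assumes ARD lengthscales: relevance is taken as 1 / lengthscale**2,
    so the shortest lengthscale ranks first. Hypothetical helper.
    """
    ls = np.asarray(lengthscales, dtype=float).ravel()
    if np.any(ls <= 0):
        raise ValueError("lengthscales must be positive")
    relevance = 1.0 / ls**2
    order = np.argsort(-relevance, kind="stable")  # descending relevance
    return order if top_k is None else order[:top_k]

# Made-up lengthscales for a kernel with four inputs
example = [5.0, 0.5, 20.0, 1.0]
print(rank_inputs_by_lengthscale(example, top_k=2))
```

Keep in mind that each ranking belongs to one latent GP; in an LCM every output is a weighted mix of all latent GPs, with the weights given by the coregionalization matrices.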