Tags: python, time-series, forecasting, lightgbm, u8darts

Darts and LightGBM: original column names cannot be retrieved for feature importance


Problem

I am running a LightGBMModel via Darts with some (future) covariates. I want to understand the relevance of the different (lagged) features.

In particular, I would like to retrieve the feature importance for the lagged target variable as well as for the covariates, using the original column names from the Darts TimeSeries object. After fitting, the LightGBM model object only exposes generic column names ("Column_0", "Column_1", ...). How can I map these to meaningful names (e.g., target_lag_1, target_lag_2, name_of_covariate_lag_1, ...)?

I want to include several future covariates (e.g., several datetime attributes like day of week with different encodings). It does not matter where the datetime attributes are created (e.g., using pandas, using Darts itself).

Minimal reproducible example

I adopted the example from the documentation.

This is the code from the documentation, just setting up the data and fitting the model:

from darts.datasets import WeatherDataset
from darts.models import LightGBMModel


series = WeatherDataset().load()


# predicting atmospheric pressure
target = series['p (mbar)'][:100]


# optionally, use past observed rainfall (pretending to be unknown beyond index 100)
past_cov = series['rain (mm)'][:100]


# optionally, use future temperatures (pretending this component is a forecast)
future_cov = series['T (degC)'][:106]


# predict 6 pressure values using the 12 past values of pressure and rainfall, as well as the 6 temperature
# values corresponding to the forecasted period
model = LightGBMModel(
    lags=12,
    lags_past_covariates=12,
    lags_future_covariates=[0,1,2,3,4,5],
    output_chunk_length=6,
    verbose=-1
)


model.fit(target, past_covariates=past_cov, future_covariates=future_cov)

Having fitted the model, I now want to analyze the importance of the features.

for i, estimator in enumerate(model.model.estimators_):
    print(f"Target {i} Importance (Gain):")

    # Access LightGBM booster
    booster = estimator.booster_

    # Get feature names
    feature_names = booster.feature_name()

    # Get gain-based importance
    importance = booster.feature_importance(importance_type='gain')

    # Create mapping
    named_importance = dict(zip(feature_names, importance))
    print(named_importance)

This returns the feature importances for the columns of each estimator, but the feature names are the generic ones generated by LightGBM ('Column_0', 'Column_1', ...). I do not know how to link these back to the original column names in the Darts TimeSeries object (e.g., 'rain (mm)', 'T (degC)'), together with the lag each importance refers to.


Solution

  • The features that go into the models are available in model.lagged_feature_names.

    One of the Darts authors addressed feature importances in Issue #1826, doing mostly what you have done. That answer is also referenced in Issue #2125, together with a note about the feature names.
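
As a minimal sketch of how the two pieces might be combined: the helper below (named_importances is a hypothetical name, not part of Darts or LightGBM) pairs a list of feature names with one estimator's importance values. It assumes that model.lagged_feature_names is in the same order as the columns Darts passes to LightGBM, which is worth verifying by comparing its length with booster.num_feature(). The demo names and values are made up for illustration; the exact name format Darts produces may differ.

```python
def named_importances(lagged_feature_names, importances):
    """Pair feature names with importance values, largest first.

    Hypothetical helper: assumes both sequences describe the same
    columns in the same order.
    """
    if len(lagged_feature_names) != len(importances):
        raise ValueError(
            f"{len(lagged_feature_names)} names but {len(importances)} importances"
        )
    pairs = zip(lagged_feature_names, (float(v) for v in importances))
    # Sort by importance, descending; dicts keep insertion order.
    return dict(sorted(pairs, key=lambda kv: kv[1], reverse=True))


# Illustrative values only (not real model output).
demo_names = ["p (mbar)_target_lag-2", "p (mbar)_target_lag-1", "T (degC)_futcov_lag0"]
demo_gain = [1.5, 10.0, 4.0]
print(named_importances(demo_names, demo_gain))

# With the fitted model from the question (requires darts and lightgbm):
# for step, estimator in enumerate(model.model.estimators_):
#     gain = estimator.booster_.feature_importance(importance_type="gain")
#     print(f"Target {step}:", named_importances(model.lagged_feature_names, gain))
```

Raising on a length mismatch is a deliberate choice here: a silent zip would truncate to the shorter sequence and could attach importances to the wrong lags.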