Tags: python, machine-learning, scikit-learn, lightgbm

Why can't I wrap LGBM?


I'm using LGBM to forecast the relative change of a numerical quantity. I'm using the MSLE (Mean Squared Log Error) loss function to optimize my model and to get the correct scaling of errors. Since MSLE isn't native to LGBM, I have to implement it myself. But lucky me, the math can be simplified a ton. This is my implementation:

class MSLELGBM(LGBMRegressor):
    def __init__(self,  **kwargs): 
        super().__init__(**kwargs)

    def predict(self, X):
        return np.exp(super().predict(X))
    
    def fit(self, X, y, eval_set=None, callbacks=None):
        y_log = np.log(y.copy())
        print(super().get_params())  # This doesn't print any kwargs
        if eval_set:
            eval_set = [(X_eval, np.log(y_eval.copy())) for X_eval, y_eval in eval_set]
        super().fit(X, y_log, eval_set=eval_set, callbacks=callbacks)

As you can see, it's very minimal. I basically just need to apply a log transform to the model target, and exponentiate the predictions to return to our own non-logarithmic world.
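
For reference, the simplification I'm relying on (assuming strictly positive targets, so the usual +1 inside the log can be dropped) is that MSLE on the original scale is just ordinary squared error on the log scale:

\mathrm{MSLE}(y, \hat{y}) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( \log \hat{y}_i - \log y_i \right)^2 \;=\; \mathrm{MSE}(\log y,\, \log \hat{y})

which is why fitting LGBM's default L2 objective on log(y) and exponentiating the predictions should optimize MSLE for free.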

However, my wrapper doesn't work. I call the class with:

model = MSLELGBM(**lgbm_params)
model.fit(data[X_cols_all], data[y_col_train]) 

And I get the following exception:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[31], line 38
     32 callbacks = [
     33     lgbm.early_stopping(10, verbose=0), 
     34     lgbm.log_evaluation(period=0),
     35 ]
     37 model = MSLELGBM(**lgbm_params)
---> 38 model.fit(data[X_cols_all], data[y_col_train]) 
     40 feature_importances_df = pd.DataFrame([model.booster_.feature_importance(importance_type='gain')], columns=X_cols_all).T.sort_values(by=0, ascending=False)
     41 feature_importances_df.iloc[:30]

Cell In[31], line 17
     15 if eval_set:
     16     eval_set = [(X_eval, np.log(y_eval.copy())) for X_eval, y_eval in eval_set]
---> 17 super().fit(X, y_log, eval_set=eval_set, callbacks=callbacks)

File c:\X\.venv\lib\site-packages\lightgbm\sklearn.py:1189, in LGBMRegressor.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, feature_name, categorical_feature, callbacks, init_model)
   1172 def fit(  # type: ignore[override]
   1173     self,
   1174     X: _LGBM_ScikitMatrixLike,
   (...)
   1186     init_model: Optional[Union[str, Path, Booster, LGBMModel]] = None,
   1187 ) -> "LGBMRegressor":
   1188     """Docstring is inherited from the LGBMModel."""
...
--> 765 if isinstance(params["random_state"], np.random.RandomState):
    766     params["random_state"] = params["random_state"].randint(np.iinfo(np.int32).max)
    767 elif isinstance(params["random_state"], np.random.Generator):

KeyError: 'random_state'

I have no idea how random_state can be missing from the fit method, as it isn't even required for that function. I get the impression that this is a complicated software engineering issue that's above my head. Does anybody know what's up?

If it's of any help, I tried illustrating what I want using a simpler non-LGBM structure:

[image: passing kwargs]

I just want to pass whatever parameters I give MSLELGBM straight through to the underlying LGBM estimator, but I'm running into a ton of issues doing so.
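
In plain Python (made-up Base/Wrapper classes, nothing LGBM-specific), this is roughly the pattern I assumed would carry over:

class Base:
    def __init__(self, random_state=None, learning_rate=0.1):
        self.random_state = random_state
        self.learning_rate = learning_rate


class Wrapper(Base):
    def __init__(self, **kwargs):
        # just forward everything to the parent untouched
        super().__init__(**kwargs)


w = Wrapper(random_state=42, learning_rate=0.05)
print(w.random_state, w.learning_rate)  # 42 0.05 -- works for plain classes, apparently not for sklearn estimators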


Solution

  • Root Cause

    scikit-learn expects that each of the keyword arguments to an estimator's __init__() will exactly correspond to a public attribute on instances of the class. Per https://scikit-learn.org/stable/developers/develop.html

    every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection

    The .get_params() method on estimators takes advantage of this by inspecting the signature of __init__() to figure out which attributes to expect (scikit-learn / sklearn / base.py).

    lightgbm's estimators call .get_params() and then expect the key "random_state" to exist in the dictionary it returns... because that parameter is in the keyword arguments to LGBMRegressor (LightGBM / python-package / lightgbm / sklearn.py).

    Your estimator's __init__() does not have random_state as a keyword argument, so when self.get_params() is called it returns a dictionary that does not contain "random_state", leading to the error you observed.
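
    You can reproduce the introspection behaviour in isolation with plain scikit-learn (a minimal sketch; Parent and Child are stand-in names, not LightGBM classes):

    from sklearn.base import BaseEstimator


    class Parent(BaseEstimator):
        def __init__(self, random_state=None):
            self.random_state = random_state


    class Child(Parent):
        # only **kwargs in the signature, so sklearn's signature inspection
        # finds no parameter names on this class
        def __init__(self, **kwargs):
            super().__init__(**kwargs)


    print(Parent(random_state=0).get_params())  # {'random_state': 0}
    print(Child(random_state=0).get_params())   # {} -- 'random_state' has disappeared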

  • How to fix this

    If you do not need to add any other custom parameters, then just do not define an __init__() method on your subclass.

    Here's a minimal, reproducible example that works with lightgbm 4.5.0 and Python 3.11:

    import numpy as np
    from lightgbm import LGBMRegressor
    from sklearn.datasets import make_regression
    
    
    class MSLELGBM(LGBMRegressor):
    
        def predict(self, X):
            return np.exp(super().predict(X))
    
        def fit(self, X, y, eval_set=None, callbacks=None):
            y_log = np.log(y.copy())
            if eval_set:
                eval_set = [(X_eval, np.log(y_eval.copy())) for X_eval, y_eval in eval_set]
            super().fit(X, y_log, eval_set=eval_set, callbacks=callbacks)
            return self  # scikit-learn convention: fit() returns the fitted estimator
    
    # modifying bias and tail_strength to ensure every value in 'y' is positive
    X, y = make_regression(
        n_samples=5_000,
        n_features=3,
        bias=500.0,
        tail_strength=0.001,
        random_state=708,
    )
    
    reg = MSLELGBM(num_boost_round=5)
    
    # print params (you'll see all the LGBMRegressor params)
    print(reg.get_params())
    
    # fit the model
    reg.fit(X, y)
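
    Note that reg.predict() will return values on the original (non-log) scale, because the overridden predict() exponentiates what the parent class predicts on the log-transformed target.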
    

    If you do need to define any custom parameters, then for lightgbm<=4.5.0 you have to re-declare the parent estimator's parameters explicitly in your own __init__() and pass them through, like this:

    class MSLELGBM(LGBMRegressor):
        
        # just including 'random_state' to keep it short... you
        # need to include more params here, depending on LightGBM version
        def __init__(self, random_state=None, **kwargs):
            super().__init__(
                random_state=random_state,
                **kwargs
            )
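
    For example, here is a sketch that adds one made-up custom parameter next to the re-declared random_state ('log_offset' is purely illustrative); any LGBMRegressor parameter you do not re-declare in the signature will no longer be reported by get_params():

    from lightgbm import LGBMRegressor


    class MSLELGBM(LGBMRegressor):

        # 'log_offset' is a hypothetical custom parameter used only to show the
        # pattern; as noted above, you also need to re-declare every other
        # LGBMRegressor parameter you still want to pass through.
        def __init__(self, log_offset=0.0, random_state=None, **kwargs):
            self.log_offset = log_offset
            super().__init__(random_state=random_state, **kwargs)


    reg = MSLELGBM(log_offset=1.0, random_state=708)
    print(reg.get_params())  # includes 'random_state' again, so the KeyError in fit() is gone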