The following is the .cv API of lightgbm:
lightgbm.cv(params, train_set, num_boost_round=100, folds=None, nfold=5, stratified=True, shuffle=True, metrics=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', fpreproc=None, seed=0, callbacks=None, eval_train_metric=False, return_cvbooster=False)
There is a parameter categorical_feature:
Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well).
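To make the two forms concrete, here is a minimal sketch with made-up toy data and feature names, assuming a LightGBM version that matches the signature quoted above (stratified=False because the toy target is continuous):

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 0] = rng.integers(0, 4, size=100)  # integer-coded categorical column
y = rng.normal(size=100)
params = {"objective": "regression", "verbosity": -1}

# list of int: interpreted as column indices
lgb.cv(params, lgb.Dataset(X, label=y), num_boost_round=5,
       stratified=False, categorical_feature=[0])

# list of str: interpreted as feature names (feature_name must be given too)
lgb.cv(params, lgb.Dataset(X, label=y), num_boost_round=5,
       stratified=False, feature_name=["f0", "f1", "f2"],
       categorical_feature=["f0"])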
Now, the .train API:
lightgbm.train(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', keep_training_booster=False, callbacks=None)
Here, too, there is a categorical_feature parameter, and its documentation is the same as above.
Now, as you will notice, both of these APIs consume a lightgbm Dataset, which itself takes a categorical_feature parameter with exactly the same documentation.
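For completeness, a minimal sketch of that third place, the Dataset constructor itself, again with made-up toy data:

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = rng.integers(0, 4, size=100)  # integer-coded categorical column
y = rng.normal(size=100)

# the same two forms (indices or names) are accepted here as well
dtrain = lgb.Dataset(X, label=y, categorical_feature=[1])
dtrain_named = lgb.Dataset(
    X,
    label=y,
    feature_name=["f0", "f1", "f2"],
    categorical_feature=["f1"],
)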
Questions:

1. If both are specified, which one takes precedence?
2. Which one is the suggested place to specify categorical_feature?
3. Are the two choices in any way different internally, in how the LightGBM pipeline works?
Always prefer passing categorical_feature to lightgbm.Dataset, and ignore the corresponding argument to lightgbm.cv() / lightgbm.train().
The categorical_feature argument passed into lightgbm.cv() / lightgbm.train() is only used in one place: a call to Dataset.set_categorical_feature() inside the lightgbm.cv() / lightgbm.train() function. At best, this will be useless and will not update the Dataset. At worst, it can cause an error if the raw data is no longer available.
import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_features=10)

# construct a Dataset whose categorical features are columns 1 and 4,
# freeing the raw data once the Dataset has been built
dtrain = lgb.Dataset(
    X,
    label=y,
    categorical_feature=[1, 4],
    free_raw_data=True
)
dtrain.construct()

# passing a *different* categorical_feature to train() forces LightGBM to
# rebuild the Dataset, which is impossible once the raw data has been freed
bst = lgb.train(
    params={"objective": "regression"},
    train_set=dtrain,
    categorical_feature=[1, 3]
)
# lightgbm.basic.LightGBMError: Cannot set categorical feature after freed raw data,
# set free_raw_data=False when construct Dataset to avoid this.

These competing patterns have been in the lightgbm package since September 2017 (this commit). The ability to specify categorical_feature in both interfaces was added mainly for convenience; it isn't functionally different from passing those arguments to lightgbm.Dataset().
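For contrast, here is a minimal sketch of the recommended pattern: declare the categorical features once, on the Dataset, and leave the categorical_feature argument to lightgbm.train() / lightgbm.cv() at its default:

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_features=10)

# declare the categorical features once, on the Dataset
dtrain = lgb.Dataset(X, label=y, categorical_feature=[1, 4])

# train() picks them up from the Dataset; its own categorical_feature
# argument stays at the default 'auto'
bst = lgb.train(
    params={"objective": "regression"},
    train_set=dtrain,
)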