I'm working on a search problem: I have a dataset of queries and URLs. Each (query, url) pair has a relevance score (the target), a float that should preserve the ordering of the URLs for a given query. I would like to perform cross-validation for my lightgbm.LGBMRanker model with NDCG as the objective.
I went through the documentation and saw that it is important to keep instances of the same group together, because a group is actually a query with all of its associated URLs.
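To make that concrete, here is a toy illustration of how I understand the groups (made-up numbers, not my real data):
import numpy as np

# 2 queries: the first query has 3 candidate URLs, the second has 2
serp_sizes = np.array([3, 2])
# so rows 0-2 of X and y belong to query 1, and rows 3-4 belong to query 2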
However, I have an issue with this, as I get the following error:
ValueError: Computing NDCG is only meaningful when there is more than 1 document. Got 1 instead.
I used the debugger, and while no group in my dataset has fewer than 2 documents, I see smaller groups inside the _feval function, meaning the cv() function did not actually keep the groups together.
In lightgbm.cv I see no sign of the group argument that is used in LGBMRanker. However, the lightgbm.cv documentation states that "Values passed through params take precedence over those supplied via arguments". My understanding was that this value is passed to the underlying model of the cv function.
Here is the code that I have so far:
def eval_model(
    self,
    model: lightgbm.LGBMRanker,
    k_fold: int = 3,
    seed: int = 42,
):
    """Evaluates with NDCG"""

    def _feval(y_pred: np.ndarray, lgb_dataset: lightgbm.basic.Dataset):
        y_true = lgb_dataset.get_label()
        serp_sizes = lgb_dataset.get_group()
        ndcg_values = []
        start = 0
        for size in serp_sizes:
            end = start + size
            y_true_serp, y_pred_serp = y_true[start:end], y_pred[start:end]
            ndcg_serp = sklearn.metrics.ndcg_score(
                [y_true_serp], [y_pred_serp], k=10
            )
            ndcg_values.append(ndcg_serp)
            start = end
        eval_name = "my-ndcg"
        eval_result = np.mean(ndcg_values)
        greater_is_better = True
        return eval_name, eval_result, greater_is_better

    lgb_dataset = lightgbm.Dataset(data=self.X, label=self.y, group=self.serp_sizes)
    cv_results = lightgbm.cv(
        params={**model.get_params(), "group": self.serp_sizes},
        train_set=lgb_dataset,
        num_boost_round=1_000,
        nfold=k_fold,
        stratified=False,
        seed=seed,
        feval=_feval,
    )
    ndcg = np.mean(cv_results["my-ndcg"])
    return ndcg
Where is my mistake/misunderstanding? Is there a simple workaround to perform cross-validation using a lightgbm.LGBMRanker while keeping the groups together?
I would like to perform cross-validation for my lightgbm.LGBMRanker model with NDCG as the objective.
As of lightgbm==4.1.0 (the latest version as of this writing), lightgbm.sklearn.LGBMRanker cannot be used with scikit-learn's cross-validation APIs. It also cannot be passed to lightgbm.cv().
In lightgbm.cv I see no sign of the group argument that is used in LGBMRanker
As described in LightGBM's documentation (link), lightgbm.cv() expects to be passed a lightgbm.Dataset object. group is an attribute of that Dataset object.
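For example, here is a sketch (with placeholder names X, y, and serp_sizes) of how the group sizes are attached to a Dataset, either at construction time or afterwards with set_group():
import lightgbm as lgb

# placeholder names: X = feature matrix, y = relevance labels,
# serp_sizes = number of documents per query
dtrain = lgb.Dataset(data=X, label=y, group=serp_sizes)

# equivalently, attach the group sizes after construction
dtrain = lgb.Dataset(data=X, label=y)
dtrain.set_group(serp_sizes)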
To perform cross-validation of a LightGBM learning-to-rank model, use lightgbm.cv() instead of lightgbm.sklearn.LGBMRanker().
Here's a minimal, reproducible example using Python 3.11.7 and lightgbm==4.1.0.
import lightgbm as lgb
import numpy as np
import requests
from sklearn.datasets import load_svmlight_file
from tempfile import NamedTemporaryFile

# get training data from LightGBM examples
data_url = "https://raw.githubusercontent.com/microsoft/LightGBM/master/examples/lambdarank"
with NamedTemporaryFile(mode="w") as f:
    train_data_raw = requests.get(f"{data_url}/rank.train").text
    f.write(train_data_raw)
    f.flush()  # make sure the file is fully written before reading it back
    X, y = load_svmlight_file(f.name)

group = np.loadtxt(f"{data_url}/rank.train.query")

# create a LightGBM Dataset
dtrain = lgb.Dataset(
    data=X,
    label=y,
    group=group
)

# perform LambdaRank 3-fold cross-validation with 1 set of hyperparameters
cv_results = lgb.cv(
    train_set=dtrain,
    params={
        "objective": "lambdarank",
        "eval_at": 2,
        "num_iterations": 10
    },
    nfold=3,
    return_cvbooster=True
)

# check metrics
np.round(cv_results["valid ndcg@2-mean"], 3)
# array([0.593, 0.597, 0.64 , 0.632, 0.64 , 0.636, 0.655, 0.655, 0.653, 0.669])
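Because return_cvbooster=True was passed, the per-fold models are also available from the results dictionary. A small usage sketch, continuing from the example above:
# the fitted per-fold models are returned as a CVBooster
cvbooster = cv_results["cvbooster"]
print(len(cvbooster.boosters))  # 3, one Booster per fold

# e.g. score data with the first fold's model
preds_fold0 = cvbooster.boosters[0].predict(X)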
lightgbm.cv() will correctly preserve query groups when creating cross-validation folds.
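If you still want your own per-query NDCG (as in the _feval from your question), the same idea works here, because the validation Dataset handed to feval carries its group sizes. A sketch along those lines, which skips 1-document groups since sklearn's ndcg_score rejects them:
from sklearn.metrics import ndcg_score

def my_ndcg(y_pred, eval_dataset):
    y_true = eval_dataset.get_label()
    sizes = eval_dataset.get_group().astype(int)
    scores, start = [], 0
    for size in sizes:
        end = start + size
        if size > 1:  # ndcg_score requires at least 2 documents per query
            scores.append(ndcg_score([y_true[start:end]], [y_pred[start:end]], k=10))
        start = end
    return "my-ndcg", float(np.mean(scores)), True

cv_results = lgb.cv(
    train_set=dtrain,
    params={"objective": "lambdarank", "num_iterations": 10},
    nfold=3,
    feval=my_ndcg,
)
# per-iteration means across folds
cv_results["valid my-ndcg-mean"]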
Values passed through params take precedence over those supplied via arguments
In LightGBM's documentation, "param" refers specifically to the configuration described at https://lightgbm.readthedocs.io/en/v4.1.0/Parameters.html. The statement you've quoted does not apply to data like group, init_score, and label, and those things should not be passed through the params keyword argument in any of LightGBM's interfaces.
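Put differently, here is a short sketch (reusing the names from the example above) of where each piece belongs:
# configuration goes in params...
params = {"objective": "lambdarank", "eval_at": 2}

# ...while label and group are attached to the Dataset, not passed via params
dtrain = lgb.Dataset(data=X, label=y, group=group)
cv_results = lgb.cv(train_set=dtrain, params=params, nfold=3)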