pythonscikit-learngridsearchcv

I got the warning "UserWarning: One or more of the test scores are non-finite" when revising a toy scikit-learn gridsearchCV example


I have the following code which works normally but got a

UserWarning: One or more of the test scores are non-finite: [nan nan]
  category=UserWarning

when I revised it into a more concise version (shown in the subsequent code snippet). Is the output of the one-hot encoder the culprit of the issue?

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import RidgeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV

train = pd.read_csv('/train.csv')
test = pd.read_csv('/test.csv')
sparse_features = [col for col in train.columns if col.startswith('cat')]
dense_features = [col for col in train.columns if col not in sparse_features+['target']]
X = train.drop(['target'], axis=1)
y = train['target'].values
skf = StratifiedKFold(n_splits=5)
clf = RidgeClassifier()

full_pipeline = ColumnTransformer(transformers=[
    ('num', StandardScaler(), dense_features),
    ('cat', OneHotEncoder(), sparse_features)
])
X_prepared = full_pipeline.fit_transform(X)
param_grid = {
    'alpha': [ 0.1],
    'fit_intercept': [False]
}
gs = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=skf
)
gs.fit(X_prepared, y)

The revision is shown below.

clf2 = RidgeClassifier()
preprocess_pipeline2 = ColumnTransformer([
    ('num', StandardScaler(), dense_features),
    ('cat', OneHotEncoder(), sparse_features)
])
from sklearn.pipeline import Pipeline
final_pipeline = Pipeline(steps=[
    ('p', preprocess_pipeline2),
    ('c', clf2)
])
param_grid2 = {
    'c__alpha': [0.4, 0.1],
    'c__fit_intercept': [False]
}
gs2 = GridSearchCV(
    estimator=final_pipeline,
    param_grid=param_grid2,
    scoring='roc_auc',
    n_jobs=-1,
    cv=skf
)
gs2.fit(X, y)

Can anyone point out which part goes wrong?

EDIT: After setting error_score to raise, I can receive more feedback regarding the issue. It seems to me that I need to fit the one-hot encoder on the merged dataset that combines the training set and the test set. Am I correct? But if it is the case, why doesn't the first version complain about the same issue? BTW, does it make sense to introduce the argument handle_unknown='ignore' to handle this issue?

ValueError
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
    r = call_item()
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 620, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 200, in __call__
    sample_weight=sample_weight)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 334, in _score
    y_pred = method_caller(clf, "decision_function", X)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 53, in _cached_call
    return getattr(estimator, method)(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/metaestimators.py", line 120, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py", line 493, in decision_function
    Xt = transform.transform(Xt)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 565, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 444, in _fit_transform
    self._iter(fitted=fitted, replace_strings=True), 1))
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 1044, in __call__
    while self.dispatch_one_batch(iterator):
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py", line 733, in _transform_one
    res = transformer.transform(X)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 462, in transform
    force_all_finite='allow-nan')
  File "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 136, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['MR', 'MW', 'DA'] in column 10 during transform
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-48-b81f3b7b0724> in <module>
     21     cv=skf
     22 )
---> 23 gs2.fit(X, y)

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    839                 return results
    840 
--> 841             self._run_search(evaluate_candidates)
    842 
    843             # multimetric is determined here because in the case of a callable

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1286     def _run_search(self, evaluate_candidates):
   1287         """Search all candidates in param_grid"""
-> 1288         evaluate_candidates(ParameterGrid(self.param_grid))
   1289 
   1290 

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
    807                                    (split_idx, (train, test)) in product(
    808                                    enumerate(candidate_params),
--> 809                                    enumerate(cv.split(X, y, groups))))
    810 
    811                 if len(out) < 1:

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

ValueError: Found unknown categories ['MR', 'MW', 'DA'] in column 10 during transform

Solution

  • First, I'd like to say that I had a similar issue, and thanks for noting

    After setting error_score to raise

    which really helped me with my issue. I was using a custom transformer and I had some code that was creating variables in the training fold, and then not creating them in the validation fold because those categories were not present in validation. I think you're having a similar problem.

    Seems like OneHotEncoder is maybe creating some categories in your training fold and then finding new categories in your validation fold that are unknown to it because they didn't exist in the training fold.

    ValueError: Found unknown categories ['MR', 'MW', 'DA'] in column 10 during transform

    To get around this, my suggestion would be to look into using custom transformers instead since your data is more complex. https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65