Tags: python, pandas, scikit-learn, jupyter, sklearn-pandas

"The least populated class in y has only 1 ... groups for any class cannot be less than 2." Without train_test_split()


I am trying to run this code on a dataset relating Corona cases to Corona deaths. I cannot see how the way I split the data into X and y DataFrames would cause this error, but I do not fully understand the error either.

Does someone know what is wrong here?

import pandas as pd
import numpy as np
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn import preprocessing


#import csv
X_test = pd.read_csv("test.csv")
y_output = pd.read_csv("sample_submission.csv")

data_train = pd.read_csv("train.csv")
X_train = data_train.drop(columns=["Next Week's Deaths"])
y_train = data_train["Next Week's Deaths"]

#prepare for fit (transform Location strings into classes)
Location = data_train["Location"]
le = preprocessing.LabelEncoder()
le.fit(Location)

LocationToInt = le.transform(Location)
LocationDict = dict(zip(Location, LocationToInt))

X_train["Location"] = X_train["Location"].replace(LocationDict)


#train and run
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)

Traceback:

Input In [89], in <cell line: 29>()
     27 #train and run
     28 model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
---> 29 model.fit(X_train, y_train)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py:348, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
    343 # Save the state of the RNG for the training and validation split.
    344 # This is needed in order to have the same split when using
    345 # warm starting.
    347 if sample_weight is None:
--> 348     X_train, X_val, y_train, y_val = train_test_split(
    349         X,
    350         y,
    351         test_size=self.validation_fraction,
    352         stratify=stratify,
    353         random_state=self._random_seed,
    354     )
    355     sample_weight_train = sample_weight_val = None
    356 else:
    357     # TODO: incorporate sample_weight in sampling here, as well as
    358     # stratify

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:2454, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
   2450         CVClass = ShuffleSplit
   2452     cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
-> 2454     train, test = next(cv.split(X=arrays[0], y=stratify))
   2456 return list(
   2457     chain.from_iterable(
   2458         (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
   2459     )
   2460 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1613, in BaseShuffleSplit.split(self, X, y, groups)
   1583 """Generate indices to split data into training and test set.
   1584 
   1585 Parameters
   (...)
   1610 to an integer.
   1611 """
   1612 X, y, groups = indexable(X, y, groups)
-> 1613 for train, test in self._iter_indices(X, y, groups):
   1614     yield train, test

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1953, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
   1951 class_counts = np.bincount(y_indices)
   1952 if np.min(class_counts) < 2:
-> 1953     raise ValueError(
   1954         "The least populated class in y has only 1"
   1955         " member, which is too few. The minimum"
   1956         " number of groups for any class cannot"
   1957         " be less than 2."
   1958     )
   1960 if n_train < n_classes:
   1961     raise ValueError(
   1962         "The train_size = %d should be greater or "
   1963         "equal to the number of classes = %d" % (n_train, n_classes)
   1964     )

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.



Solution

  • The HistGradientBoostingClassifier internally splits your dataset into train and validation sets. The default validation size is 10% (see the validation_fraction parameter in the docs).

    In your case, there is a class with only a single example in it, so if that example goes to the train split, the classifier cannot validate on that class, and vice versa. The point is: you need at least two examples of each class.

    How to solve it? Well, first you need an appropriate diagnosis: run the following code to see which class is the problem:

    import numpy as np
    
    unq, cnt = np.unique(y_train, return_counts=True)
    
    for u, c in zip(unq, cnt):
        print(f"class {u} contains {c} examples")
    

    What to do now? Well, first make sure those counts make sense to you and that there is no earlier error (maybe reading your CSV incorrectly, or losing data some steps before).

    Then, if the problem persists, your options are the following:

      • Collect more examples of the underrepresented classes.
      • Drop the rows whose class has only one example.
      • Merge rare classes into a broader "other" class.

    What is the best alternative? It really depends on your problem.
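As an illustration, dropping the rows whose class appears only once could look like this (a sketch on a hypothetical toy frame; the column name "Next Week's Deaths" is taken from the question):

```python
import pandas as pd

# hypothetical toy frame standing in for data_train from the question
data_train = pd.DataFrame({
    "Location": ["A", "A", "B", "B", "C"],
    "Next Week's Deaths": [5, 5, 7, 7, 99],  # 99 occurs only once
})

# count how many examples each target class has
counts = data_train["Next Week's Deaths"].value_counts()

# keep only rows whose class appears at least twice
keep = data_train["Next Week's Deaths"].map(counts) >= 2
data_train = data_train[keep]

print(sorted(data_train["Next Week's Deaths"].unique()))  # [5, 7]
```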

    EDIT: The previous answer assumes you are solving a classification problem (telling which class an example belongs to). If you are solving a regression task (predicting a quantity), replace your HistGradientBoostingClassifier with HistGradientBoostingRegressor.