I am trying to run this code, using a dataset on the relation of Corona cases to Corona deaths. I see no reason why the error should be caused by the way I split the data into X and y DataFrames, but I do not fully understand the error either.
Does someone know what is wrong here?
import numpy as np
import pandas as pd
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn import preprocessing
#import csv
X_test = pd.read_csv("test.csv")
y_output = pd.read_csv("sample_submission.csv")
data_train = pd.read_csv("train.csv")
X_train = data_train.drop(columns=["Next Week's Deaths"])
y_train = data_train["Next Week's Deaths"]
#prepare for fit (encode Location strings as integers)
Location = data_train["Location"]
le = preprocessing.LabelEncoder()
le.fit(Location)
LocationToInt = le.transform(Location)
LocationDict = dict(zip(Location, LocationToInt))
X_train["Location"] = X_train["Location"].replace(LocationDict)
#train and run
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
model.fit(X_train, y_train)
Traceback:
Input In [89], in <cell line: 29>()
27 #train and run
28 model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
---> 29 model.fit(X_train, y_train)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\ensemble\_hist_gradient_boosting\gradient_boosting.py:348, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
343 # Save the state of the RNG for the training and validation split.
344 # This is needed in order to have the same split when using
345 # warm starting.
347 if sample_weight is None:
--> 348 X_train, X_val, y_train, y_val = train_test_split(
349 X,
350 y,
351 test_size=self.validation_fraction,
352 stratify=stratify,
353 random_state=self._random_seed,
354 )
355 sample_weight_train = sample_weight_val = None
356 else:
357 # TODO: incorporate sample_weight in sampling here, as well as
358 # stratify
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:2454, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2450 CVClass = ShuffleSplit
2452 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
-> 2454 train, test = next(cv.split(X=arrays[0], y=stratify))
2456 return list(
2457 chain.from_iterable(
2458 (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
2459 )
2460 )
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1613, in BaseShuffleSplit.split(self, X, y, groups)
1583 """Generate indices to split data into training and test set.
1584
1585 Parameters
(...)
1610 to an integer.
1611 """
1612 X, y, groups = indexable(X, y, groups)
-> 1613 for train, test in self._iter_indices(X, y, groups):
1614 yield train, test
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_split.py:1953, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
1951 class_counts = np.bincount(y_indices)
1952 if np.min(class_counts) < 2:
-> 1953 raise ValueError(
1954 "The least populated class in y has only 1"
1955 " member, which is too few. The minimum"
1956 " number of groups for any class cannot"
1957 " be less than 2."
1958 )
1960 if n_train < n_classes:
1961 raise ValueError(
1962 "The train_size = %d should be greater or "
1963 "equal to the number of classes = %d" % (n_train, n_classes)
1964 )
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
The HistGradientBoostingClassifier internally splits your dataset into train and validation sets. The default is 10% for validation (check out the validation_fraction parameter in the docs).
In your case, there is a class with a single example in it, so if that example goes to the train split, the classifier cannot validate this class, or vice versa. The point is: you need at least two examples of each class.
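You can reproduce the failing check in isolation. A minimal sketch with made-up toy arrays (not your data), showing that a stratified split refuses any class with a single member:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((4, 2))        # four dummy samples
y = np.array([0, 0, 0, 1])  # class 1 has only one member

# Raises the same ValueError: the least populated class in y has only 1 member
train_test_split(X, y, test_size=0.25, stratify=y)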
How to solve it? Well, first you need a proper diagnosis: run the following code to see which class is the problem:
import numpy as np
unq, cnt = np.unique(y_train, return_counts=True)
for u, c in zip(unq, cnt):
    print(f"class {u} contains {c} examples")
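If y_train is a pandas Series, the same diagnosis is a one-liner; sorting the counts puts the problematic classes first:

print(y_train.value_counts().sort_values().head())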
What to do now? Well, first make sure that those results make sense to you and that there is no earlier error (maybe you are reading your CSV incorrectly or losing data in some step before).
Then, if the problem persists, your options are the following:
Collect more data. Not always possible, but this is the best option.
Add synthetic data. imblearn, for instance, is a sklearn-like library for imbalanced problems like yours, and it provides several well-known oversampling methods (see the sketch after this list). You can also create your own synthetic data, since you know what it is.
Remove classes with a single example. This implies re-framing your problem a little, but it may work: just drop the row (again, see the sketch below). You can also re-label it to one of the closest labels; for instance, if you have the classes positive, negative and neutral, and a single example of the neutral class, maybe you can re-label it as negative.
Group classes with low cardinality. This is useful when you have multiple classes, say 10, and a few of them, say 3, have really few examples. You can merge those low-cardinality classes into a single class "other", converting your problem into a similar one with fewer but better populated classes: in the example, 8 instead of 10.
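A minimal sketch of the last three options, assuming y_train is a pandas Series. The threshold of 10 for "low cardinality" is an arbitrary illustration, and RandomOverSampler comes from imblearn (a separate install: pip install imbalanced-learn); pick whichever option fits your problem rather than running all three:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler

counts = y_train.value_counts()

# Option: drop rows whose class has a single example
singletons = counts[counts < 2].index
keep = ~y_train.isin(singletons)
X_train, y_train = X_train[keep], y_train[keep]

# Option: merge low-cardinality classes into one "other" class
rare = counts[counts < 10].index  # threshold chosen for illustration
y_train = y_train.where(~y_train.isin(rare), other="other")

# Option: randomly duplicate examples of rare classes until they are
# populated enough for the stratified split
X_train, y_train = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)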
What is the best alternative? It really depends on your problem.
EDIT
The previous answer assumes you are solving a classification problem (telling which class an example belongs to). If you are actually solving a regression task (predicting a quantity, which a target like "Next Week's Deaths" suggests), replace your HistGradientBoostingClassifier with a HistGradientBoostingRegressor.
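The swap is one line, keeping the rest of your pipeline unchanged; the regressor does not stratify its internal validation split, so this error disappears:

from sklearn.ensemble import HistGradientBoostingRegressor

model = HistGradientBoostingRegressor(max_bins=255, max_iter=100)
model.fit(X_train, y_train)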