I've been working with hierarchical time series and, as a result, needed to write my own CV splitter so that all timestamps and products are represented evenly in the test (validation) set. It worked fine with sklearn, but I can't make it work in PyCaret: best = compare_models() yields nothing at all. Here is the custom CV I used:
import numpy as np

class custom_cv:
    def __init__(self, train_end, test_size, n_splits): # val_size
        self.train_end = train_end
        # self.val_size = val_size
        self.test_size = test_size
        self.n_splits = n_splits

    def split(self, X):
        self.X = X
        for i in range(self.n_splits, 0, -1): # range(start, stop, step)
            tr_threshol = self.train_end - self.test_size*i
            te_threshol = tr_threshol + self.test_size
            # reset the index for both sets so the indices are positional
            tr_idx = np.array(self.X.reset_index(drop = True).index[self.X['N_month'] <= tr_threshol])
            te_idx = np.array(self.X.reset_index(drop = True).index[(self.X['N_month'] > tr_threshol) & (self.X['N_month'] <= te_threshol)])
            yield tr_idx, te_idx

custom_CV = custom_cv(train_end = 365, test_size = 28, n_splits = 5)
# custom_CV = custom_CV.split(X = df)
My data looks like this: [image]
For sklearn I used the following loop:
def custom_cv(df, train_end = 36, test_size = 4, n_splits = 4):
    cv_idx = []
    for i in range(n_splits, 0, -1): # range(start, stop, step)
        tr_threshol = train_end - test_size*i
        te_threshol = tr_threshol + test_size
        tr_idx = list(df.reset_index(drop = True).index[df['N_month'] <= tr_threshol])
        te_idx = list(df.index[(df['N_month'] > tr_threshol) & (df['N_month'] <= te_threshol)])
        cv_idx.append((tr_idx, te_idx))
    return cv_idx

custom_CV = custom_cv(df = df, train_end = 365, test_size = 28, n_splits = 5)
However, PyCaret requires a custom CV generator object compatible with scikit-learn (something I've never dealt with before). I can't figure out what exactly is wrong, and I hope you can help me out.
Your class for PyCaret is probably missing the get_n_splits method. I had a similar problem and solved it with a class structure like the one described here:
How to generate a custom cross-validation generator in scikit-learn?
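A minimal sketch of what that class structure could look like, assuming the data has an integer N_month column as in the question. The class name MonthBlockCV, the time_col parameter and the toy DataFrame are made up for illustration. Handing the object to PyCaret is only shown as a comment and is not verified here; PyCaret may also pass preprocessed data to split, in which case the N_month column has to survive preprocessing.

```python
import numpy as np
import pandas as pd


class MonthBlockCV:
    """Expanding-window splitter keyed on an integer time column (hypothetical name)."""

    def __init__(self, train_end, test_size, n_splits, time_col="N_month"):
        self.train_end = train_end
        self.test_size = test_size
        self.n_splits = n_splits
        self.time_col = time_col  # assumption: the column is present in X

    def get_n_splits(self, X=None, y=None, groups=None):
        # scikit-learn style splitters report how many folds they will yield
        return self.n_splits

    def split(self, X, y=None, groups=None):
        # y and groups are accepted for API compatibility but not used
        t = np.asarray(X[self.time_col])
        for i in range(self.n_splits, 0, -1):
            tr_threshold = self.train_end - self.test_size * i
            te_threshold = tr_threshold + self.test_size
            # positional indices, independent of the DataFrame's index labels
            train_idx = np.flatnonzero(t <= tr_threshold)
            test_idx = np.flatnonzero((t > tr_threshold) & (t <= te_threshold))
            yield train_idx, test_idx


# Toy data: 3 products observed over 12 periods each
df = pd.DataFrame({
    "N_month": np.tile(np.arange(1, 13), 3),
    "product": np.repeat(list("abc"), 12),
})

cv = MonthBlockCV(train_end=12, test_size=2, n_splits=3)
folds = list(cv.split(df))

# Possible PyCaret usage (assumption, not run here):
# setup(data=df, target=..., fold_strategy=cv)
# best = compare_models()
```

Each fold trains on everything up to a threshold and tests on the next test_size periods, so every product contributes rows to every test block.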