python machine-learning catboost

CatBoost randomized_search with train/test splits


The CatBoost documentation says the randomized_search method can accept explicit train and test splits via the cv parameter, as an alternative to specifying a cross-validation scheme. To do this, one should provide:

An iterable yielding train and test splits as arrays of indices.

How do we define this object?

As a broken example, say my feature dataset has 10 rows. I want to use the first 5 rows for training, and the last 5 rows for validation/testing.

I extract the index values:

train_index = X[0:5].index
test_index = X[5:10].index

I supply the indexes to the randomized_search method:

a_search = model.randomized_search(param_distributions=params, 
                                   X = X,
                                   y = y,
                                   n_iter=5,
                                   cv={train_index,test_index})

The set I pass as cv={train_index, test_index} is a non-starter: a set is unordered, and pandas Index objects are not hashable, so the set cannot even be constructed. I am at a loss as to how such an iterable should look. I simply want to define which rows of X and y should be used for training, and which for testing. The goal is to speed up tuning by dispensing with cross-validation and using a dedicated validation dataset instead.


Solution

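  • You could use a list containing a single (train, test) tuple of indices. A list is iterable, and each tuple it yields is treated as one split: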

    cv=[(train_index, test_index)]
    

    Example:

    import pandas as pd
    import numpy as np
    from catboost import CatBoost
    
    # generate the data
    data = np.random.normal(loc=0, scale=1, size=(10, 3))
    labels = np.mean(a=data, axis=1)
    
    df = pd.DataFrame(
        data=np.hstack([data, np.expand_dims(labels, axis=1)]),
        columns=['x1', 'x2', 'x3', 'y']
    )
    
    # split the data
    train_index = df.index[:5]
    test_index = df.index[5:]
    
    # instantiate the model
    model = CatBoost()
    
    # tune the model
    results = model.randomized_search(
        param_distributions={
            'depth': [2, 3, 4],
            'iterations': [5, 6, 7]
        },
        X=df[['x1', 'x2', 'x3']],
        y=df['y'],
        cv=[(train_index, test_index)],
    )