python nlp catboost

How to feed text features into catboost model.predict


I am trying to use CatBoost for an NLP multiclass classification problem, where each sentence has to be assigned one of several labels.

Training works fine when I pass the text_features parameter to fit:

model.fit(x_train, y_train, text_features=['text'])

However, when I want to use the text features in the test data, I see no option to declare them, and I get the following error:

preds_class = model.predict(X_test)

_catboost.CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=1]="The Syro-Malabar Catholic Eparchy of Rajkot is an Eastern Catholic eparchy in India under the Syro-Malabar Catholic Church.

There is no text_features option here, so I don't understand how this is supposed to work.

If someone could clarify how to do this, that would be great.

Thanks


Solution

  • In CatBoost, text features are declared to the model the same way categorical features are. Unless you wrap the data in CatBoost's Pool class, where columns can be referred to by name, text columns can be specified by their column index, like this:

    # Assumes the text columns are the first columns of x_train.
    text_cols = ['text_1', 'text_2']
    model.fit(x_train, y_train, text_features=list(range(len(text_cols))))
    
    

    Also, FYI: in CatBoost, categorical and text columns go through the preprocessor before numeric ones, so the column order is always reassigned to place them ahead of the numeric columns.
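
    Since predict takes no text_features argument, the test data presumably has to follow the same column layout the model saw in fit, or be wrapped in a Pool that declares the text columns again. A minimal sketch of deriving the indices from column names and checking the layout first. The helper names text_feature_indices and assert_same_layout and the column names are hypothetical, not part of CatBoost, and the CatBoost calls are shown only as comments:

```python
def text_feature_indices(columns, text_cols):
    """Return the positions of text_cols within columns, in the order given."""
    positions = {name: i for i, name in enumerate(columns)}
    missing = [c for c in text_cols if c not in positions]
    if missing:
        raise KeyError(f"text columns not found: {missing}")
    return [positions[c] for c in text_cols]


def assert_same_layout(train_columns, test_columns):
    """Raise if the test data does not use the training column order."""
    if list(train_columns) != list(test_columns):
        raise ValueError(
            f"column mismatch: train={list(train_columns)}, "
            f"test={list(test_columns)}"
        )


# Hypothetical column names; with pandas these would be list(df.columns).
train_columns = ["id", "text", "length"]
test_columns = ["id", "text", "length"]

idx = text_feature_indices(train_columns, ["text"])
assert_same_layout(train_columns, test_columns)
print(idx)  # [1]

# With CatBoost installed, the indices could then be used like this
# (not executed here; x_train, y_train, X_test and model come from the question):
#   from catboost import Pool
#   model.fit(Pool(x_train, y_train, text_features=idx))
#   preds_class = model.predict(Pool(X_test, text_features=idx))
```

    If the layout check fails, reordering the test columns to match the training columns (for a DataFrame, X_test[list(x_train.columns)]) is likely the simplest fix.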