Tags: python, pandas, parquet, catboost, catboostregressor

Loading data into Catboost Pool object


I'm training a CatBoost model and using a Pool object as follows:

from catboost import Pool

pool = Pool(data=x_train, label=y_train, cat_features=cat_cols)
eval_set = Pool(data=x_validation, label=y_validation['Label'], cat_features=cat_cols)

model.fit(pool, early_stopping_rounds=EARLY_STOPPING_ROUNDS, eval_set=eval_set)

x_train, y_train, x_validation, and y_validation are Pandas DataFrames (the datasets are saved as Parquet files, which I read into DataFrames with PyArrow). model is a CatBoost classifier/regressor.

I'm trying to optimize for large datasets, and my questions are:

  1. When reading the dataset into a Pandas DataFrame (using PyArrow) and then creating the Pool object, am I actually doubling the amount of memory used to store the dataset? My understanding is that the data is copied to build the Pool rather than referenced.
  2. Is there a more efficient way to create the Pool, for example by loading it directly from a libsvm file, as mentioned here: https://catboost.ai/docs/concepts/python-usages-examples.html#load-the-dataset-from-a-file?
  3. Is there any way to load the data into the Pool in batches, instead of loading everything into memory up front?

Solution

    1. Yes, unfortunately, the amount of RAM used is doubled, so it's better to convert your data to a file format CatBoost understands first and then create your Pool from that file. The extra RAM is needed to quantize the dataset. You can prepare a Pool from, say, a big Pandas DataFrame (which has to be loaded into RAM), delete the DataFrame, quantize the Pool, and save it if you think you will have to repeat the training later. Note that only a quantized Pool can be saved. Always save the quantization borders when you do so, otherwise you won't be able to create auxiliary datasets (like the validation one), since they need the same quantization. Simple file formats like csv/tsv CatBoost can read (and quantize) directly from disk; there is a helper function for that in the utils module. See the first sketch after this list.
    2. Yes, exactly as in the example you cited (a short example follows the list).
    3. You can load batches manually using batched training, or go with training continuation. Both will work for your purpose; I have tried them. Training continuation looks simpler (you only have to provide init_model), but you won't be able to train on GPU (at least, currently). You will also be limited to symmetric trees, plus some further restrictions on hyperparameters. With batched training you can use GPUs. A sketch of training continuation closes this answer.
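
A minimal sketch of the workflow from point 1, assuming the current Python API (Pool.quantize(), Pool.save(), Pool.save_quantization_borders(), and the quantized:// path prefix). The file names, column names, and cat_cols are placeholders, and parameter names may differ slightly between CatBoost versions:

from catboost import Pool
import pandas as pd

cat_cols = ["Categ"]  # placeholder: your categorical column names

# The raw DataFrame has to fit in RAM once.
df = pd.read_parquet("train.parquet")
y_train = df.pop("Label")

train_pool = Pool(data=df, label=y_train, cat_features=cat_cols)
del df, y_train  # free the DataFrame copies once the Pool owns the data

# Quantize in place, save the borders so other pools can reuse them,
# then save the pool itself (only a quantized pool can be saved).
train_pool.quantize()
train_pool.save_quantization_borders("borders.tsv")
train_pool.save("train_quantized.bin")

# Later (or in another process) the quantized pool loads straight from disk:
# train_pool = Pool("quantized://train_quantized.bin")

# The validation pool must be quantized with the same borders:
df_val = pd.read_parquet("validation.parquet")
y_val = df_val.pop("Label")
eval_pool = Pool(data=df_val, label=y_val, cat_features=cat_cols)
eval_pool.quantize(input_borders="borders.tsv")

# For csv/tsv there is also catboost.utils.quantize(), which quantizes
# straight from disk without building a DataFrame first.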
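
For point 2, creating the Pool directly from a file looks roughly like this; the paths and the column-description layout are placeholders, and the libsvm:// prefix is the scheme the linked docs describe for libsvm-format files:

from catboost import Pool

# train.cd is a tab-separated column description, e.g.
#   0<TAB>Label
#   5<TAB>Categ
train_pool = Pool("train.tsv", column_description="train.cd", delimiter="\t", has_header=False)

# libsvm-format files can be passed with a scheme prefix instead:
# train_pool = Pool("libsvm://train.libsvm")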
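
And a rough sketch of training continuation for point 3, assuming the training data can be read chunk by chunk (here by Parquet row group via PyArrow). The iteration count, file names, and cat_cols are placeholders, and the CPU-only / symmetric-tree caveats above apply:

import pyarrow.parquet as pq
from catboost import CatBoostRegressor, Pool

cat_cols = ["Categ"]  # placeholder: your categorical column names

pf = pq.ParquetFile("train.parquet")
model = None
for i in range(pf.num_row_groups):
    # Only one batch lives in memory at a time.
    batch = pf.read_row_group(i).to_pandas()
    y = batch.pop("Label")
    batch_pool = Pool(data=batch, label=y, cat_features=cat_cols)
    del batch, y

    booster = CatBoostRegressor(iterations=200, task_type="CPU")
    # init_model continues from the trees learned on the previous batch.
    booster.fit(batch_pool, init_model=model)
    model = booster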