xgboost azure-machine-learning-service categorical-data

Azure ML: XGBoost categorical issues

XGBoost categorical issues during time series prediction

I am seeing categorical variables converted back to strings, which causes failures. The problem appears during model serialization (log_model), even though enable_categorical=True is set consistently across training, logging, and inference. MLflow does not seem to preserve the categorical dtypes when logging the model. How can I fix this?

Details

Requirement:
An XGBoost model for time series prediction without encoding the categorical variables. They should be used as categoricals across training, logging, and inference. There are more than 200 unique combinations of the category variables.
Dataset:
Contains 4 columns:
- MACHINE_SERIAL_NUMBER
- RECIPE
- Date
- Volume
Each combination of MACHINE_SERIAL_NUMBER and RECIPE is unique, and the data for each combination is independent of the others.
Issue #1 (during model serialization log_model):
The columns MACHINE_SERIAL_NUMBER and RECIPE are converted to the category dtype, and enable_categorical=True is set before training. However, they are converted back to object dtype during log_model, which results in this error:

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns: MACHINE_SERIAL_NUMBER: object, RECIPE: object

Issue #2 (during inference - real time endpoint):
How should categorical values be passed in the JSON sent to the inference endpoint? Requests currently fail with the same error as in Issue #1.

Log file for Error #1 during model serialization (log_model):

WARNING mlflow.utils.requirements_utils: Failed to run predict on input_example, dependencies introduced in predict are not captured. ValueError('DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object')
Traceback (most recent call last): 
12:28:55 WARNING mlflow.models.model: Failed to validate serving input example { "dataframe_split": { "columns": [ "MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day" ], "data": [ [ "BTB0001241", "DECAF", 2024, 2, 9 ] ] } }. Alternatively, you can avoid passing input example and pass model signature instead when logging the model. To ensure the input example is valid prior to serving, please try calling `mlflow.models.validate_serving_input` on the model uri and serving input example. A serving input example can be generated from model input example using `mlflow.models.convert_input_example_to_serving_input` function.

Issue #2 Details (during inference predict):

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns: MACHINE_SERIAL_NUMBER: object, RECIPE: object
The above exception was the direct cause of the following exception:

Example of DataFrame used:

MACHINE_SERIAL_NUMBER	RECIPE	Volume	Year	Month	Day
B1241	COFFE	61.700	2024	4	24
B1241	COFFE	216.139	2024	4	25
B1241	COFFE	184.455	2024	4	26
B1241	COFFE	334.700	2024	4	27
B1241	COFFE	152.100	2024	4	28

Training & Conversion Code

import mlflow
import mlflow.xgboost
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from mlflow.models.signature import infer_signature
import joblib

# Load Dataset
df = pd.read_csv("./data.csv")

# Convert Date column to datetime
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")  # Ensure correct format

# Extract numerical date features
df["Year"] = df["Date"].dt.year.astype(np.int64)
df["Month"] = df["Date"].dt.month.astype(np.int64)
df["Day"] = df["Date"].dt.day.astype(np.int64)

# Drop original Date column
df.drop(columns=["Date"], inplace=True)

# Convert categorical columns
df["MACHINE_SERIAL_NUMBER"] = df["MACHINE_SERIAL_NUMBER"].astype("category")
df["RECIPE"] = df["RECIPE"].astype("category")

# Split into X and y
y = df["Volume"]
X = df.drop(columns=["Volume"])

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Model
model = XGBRegressor(enable_categorical=True, tree_method="hist")
model.fit(X_train, y_train)
joblib.dump(model, "model.pkl")

# Make Predictions
y_pred = model.predict(X_test)

# Prepare Input Example
input_example = pd.DataFrame([{
    'MACHINE_SERIAL_NUMBER': 'BTB0001241',
    'RECIPE': 'DECAF',
    'Year': 2024,
    'Month': 2,
    'Day': 9
}])

# Convert input_example to match training data types
input_example["MACHINE_SERIAL_NUMBER"] = input_example["MACHINE_SERIAL_NUMBER"].astype("category")
input_example["RECIPE"] = input_example["RECIPE"].astype("category")

# Infer Signature
signature = infer_signature(input_example, y_pred)

# MLflow Logging Code
with mlflow.start_run():
    mlflow.xgboost.log_model(
        model,
        artifact_path="xgb_model",
        signature=signature,
        input_example=input_example,
        registered_model_name="XGBoost_TimeSeries_Model",
    )
    mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})

Example JSON used during Inference

{
    "input_data": {
        "columns": [
            "MACHINE_SERIAL_NUMBER",
            "RECIPE",
            "Year",
            "Month",
            "Day"
        ],
        "data": [
            [
                "BTB0001241",
                "DECAF",
                2025,
                3,
                3
            ]
        ]
    }
}

Solution

The root cause is that MLflow's model signature and input example have no way to represent the pandas category dtype. This is a known limitation when combining XGBoost's categorical support with MLflow.

The input example is stored as JSON, and requests to the endpoint also arrive as JSON, so MACHINE_SERIAL_NUMBER and RECIPE come back as plain strings (object dtype), even though you cast them to category for training and set enable_categorical=True. XGBoost then rejects the DataFrame, both when MLflow validates the input example during log_model and when your inference endpoint receives a request.
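
The dtype loss can be reproduced without MLflow or XGBoost. The sketch below is only a stand-in: it round-trips a small made-up frame through pandas' own "split" JSON layout (not MLflow's actual serialization code) and checks whether the category dtype survives.

```python
import json

import pandas as pd

# Training-side frame: the two ID columns are explicitly categorical.
train = pd.DataFrame({
    "MACHINE_SERIAL_NUMBER": ["B1241", "B1241"],
    "RECIPE": ["COFFE", "COFFE"],
    "Year": [2024, 2024],
})
train = train.astype({"MACHINE_SERIAL_NUMBER": "category", "RECIPE": "category"})

# Serialize to JSON and rebuild the frame, as a stand-in for what happens
# to a stored input example or to a scoring request.
payload = json.loads(train.to_json(orient="split"))
restored = pd.DataFrame(payload["data"], columns=payload["columns"])

train_is_cat = isinstance(train["RECIPE"].dtype, pd.CategoricalDtype)
restored_is_cat = isinstance(restored["RECIPE"].dtype, pd.CategoricalDtype)
print(train_is_cat, restored_is_cat)  # True False
```

JSON only carries the string values, so whatever consumes the payload has to restore the category dtype itself before calling the model.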

Here's how to fix it:

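# Create a custom pyfunc model that handles the conversion
import mlflow.pyfunc
import pandas as pd
from mlflow.models.signature import infer_signature

CATEGORICAL_COLUMNS = ["MACHINE_SERIAL_NUMBER", "RECIPE"]

class CategoricalXGBoostWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model, categories):
        self.model = model
        # Category levels seen in training, keyed by column name
        self.categories = categories

    def predict(self, context, model_input):
        # Convert string columns to categorical, reusing the training levels so
        # the category codes match the ones the model was fitted on. Values not
        # seen in training become NaN, which XGBoost treats as missing.
        df = pd.DataFrame(model_input).copy()
        for col, levels in self.categories.items():
            df[col] = pd.Categorical(df[col], categories=levels)

        # Make prediction
        return self.model.predict(df)

# Capture the training levels (model and X_train come from the training code)
categories = {col: X_train[col].cat.categories.tolist() for col in CATEGORICAL_COLUMNS}
wrapped_model = CategoricalXGBoostWrapper(model, categories)

# Log a plain-string input example, since that is what the endpoint receives
string_example = input_example.astype({col: str for col in CATEGORICAL_COLUMNS})
signature = infer_signature(string_example, wrapped_model.predict(None, string_example))

# Use this wrapper when logging your model
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="xgb_model",
        python_model=wrapped_model,
        signature=signature,
        input_example=string_example,
        registered_model_name="XGBoost_TimeSeries_Model",
    )

    mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})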
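
One detail worth checking in any wrapper of this kind: calling astype("category") on an incoming batch derives the category levels from that batch alone, so the integer codes may not line up with the ones used in training. In the XGBoost versions I am aware of, the pandas integration has consumed those integer codes, so a mismatch can silently change predictions. The sketch below uses made-up recipe names (the training_recipes list is hypothetical) to compare batch-derived levels with levels pinned to the training set.

```python
import pandas as pd

# Hypothetical category levels captured from the training data.
training_recipes = ["COFFE", "DECAF", "ESPRESSO"]

# A single-row scoring request, as it would arrive from JSON.
request = pd.DataFrame({"RECIPE": ["DECAF"]})

# Levels inferred from the batch: "DECAF" is the only level, so it gets code 0.
naive = request["RECIPE"].astype("category")

# Levels pinned to the training list: "DECAF" keeps its training code 1.
pinned = pd.Categorical(request["RECIPE"], categories=training_recipes)

# A value never seen in training becomes NaN (code -1), not a new level.
unseen = pd.Categorical(["MOCHA"], categories=training_recipes)

naive_codes = naive.cat.codes.tolist()
pinned_codes = pinned.codes.tolist()
unseen_codes = unseen.codes.tolist()
print(naive_codes, pinned_codes, unseen_codes)  # [0] [1] [-1]
```

Storing the training levels alongside the model and rebuilding each column with pd.Categorical(..., categories=levels) keeps the encoding stable regardless of which values a given request happens to contain.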