xgboost  azure-machine-learning-service  categorical-data

Azure ML: XGBoost Categorical Issues


XGBoost Categorical Issues during Time Series Prediction

I'm seeing an issue where categorical variables are converted back to strings, which causes inference to fail. The problem appears around model serialization (log_model), even though enable_categorical=True is set consistently across training, logging, and inference. MLflow does not seem to preserve the categorical dtypes when the model is logged. Please suggest a solution.


Details


Training & Conversion Code

import mlflow
import mlflow.xgboost
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from mlflow.models.signature import infer_signature
import joblib

# Load Dataset
df = pd.read_csv("./data.csv")

# Convert Date column to datetime
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")  # Ensure correct format

# Extract numerical date features
df["Year"] = df["Date"].dt.year.astype(np.int64)
df["Month"] = df["Date"].dt.month.astype(np.int64)
df["Day"] = df["Date"].dt.day.astype(np.int64)

# Drop original Date column
df.drop(columns=["Date"], inplace=True)

# Convert categorical columns
df["MACHINE_SERIAL_NUMBER"] = df["MACHINE_SERIAL_NUMBER"].astype("category")
df["RECIPE"] = df["RECIPE"].astype("category")

# Split into X and y
y = df["Volume"]
X = df.drop(columns=["Volume"])

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Model
model = XGBRegressor(enable_categorical=True, tree_method="hist")
model.fit(X_train, y_train)
joblib.dump(model, "model.pkl")

# Make Predictions
y_pred = model.predict(X_test)

# Prepare Input Example
input_example = pd.DataFrame([{
    'MACHINE_SERIAL_NUMBER': 'BTB0001241',
    'RECIPE': 'DECAF',
    'Year': 2024,
    'Month': 2,
    'Day': 9
}])

# Convert input_example to match training data types
input_example["MACHINE_SERIAL_NUMBER"] = input_example["MACHINE_SERIAL_NUMBER"].astype("category")
input_example["RECIPE"] = input_example["RECIPE"].astype("category")

# Infer Signature
signature = infer_signature(input_example, y_pred)

# MLflow Logging Code
with mlflow.start_run():
    mlflow.xgboost.log_model(
        model,
        artifact_path="xgb_model",
        signature=signature,
        input_example=input_example,
        registered_model_name="XGBoost_TimeSeries_Model",
    )
    mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})

Example JSON used during Inference

{
    "input_data": {
        "columns": [
            "MACHINE_SERIAL_NUMBER",
            "RECIPE",
            "Year",
            "Month",
            "Day"
        ],
        "data": [
            [
                "BTB0001241",
                "DECAF",
                2025,
                3,
                3
            ]
        ]
    }
}
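
To make the failure mode concrete, here is a minimal sketch of what happens to a payload like the one above once it is parsed. It assumes the scoring side builds a DataFrame from the "columns" and "data" fields with plain pandas; the actual Azure ML / MLflow scoring code is not shown in this thread, so treat this as an approximation. The names payload, scoring_df, and categorical_flags are illustrative only.

```python
import json

import pandas as pd

# Same shape as the inference JSON above, condensed onto fewer lines.
payload = """
{
    "input_data": {
        "columns": ["MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day"],
        "data": [["BTB0001241", "DECAF", 2025, 3, 3]]
    }
}
"""

# Approximation of the scoring side: parse the JSON and build a DataFrame.
input_data = json.loads(payload)["input_data"]
scoring_df = pd.DataFrame(input_data["data"], columns=input_data["columns"])

# JSON has no categorical type, so the two ID columns arrive as plain strings.
categorical_flags = {
    column: isinstance(dtype, pd.CategoricalDtype)
    for column, dtype in scoring_df.dtypes.items()
}

print(scoring_df.dtypes)
print(categorical_flags)
```

None of the columns comes back as a category, which is why a model trained with enable_categorical=True rejects the request unless something casts MACHINE_SERIAL_NUMBER and RECIPE back first.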

Solution

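  • The issue is that the pandas categorical dtype does not survive the trip through MLflow to your endpoint. MLflow model signatures have no categorical type, and the JSON request carries MACHINE_SERIAL_NUMBER and RECIPE as plain strings. This is a known limitation when serving XGBoost models with enable_categorical=True through MLflow.

    As a result, the categorical variables (MACHINE_SERIAL_NUMBER and RECIPE) reach the model as string (object) columns, even though you cast them to category before training and set enable_categorical=True. XGBoost expects int, float, bool, or category columns, so your inference endpoint fails when it receives JSON with these variables.

    Here's how to fix it: wrap the model in a custom pyfunc that casts the string columns back to categorical before predicting.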

    # Create a custom pyfunc model that restores the categorical dtypes
    import mlflow.pyfunc
    import pandas as pd

    CATEGORICAL_COLUMNS = ["MACHINE_SERIAL_NUMBER", "RECIPE"]

    class CategoricalXGBoostWrapper(mlflow.pyfunc.PythonModel):
        def __init__(self, model, category_dtypes):
            self.model = model
            # Training-time CategoricalDtype per column, so inference
            # uses the same category codes as training
            self.category_dtypes = category_dtypes

        def predict(self, context, model_input):
            # Cast the string columns back to the training categories.
            # Values not seen during training become missing (NaN).
            df = pd.DataFrame(model_input).copy()
            for column, dtype in self.category_dtypes.items():
                df[column] = df[column].astype(dtype)

            # Make prediction
            return self.model.predict(df)

    # Use this wrapper when logging your model
    with mlflow.start_run():
        # Capture the training categories and wrap the trained model
        category_dtypes = {c: X_train[c].dtype for c in CATEGORICAL_COLUMNS}
        wrapped_model = CategoricalXGBoostWrapper(model, category_dtypes)

        # Log the wrapper instead of the raw XGBoost model
        mlflow.pyfunc.log_model(
            artifact_path="xgb_model",
            python_model=wrapped_model,
            signature=signature,
            input_example=input_example,
            registered_model_name="XGBoost_TimeSeries_Model",
        )

        mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})
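
    Whichever wrapper is used, the categories applied at inference need to be the ones seen during training. Depending on the XGBoost version, the model may rely on the integer codes pandas assigns to each category, and a bare astype("category") on a small inference batch rebuilds those codes from the batch alone. The sketch below uses made-up recipe values (only DECAF appears in the question) to show the mismatch and one way to pin the categories with pd.CategoricalDtype; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical training column with three recipes (values are illustrative).
train_recipe = pd.Series(["DECAF", "ESPRESSO", "LATTE"]).astype("category")
train_dtype = train_recipe.dtype  # CategoricalDtype carrying the training categories

# A single-row inference batch, as it would arrive from JSON.
batch_recipe = pd.Series(["LATTE"])

# Naive cast: categories are rebuilt from the batch, so "LATTE" gets code 0.
naive_code = batch_recipe.astype("category").cat.codes.iloc[0]

# Pinned cast: reuse the training dtype, so "LATTE" keeps its training code 2.
pinned_code = batch_recipe.astype(train_dtype).cat.codes.iloc[0]

# Values never seen in training become missing (code -1) under the pinned dtype.
unseen_code = pd.Series(["MOCHA"]).astype(train_dtype).cat.codes.iloc[0]

print(naive_code, pinned_code, unseen_code)
```

    Saving the training dtypes (for example X_train["RECIPE"].dtype) alongside the model and casting with them at inference keeps the codes aligned. It is worth deciding up front how unseen values should be handled, since they silently turn into missing values.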