xgboost azure-machine-learning-service categorical-data

Azure ML: XGBoost Categorical Issues


XGBoost categorical issues during time series prediction: categorical variables are being converted back to strings, causing failures. The issue arises during model serialization (log_model). Although enable_categorical=True is set consistently across training, logging, and inference, MLflow does not preserve the categorical dtypes when logging the model. Please suggest a solution.

Details:

Requirement: an XGBoost model for time series prediction without encoding, so categorical variables must be used across training, logging, and inference. We have more than 200 unique combinations of the categorical variables.

The data set contains 4 columns: MACHINE_SERIAL_NUMBER, RECIPE, Date, Volume. Each combination of MACHINE_SERIAL_NUMBER and RECIPE is unique, and the data for each combination is independent of the others.

Issue #1 (during model serialization, log_model):

MACHINE_SERIAL_NUMBER and RECIPE were converted to the category dtype and enable_categorical=True was set before training, so the model was trained on categorical variables. However, these variables are converted back to object during the log_model process, which produces the error: Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object. enable_categorical is set to True, but MLflow does not preserve the categorical dtypes when logging the model.

Issue #2 (during inference, predict on the real-time inference endpoint):

A JSON input is sent to the inference endpoint. The question is: how can categorical values be passed through JSON input? Inference also fails because of Issue #1.

Log for Issue #1, during model serialization (log_model):

WARNING mlflow.utils.requirements_utils: Failed to run predict on input_example, dependencies introduced in predict are not captured. ValueError('DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameterenable_categorical must be set to True. Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object')Traceback (most recent call last):

12:28:55 WARNING mlflow.models.model: Failed to validate serving input example { "dataframe_split": { "columns": [ "MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day" ], "data": [ [ "BTB0001241", "DECAF", 2024, 2, 9 ] ] } }. Alternatively, you can avoid passing input example and pass model signature instead when logging the model. To ensure the input example is valid prior to serving, please try calling mlflow.models.validate_serving_input on the model uri and serving input example. A serving input example can be generated from model input example using mlflow.models.convert_input_example_to_serving_input function.

Log for Issue #2, during inference (predict):

ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameterenable_categorical must be set to True. Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object The above exception was the direct cause of the following exception:

Example of the DataFrame used:

    MACHINE_SERIAL_NUMBER  RECIPE   Volume  Year  Month  Day
    B1241                  COFFE    61.700  2024      4   24
    B1241                  COFFE   216.139  2024      4   25
    B1241                  COFFE   184.455  2024      4   26
    B1241                  COFFE   334.700  2024      4   27
    B1241                  COFFE   152.100  2024      4   28

##### **Training & Conversion code**
    import mlflow
    import mlflow.xgboost
    import pandas as pd
    import numpy as np
    from xgboost import XGBRegressor
    from sklearn.model_selection import train_test_split
    from mlflow.models.signature import infer_signature
    import joblib

    # Load Dataset
    df = pd.read_csv("./data.csv")

    # Convert Date column to datetime
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y") # Ensure correct format

    # Extract numerical date features
    df["Year"] = df["Date"].dt.year.astype(np.int64)
    df["Month"] = df["Date"].dt.month.astype(np.int64)
    df["Day"] = df["Date"].dt.day.astype(np.int64)

    # Drop original Date column
    df.drop(columns=["Date"], inplace=True)

    # Convert categorical columns
    df["MACHINE_SERIAL_NUMBER"] = df["MACHINE_SERIAL_NUMBER"].astype("category")
    df["RECIPE"] = df["RECIPE"].astype("category")

    # Split into X and y
    y = df["Volume"]
    X = df.drop(columns=["Volume"])

    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train XGBoost Model
    model = XGBRegressor(enable_categorical=True, tree_method="hist")
    model.fit(X_train, y_train)
    joblib.dump(model, "model.pkl")

    # Make Predictions
    y_pred = model.predict(X_test)

    # Step 2: Prepare Input Example
    input_example = pd.DataFrame([{
        'MACHINE_SERIAL_NUMBER': 'BTB0001241',
        'RECIPE': 'DECAF',
        'Year': 2024,
        'Month': 2,
        'Day': 9
    }])

    # Convert input_example to match training data types
    input_example["MACHINE_SERIAL_NUMBER"] = input_example["MACHINE_SERIAL_NUMBER"].astype("category")
    input_example["RECIPE"] = input_example["RECIPE"].astype("category")

    # Infer Signature
    signature = infer_signature(input_example, y_pred)

    # MLflow logging code
    with mlflow.start_run():
        mlflow.xgboost.log_model(
            model,
            artifact_path="xgb_model",
            signature=signature,
            input_example=input_example,
            registered_model_name="XGBoost_TimeSeries_Model",
        )
        mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})

Example JSON used during inference:

    {
        "input_data": {
            "columns": [
                "MACHINE_SERIAL_NUMBER",
                "RECIPE",
                "Year",
                "Month",
                "Day"
            ],
            "data": [
                [
                    "BTB0001241",
                    "DECAF",
                    2025,
                    3,
                    3
                ]
            ]
        }
    }

Solution

  • The issue is that the pandas categorical dtype does not survive MLflow's logging and serving path. The input example and every scoring request are passed as JSON, which has no categorical type, so MACHINE_SERIAL_NUMBER and RECIPE arrive as plain strings (object dtype). This is a known limitation when using XGBoost categorical support with MLflow.

    Converting the columns to category for training and setting enable_categorical=True does not help at this point, because XGBoost rejects object columns regardless of that flag. That is why log_model fails to validate the input example, and why your inference endpoint fails when it receives JSON containing these values.
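The dtype loss can be reproduced with pandas alone. The sketch below assumes the example is written out in a split-oriented JSON layout like the `dataframe_split` payload shown in the warning, and checks whether the category dtype is still there after reading it back. The column values are taken from the question; nothing here calls MLflow or XGBoost.

```python
import json

import pandas as pd

# A one-row frame with categorical columns, like the input example in the question
example = pd.DataFrame(
    {"MACHINE_SERIAL_NUMBER": ["BTB0001241"], "RECIPE": ["DECAF"], "Year": [2024]}
)
example["MACHINE_SERIAL_NUMBER"] = example["MACHINE_SERIAL_NUMBER"].astype("category")
example["RECIPE"] = example["RECIPE"].astype("category")

# Serialize to split-oriented JSON and rebuild a frame from the parsed result
payload = json.loads(example.to_json(orient="split"))
restored = pd.DataFrame(payload["data"], columns=payload["columns"])

# The original column is categorical; the restored one holds plain strings
was_categorical = isinstance(example["RECIPE"].dtype, pd.CategoricalDtype)
still_categorical = isinstance(restored["RECIPE"].dtype, pd.CategoricalDtype)
print(was_categorical, still_categorical)
```

The values themselves survive the round trip; only the dtype information is gone, so the cast has to be reapplied after the JSON is parsed.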

    Here's how to fix it:

    # Create a custom pyfunc model that handles the conversion
    import mlflow.pyfunc
    import pandas as pd

    class CategoricalXGBoostWrapper(mlflow.pyfunc.PythonModel):
        def __init__(self, model, categories):
            self.model = model
            # Column name -> list of categories seen during training
            self.categories = categories

        def predict(self, context, model_input):
            # Cast string columns to categorical using the training categories,
            # so the category codes match the ones the model was trained on
            df = pd.DataFrame(model_input)
            for col, cats in self.categories.items():
                df[col] = pd.Categorical(df[col], categories=cats)

            # Make prediction
            return self.model.predict(df)

    # Capture the categories from the training data
    categories = {
        col: list(X_train[col].cat.categories)
        for col in ["MACHINE_SERIAL_NUMBER", "RECIPE"]
    }

    # Use this wrapper when logging your model
    with mlflow.start_run():
        # Log the model using custom wrapper
        wrapped_model = CategoricalXGBoostWrapper(model, categories)

        mlflow.pyfunc.log_model(
            artifact_path="xgb_model",
            python_model=wrapped_model,
            signature=signature,
            input_example=input_example,
            registered_model_name="XGBoost_TimeSeries_Model",
        )

        mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})
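
On the second question: JSON has no categorical type, so the request simply carries the values as plain strings and the cast happens on the server side. One caveat worth checking is that a bare `.astype("category")` builds its categories from the request alone, so the integer codes can differ from the ones used in training, and depending on the XGBoost version those codes may be what the model actually consumes. The following is a minimal sketch, assuming the category lists are saved at training time. `TRAIN_CATEGORIES` and `payload_to_frame` are hypothetical names, and the category values are illustrative only.

```python
import pandas as pd

# Assumption: category lists captured at training time, for example from
# df["RECIPE"].cat.categories. The values below are illustrative only.
TRAIN_CATEGORIES = {
    "MACHINE_SERIAL_NUMBER": ["B1241", "BTB0001241"],
    "RECIPE": ["COFFE", "DECAF"],
}


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Hypothetical helper: turn a request body into a frame with training categories."""
    body = payload["input_data"]
    frame = pd.DataFrame(body["data"], columns=body["columns"])
    for column, categories in TRAIN_CATEGORIES.items():
        # Values missing from the training categories become NaN (code -1)
        frame[column] = pd.Categorical(frame[column], categories=categories)
    return frame


# Same shape as the example JSON in the question: categorical values are plain strings
request = {
    "input_data": {
        "columns": ["MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day"],
        "data": [["BTB0001241", "DECAF", 2025, 3, 3]],
    }
}
frame = payload_to_frame(request)

# Code assigned when the request is cast on its own, ignoring training
naive_code = int(pd.Series(["DECAF"]).astype("category").cat.codes.iloc[0])
# Code assigned when the training categories are reused
pinned_code = int(frame["RECIPE"].cat.codes.iloc[0])
print(naive_code, pinned_code)
```

With the categories fixed, a value that never appeared in training shows up as missing instead of silently taking over another category's code.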