XGBoost categorical issues during time series prediction: Categorical variables are being converted back to strings, causing failures. The issue arises during model serialization (log_model). Although enable_categorical=True is set consistently across training, logging, and inference, MLflow does not preserve the categorical dtypes when logging the model. Please suggest a solution.
Details:
The requirement is an XGBoost model for time series prediction without encoding, so we need to use category variables across training, logging, and inference. We have more than 200 unique combinations of the category variables.
The dataset contains 4 columns: MACHINE_SERIAL_NUMBER, RECIPE, Date, Volume. Each combination of MACHINE_SERIAL_NUMBER and RECIPE is unique, and the data for each combination is independent of the others.
Issue #1 (arises during model serialization, log_model):
I converted MACHINE_SERIAL_NUMBER and RECIPE to the category dtype and set enable_categorical=True before training, so the model was trained on categorical variables. However, these variables are converted back to object during the log_model process, so I get the error: 'Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object'. enable_categorical is set to True, but MLflow does not preserve the categorical dtypes when logging the model.
Issue #2 (during inference, predict on the real-time inference endpoint):
A JSON input hits the inference endpoint. The question is: how do I pass categorical values through JSON input? It also fails because of Issue #1.
Log file for Error #1 during model serialization (log_model)
WARNING mlflow.utils.requirements_utils: Failed to run predict on input_example, dependencies introduced in predict are not captured. ValueError('DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object')Traceback (most recent call last):
12:28:55 WARNING mlflow.models.model: Failed to validate serving input example { "dataframe_split": { "columns": [ "MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day" ], "data": [ [ "BTB0001241", "DECAF", 2024, 2, 9 ] ] } }. Alternatively, you can avoid passing input example and pass model signature instead when logging the model. To ensure the input example is valid prior to serving, please try calling mlflow.models.validate_serving_input on the model uri and serving input example. A serving input example can be generated from model input example using mlflow.models.convert_input_example_to_serving_input function.
Issue #2: Details during inference (predict):
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object
The above exception was the direct cause of the following exception:
Example of the DataFrame used:
MACHINE_SERIAL_NUMBER  RECIPE  Volume   Year  Month  Day
B1241                  COFFE   61.700   2024  4      24
B1241                  COFFE   216.139  2024  4      25
B1241                  COFFE   184.455  2024  4      26
B1241                  COFFE   334.700  2024  4      27
B1241                  COFFE   152.100  2024  4      28
##### **Training & Conversion code**
import mlflow
import mlflow.xgboost
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from mlflow.models.signature import infer_signature
import joblib
# Load Dataset
df = pd.read_csv("./data.csv")
# Convert Date column to datetime
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y") # Ensure correct format
# Extract numerical date features
df["Year"] = df["Date"].dt.year.astype(np.int64)
df["Month"] = df["Date"].dt.month.astype(np.int64)
df["Day"] = df["Date"].dt.day.astype(np.int64)
# Drop original Date column
df.drop(columns=["Date"], inplace=True)
# Convert categorical columns
df["MACHINE_SERIAL_NUMBER"] = df["MACHINE_SERIAL_NUMBER"].astype("category")
df["RECIPE"] = df["RECIPE"].astype("category")
# Split into X and y
y = df["Volume"]
X = df.drop(columns=["Volume"])
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train XGBoost Model
model = XGBRegressor(enable_categorical=True, tree_method="hist")
model.fit(X_train, y_train)
joblib.dump(model, "model.pkl")
# Make Predictions
y_pred = model.predict(X_test)
# Step 2: Prepare Input Example
input_example = pd.DataFrame([{
    'MACHINE_SERIAL_NUMBER': 'BTB0001241',
    'RECIPE': 'DECAF',
    'Year': 2024,
    'Month': 2,
    'Day': 9
}])
# Convert input_example to match training data types
input_example["MACHINE_SERIAL_NUMBER"] = input_example["MACHINE_SERIAL_NUMBER"].astype("category")
input_example["RECIPE"] = input_example["RECIPE"].astype("category")
# Infer Signature
signature = infer_signature(input_example, y_pred)
##### **MLflow Logging Code**
with mlflow.start_run():
    mlflow.xgboost.log_model(
        model,
        artifact_path="xgb_model",
        signature=signature,
        input_example=input_example,
        registered_model_name="XGBoost_TimeSeries_Model",
    )
    mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})
##### **Example JSON used during inference**
{
  "input_data": {
    "columns": [
      "MACHINE_SERIAL_NUMBER",
      "RECIPE",
      "Year",
      "Month",
      "Day"
    ],
    "data": [
      [
        "BTB0001241",
        "DECAF",
        2025,
        3,
        3
      ]
    ]
  }
}
The issue is that MLflow doesn't preserve pandas categorical dtypes when serializing XGBoost models. MLflow's signature schema has no categorical column type, and both the stored input example and the requests sent to a serving endpoint are JSON, which carries these values as plain strings.
The categorical variables (MACHINE_SERIAL_NUMBER and RECIPE) therefore come back as object columns when MLflow validates the input example during model logging, even though you converted them to category for training and set enable_categorical=True. For the same reason, your inference endpoint fails when it receives JSON containing these variables: XGBoost is handed object columns and rejects them.
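The dtype loss can be reproduced with pandas alone, without MLflow or XGBoost. The snippet below is a minimal sketch, assuming a one-row frame similar to the input example in the question (the variable names are illustrative): it sends the frame through a JSON round trip and compares dtypes before and after.

```python
import json

import pandas as pd

# Hypothetical one-row frame mirroring the input example from the question.
example = pd.DataFrame({
    "MACHINE_SERIAL_NUMBER": pd.Series(["BTB0001241"], dtype="category"),
    "RECIPE": pd.Series(["DECAF"], dtype="category"),
    "Year": [2024],
})

# Serialize to JSON and rebuild the frame, roughly what happens between
# a client and a serving endpoint.
split = json.loads(example.to_json(orient="split"))
restored = pd.DataFrame(split["data"], columns=split["columns"])

print(example.dtypes["RECIPE"])   # category
print(restored.dtypes["RECIPE"])  # a plain string/object dtype, not category
```

JSON has no notion of a categorical value, so the only thing that survives is the string itself. Any conversion back to category has to happen on the receiving side.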
Here's how to fix:
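# Create a custom pyfunc model that handles the conversion
import mlflow
import mlflow.pyfunc
import pandas as pd
from mlflow.models.signature import infer_signature

CATEGORICAL_COLUMNS = ["MACHINE_SERIAL_NUMBER", "RECIPE"]

class CategoricalXGBoostWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model, categories):
        self.model = model
        # Categories seen during training, keyed by column name
        self.categories = categories

    def predict(self, context, model_input):
        # Convert string columns to categorical using the training categories,
        # so each value gets the same integer code it had during training.
        # Values that were not seen during training become NaN.
        df = pd.DataFrame(model_input)
        for col, cats in self.categories.items():
            df[col] = pd.Categorical(df[col], categories=cats)
        # Make prediction
        return self.model.predict(df)

# Capture the training categories from X_train
categories = {col: X_train[col].cat.categories for col in CATEGORICAL_COLUMNS}

# Describe the input as plain strings, since that is what JSON requests contain
raw_example = input_example.astype({col: str for col in CATEGORICAL_COLUMNS})
signature = infer_signature(raw_example, y_pred)

# Use this wrapper when logging your model
with mlflow.start_run():
    # Log the model using custom wrapper
    wrapped_model = CategoricalXGBoostWrapper(model, categories)
    mlflow.pyfunc.log_model(
        artifact_path="xgb_model",
        python_model=wrapped_model,
        signature=signature,
        input_example=raw_example,
        registered_model_name="XGBoost_TimeSeries_Model",
    )
    mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})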
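When converting at inference time, it matters which categories are used. Calling astype("category") on an incoming request derives the categories from that request alone, so a value can end up with a different integer code than it had in the training data, and depending on the XGBoost version the model may consume those codes directly. The sketch below illustrates the difference with pandas only. It assumes the request body has the input_data shape shown in the question; TRAIN_CATEGORIES and frame_from_payload are hypothetical names, and the category lists are hard-coded for illustration where in practice they would be taken from the training frame.

```python
import json

import pandas as pd

# Hypothetical category lists standing in for the values seen during training.
TRAIN_CATEGORIES = {
    "MACHINE_SERIAL_NUMBER": ["B1241", "BTB0001241"],
    "RECIPE": ["COFFE", "DECAF"],
}


def frame_from_payload(body: str) -> pd.DataFrame:
    """Parse a request body and restore categorical dtypes with fixed categories."""
    content = json.loads(body)["input_data"]
    df = pd.DataFrame(content["data"], columns=content["columns"])
    for col, cats in TRAIN_CATEGORIES.items():
        # Values missing from the training categories become NaN.
        df[col] = pd.Categorical(df[col], categories=cats)
    return df


body = json.dumps({
    "input_data": {
        "columns": ["MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day"],
        "data": [["BTB0001241", "DECAF", 2025, 3, 3]],
    }
})
df = frame_from_payload(body)

# Categories inferred from the single request: "DECAF" is the only category.
naive_code = pd.Series(["DECAF"]).astype("category").cat.codes[0]
# Categories pinned to the training lists: "DECAF" keeps its training position.
aligned_code = df["RECIPE"].cat.codes[0]

print(naive_code, aligned_code)  # 0 1
```

The client still sends ordinary strings in the JSON body. The mapping from string to category lives entirely on the server side, next to the model.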