I observed an issue where categorical variables are being converted back to strings, causing failures. The issue arises during model serialization (log_model), even though enable_categorical=True is set consistently across training, logging, and inference. MLflow is unable to correctly preserve categorical dtypes when logging the model. Please suggest a solution.
Requirement:
XGBoost model for time series prediction without encoding. We should use categorical variables across training, logging, and inference. We have more than 200 unique combinations of category variables.
Dataset:
Contains 4 columns:
MACHINE_SERIAL_NUMBER
RECIPE
Date
Volume
The grouping of MACHINE_SERIAL_NUMBER & RECIPE is unique. The dataset is completely unique and independent for each grouping of MACHINE_SERIAL_NUMBER & RECIPE.
Issue #1 (during model serialization log_model):
The columns MACHINE_SERIAL_NUMBER and RECIPE are converted to categorical variables and enable_categorical=True is set before training. However, these variables are being converted back to object dtype during the log_model process. This results in the error:
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns: MACHINE_SERIAL_NUMBER: object, RECIPE: object
Issue #2 (during inference - real time endpoint):
When a JSON input hits the inference endpoint, the question is how to pass categorical values through JSON input. This also fails due to Issue #1.
Log file for Error #1 during model serialization (log_model):
WARNING mlflow.utils.requirements_utils: Failed to run predict on input_example, dependencies introduced in predict are not captured. ValueError('DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns:MACHINE_SERIAL_NUMBER: object, RECIPE: object')
Traceback (most recent call last):
12:28:55 WARNING mlflow.models.model: Failed to validate serving input example { "dataframe_split": { "columns": [ "MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day" ], "data": [ [ "BTB0001241", "DECAF", 2024, 2, 9 ] ] } }. Alternatively, you can avoid passing input example and pass model signature instead when logging the model. To ensure the input example is valid prior to serving, please try calling `mlflow.models.validate_serving_input` on the model uri and serving input example. A serving input example can be generated from model input example using `mlflow.models.convert_input_example_to_serving_input` function.
Issue #2 Details (during inference predict):
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When categorical type is supplied, the experimental DMatrix parameter enable_categorical must be set to True. Invalid columns: MACHINE_SERIAL_NUMBER: object, RECIPE: object
The above exception was the direct cause of the following exception:
Example of DataFrame used:
MACHINE_SERIAL_NUMBER | RECIPE | Volume | Year | Month | Day |
---|---|---|---|---|---|
B1241 | COFFE | 61.700 | 2024 | 4 | 24 |
B1241 | COFFE | 216.139 | 2024 | 4 | 25 |
B1241 | COFFE | 184.455 | 2024 | 4 | 26 |
B1241 | COFFE | 334.700 | 2024 | 4 | 27 |
B1241 | COFFE | 152.100 | 2024 | 4 | 28 |
import mlflow
import mlflow.xgboost
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from mlflow.models.signature import infer_signature
import joblib
# Load Dataset
df = pd.read_csv("./data.csv")
# Convert Date column to datetime
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y") # Ensure correct format
# Extract numerical date features
df["Year"] = df["Date"].dt.year.astype(np.int64)
df["Month"] = df["Date"].dt.month.astype(np.int64)
df["Day"] = df["Date"].dt.day.astype(np.int64)
# Drop original Date column
df.drop(columns=["Date"], inplace=True)
# Convert categorical columns
df["MACHINE_SERIAL_NUMBER"] = df["MACHINE_SERIAL_NUMBER"].astype("category")
df["RECIPE"] = df["RECIPE"].astype("category")
# Split into X and y
y = df["Volume"]
X = df.drop(columns=["Volume"])
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train XGBoost Model
model = XGBRegressor(enable_categorical=True, tree_method="hist")
model.fit(X_train, y_train)
joblib.dump(model, "model.pkl")
# Make Predictions
y_pred = model.predict(X_test)
# Prepare Input Example
input_example = pd.DataFrame([{
    'MACHINE_SERIAL_NUMBER': 'BTB0001241',
    'RECIPE': 'DECAF',
    'Year': 2024,
    'Month': 2,
    'Day': 9
}])
# Convert input_example to match training data types
input_example["MACHINE_SERIAL_NUMBER"] = input_example["MACHINE_SERIAL_NUMBER"].astype("category")
input_example["RECIPE"] = input_example["RECIPE"].astype("category")
# Infer Signature
signature = infer_signature(input_example, y_pred)
# MLflow Logging Code
with mlflow.start_run():
    mlflow.xgboost.log_model(
        model,
        artifact_path="xgb_model",
        signature=signature,
        input_example=input_example,
        registered_model_name="XGBoost_TimeSeries_Model",
    )
    mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})
{
  "input_data": {
    "columns": ["MACHINE_SERIAL_NUMBER", "RECIPE", "Year", "Month", "Day"],
    "data": [
      ["BTB0001241", "DECAF", 2025, 3, 3]
    ]
  }
}
The issue is that MLflow doesn't preserve pandas categorical dtypes when serializing XGBoost models. This is a known limitation with XGBoost and MLflow.
The categorical variables (MACHINE_SERIAL_NUMBER and RECIPE) are being converted back to object types during MLflow's model logging, even though you set them as categorical during training and set enable_categorical=True. Because of this, your inference endpoint fails when receiving JSON with these categorical variables.
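The dtype loss can be reproduced without MLflow at all. The sketch below is only an illustration, assuming the input example and request payloads travel as JSON in the split orientation that appears in the warning log; `roundtrip_through_json` is a hypothetical helper name, not an MLflow function.

```python
import json

import pandas as pd


def roundtrip_through_json(df: pd.DataFrame) -> pd.DataFrame:
    """Serialize a frame to split-oriented JSON and rebuild it from the payload."""
    payload = json.loads(df.to_json(orient="split"))
    return pd.DataFrame(payload["data"], columns=payload["columns"])


# A small frame with one categorical column, mirroring the RECIPE column
original = pd.DataFrame({"RECIPE": ["DECAF", "COFFE"], "Day": [9, 10]})
original["RECIPE"] = original["RECIPE"].astype("category")

restored = roundtrip_through_json(original)

print(original["RECIPE"].dtype)  # categorical before the round trip
print(restored["RECIPE"].dtype)  # plain strings afterwards, no longer categorical
```

JSON has no way to mark a value as categorical, so the receiving side has to cast the columns again before the frame reaches XGBoost.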
Here's how to fix it:
# Create a custom pyfunc model that handles the conversion
import mlflow.pyfunc
import pandas as pd

class CategoricalXGBoostWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model, categories):
        self.model = model
        # Category levels seen during training, keyed by column name
        self.categories = categories

    def predict(self, context, model_input):
        df = pd.DataFrame(model_input).copy()
        # Convert string columns to categorical using the training levels,
        # so the category codes match what the model saw during fit
        for col, levels in self.categories.items():
            df[col] = pd.Categorical(df[col], categories=levels)
        # Make prediction
        return self.model.predict(df)

# Capture the category levels from the training data
categories = {
    col: list(X_train[col].cat.categories)
    for col in ["MACHINE_SERIAL_NUMBER", "RECIPE"]
}

# Use this wrapper when logging your model
with mlflow.start_run():
    # Log the model using custom wrapper
    wrapped_model = CategoricalXGBoostWrapper(model, categories)
    mlflow.pyfunc.log_model(
        artifact_path="xgb_model",
        python_model=wrapped_model,
        signature=signature,
        input_example=input_example,
        registered_model_name="XGBoost_TimeSeries_Model",
    )
    mlflow.log_params({"enable_categorical": True, "tree_method": "hist"})
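One detail worth checking in any wrapper like this: calling astype("category") on a request builds the category levels from that request alone, so the integer codes may not line up with the ones used during training. The following is a minimal sketch with made-up recipe names ("AMERICANO" is hypothetical and not from the dataset above), comparing a naive cast against a cast that reuses the training levels.

```python
import pandas as pd

# Levels as they might look after training (hypothetical values)
train = pd.Series(["AMERICANO", "COFFE", "DECAF"], dtype="category")

# A single-row request, as it would arrive from JSON (plain strings)
request = pd.Series(["DECAF"])

# Naive cast: the only level is "DECAF", so it gets code 0
naive = request.astype("category")

# Aligned cast: reuse the training levels so "DECAF" keeps its training code
aligned = pd.Series(pd.Categorical(request, categories=train.cat.categories))

print(naive.cat.codes.tolist())    # [0]
print(aligned.cat.codes.tolist())  # [2]
```

Storing the training levels alongside the model and reusing them at inference keeps the encoding consistent. With this approach, a value that was never seen in training becomes NaN after the cast, which makes such requests easy to detect.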