xgboost, google-cloud-ml

How to retain Entity Identifier with Batch Prediction of XGBoost Model in Vertex AI


I am wondering how to match predictions back to the original entities after running a batch prediction with an XGBoost model trained via Custom Training on Prebuilt Images.

When kicking off a BatchPredictionJob, the input is expected to be of the form

input_1,input_2,input_3
0.1,0.2,0.3
0.4,0.5,0.6
...

for CSV, or

[0.1,0.2,0.3]
[0.4,0.5,0.6]
...

for JSONL, with the output predictions:

{"instance":[0.1,0.2,0.3], "prediction":0.0345}
...
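
For concreteness, the JSONL input above can be produced along these lines. This is only a minimal sketch: the DataFrame, column names, and file name are hypothetical, and the entity identifier has to be dropped because every value in an instance is treated as a model feature.

import json

import pandas as pd

# Hypothetical source data: an entity identifier plus the model features.
df = pd.DataFrame({
    "entity_id": ["a", "b"],
    "input_1": [0.1, 0.4],
    "input_2": [0.2, 0.5],
    "input_3": [0.3, 0.6],
})

feature_cols = ["input_1", "input_2", "input_3"]

# Each JSONL line is a bare JSON array of feature values; the entity_id
# column cannot be included without it being fed to the model.
with open("instances.jsonl", "w") as f:
    for _, row in df.iterrows():
        f.write(json.dumps([float(row[c]) for c in feature_cols]) + "\n")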

The output predictions then just contain these instances of input values, with no indication of how to map them back to the original entity. As the batch prediction is distributed, I do not believe I can rely on the file ordering. Does anyone have a method for doing this?


Solution

  • Running batch predictions on a model uses distributed processing, which means the data is spread across an arbitrary cluster of virtual machines and processed in an unpredictable order.

    In AI Platform, an instance key needs to be defined in order to match the returned batch predictions with the input instances. In Vertex AI, however, this feature has not been documented.

    Since using instance keys with the prebuilt XGBoost container image on custom-trained models is not mentioned in the Vertex AI docs, this has been raised in this issue tracker. We cannot provide an ETA at this moment, but you can follow the progress in the issue tracker and ‘STAR’ the issue to receive automatic updates and give it traction by referring to this link.

    In Vertex AI the batch prediction outputs are not ordered; a feature request has been raised for this, and you can track updates on it from this link. In the meantime, a workaround based on the instance values echoed in the output is sketched below.
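
    As a practical workaround (not an official Vertex AI feature), the output records echo the full instance next to each prediction, so the entity can be recovered by joining on the feature values themselves, provided each feature vector is unique. Below is a minimal sketch under that assumption; the file paths and column names are hypothetical, and the output file pattern depends on how the job was configured.

    import glob
    import json

    import pandas as pd

    FEATURE_COLS = ["input_1", "input_2", "input_3"]

    # Original data, including the entity identifier that was stripped from
    # the batch prediction input (hypothetical file and column names).
    entities = pd.read_csv("entities_with_features.csv")

    # Key each entity by its feature vector. Rounding guards against float
    # round-trip noise; if feature vectors are not unique per entity, this
    # join is ambiguous and a different approach is needed.
    def key(values):
        return tuple(round(float(v), 6) for v in values)

    lookup = {key(row[FEATURE_COLS]): row["entity_id"] for _, row in entities.iterrows()}

    # The batch prediction output echoes each instance next to its prediction,
    # so predictions can be mapped back through the lookup table.
    records = []
    for path in glob.glob("prediction_output/prediction.results-*"):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                records.append({
                    "entity_id": lookup.get(key(rec["instance"])),
                    "prediction": rec["prediction"],
                })

    predictions = pd.DataFrame(records)
    print(predictions.head())

    The key point is that this join relies only on the echoed instance values, not on any ordering of the output files, so it is unaffected by the distributed processing described above.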