I'm working on implementing an image classification model (specifically CLIP, one of the provided Model Garden models) hosted on Google Cloud Vertex AI. Following the included Jupyter Notebook, I was able to upload and deploy the model and perform online predictions with it. However, I'm facing issues when trying to convert the online prediction to a batch prediction, i.e. just performing a batch prediction on an image using this model.
Inside the Jupyter Notebook, this is the code for the online prediction. The input consists of JPG images downloaded from the internet, converted to Base64, and then formatted into an instances array, where each element is an object with an image field and a text field (for the zero-shot classification label).
import base64
from io import BytesIO

def image_to_base64(image, format="JPEG"):
    buffer = BytesIO()
    image.save(buffer, format=format)
    image_str = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return image_str
instances = [
    {"image": image_to_base64(image1), "text": "two cats"},
    {"image": image_to_base64(image2), "text": "a bear"},
]
preds = endpoint.predict(instances=instances).predictions
Following the GCP documentation for performing a batch prediction, I made a JSON Lines file from a JPG image converted to Base64, mimicking the instance formatting of the online prediction.
batch_predict.jsonl
{"image": "<B64_OF_JPG_IMAGE>", "text": "rack"}
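For reference, a minimal sketch of how such a JSONL line can be produced from raw JPEG bytes; the file names and the "rack" label are just placeholders mirroring the example above:

```python
import base64
import json

def instance_line(image_bytes, text):
    # base64-encode the raw JPEG bytes and wrap them in the same
    # {"image": ..., "text": ...} shape used for online predictions
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"image": b64, "text": text})

# Hypothetical usage: read a local JPG and write a one-line JSONL file
# with open("rack.jpg", "rb") as f:
#     jpg_bytes = f.read()
# with open("batch_predict.jsonl", "w") as f:
#     f.write(instance_line(jpg_bytes, "rack") + "\n")
```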
I then made a request against the uploaded model (I deleted the endpoint used in the Jupyter Notebook, since batch predictions only require the model to be uploaded to the Model Registry, not deployed) using the following code, lifted from the documentation:
model.batch_predict(
    job_display_name='test-batch-prediction-job',
    instances_format='jsonl',
    machine_type='n1-standard-8',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1,
    gcs_source='gs://' + GCS_INPUT_BUCKET + '/batch_predict.jsonl',
    gcs_destination_prefix='gs://' + GCS_BUCKET,
    service_account=SERVICE_ACCOUNT,
)
After the batch prediction job finishes, there are two files in the output folder. The prediction results file downloads blank; I expect it is supposed to look something like the example response from a different doc:
{
  "instance": {"content": "gs://bucket/image.jpg", "mimeType": "image/jpeg"},
  "prediction": {
    "ids": [1, 2],
    "displayNames": ["cat", "dog"],
    "confidences": [0.7, 0.5]
  }
}
The errors file contains the following message:
('Post request fails. Cannot get predictions. Error: Exceeded retries: Non-OK result 503 ({\n "code": 503,\n "type": "InternalServerException",\n "message": "Prediction failed"\n}\n) from server, retry=3, ellapsed=0.07s.', 1)
I can't glean much from this, and the logs similarly don't provide much to act on.
The GCP documentation for JSONL files also mentions a slight formatting difference for PyTorch prebuilt containers (I believe this model counts, and I tried both ways), which entails nesting the object as the value of a "data" property:
batch_predict.jsonl
{"data": {"image": "<B64_OF_JPG_IMAGE>", "text": "rack"}}
This also did not work. I've also tried tweaking assorted other knobs: changing the "image" property to "b64", swapping the Base64 for a link to the JPG in a Cloud Storage bucket, making the batch prediction from the Cloud Console, and fiddling with the permissions granted to the service account, all of which result in similar errors.
Is there something wrong with the way I'm formatting my JSONL file for batch predictions?
Could this error be related to the way the model is set up for batch processing on Vertex AI?
Could this be a problem with the model (CLIP) being unsuitable for batch predictions?
Are there specific settings or configurations I should check on GCP Vertex AI for batch predictions with this kind of model?
Having gone through extensive trial and error to make Batch Prediction work for the image models provided in GCP Vertex AI's Model Garden, I think I've pinned down a few key things to check if it isn't working:
A given model may simply not work for Batch Predictions even if it works for Online Predictions; if a model doesn't seem to work after much trial and error, it would be wise to try another model. While it's hard to say whether certain models will never work with Batch Prediction (hard to prove a negative), some models more-or-less just work as expected, while others fail despite dozens of attempts with every configuration variation I could think of. The models I have managed to make successful Batch Predictions with are ImageBind and OWL-ViT (which is an object detection model rather than image classification); the ones I tried but could not make Batch Predictions with include CLIP, Open-CLIP, and TIMM.
Ensure the input data is formatted correctly. Different models accept image data in different forms (e.g. Base64, a Cloud Storage link) and use slightly different field names (e.g. the image field is called "vision" in ImageBind but "image" in OWL-ViT, and this does matter); supplying a model with valid image data formatted in a way it doesn't expect will not work. The rule of thumb is to use the formatting from the tutorial Jupyter Notebook included with the model on Model Garden. For example, the tutorial notebook for ImageBind uses links to images stored in Cloud Storage buckets (and does not accept other links, such as Imgur links or links to public datasets). Its JSONL input file contains objects like this:
{"text": ["car", "cat"], "vision": ["gs://<INPUT_BUCKET_NAME>/cat.jpg"]}
The OWL-ViT tutorial notebook, by contrast, uses images encoded in Base64, formatted as {"text": "cat", "image": "<IMAGE_AS_B64>"}.
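To make the difference concrete, here is a small sketch (not taken from the notebooks themselves) that builds one JSONL line in each model's expected shape; the field names match the examples above, while the URIs and labels are placeholders:

```python
import base64
import json

def imagebind_line(gcs_uri, labels):
    # ImageBind: image given as a Cloud Storage URI in a list under "vision",
    # labels given as a list under "text"
    return json.dumps({"text": labels, "vision": [gcs_uri]})

def owlvit_line(image_bytes, label):
    # OWL-ViT: image given as Base64 under "image", a single label under "text"
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"text": label, "image": b64})
```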
Make sure to attach a Service Account to the Batch Prediction request, both for Batch Predictions made from GCloud SDK commands and from the Cloud Console (it is possible to omit the Service Account). I have never gotten a Batch Prediction to work with the Service Account missing (even when my personal account has more than enough permissions). This is especially noteworthy when using the Cloud Console, as the "Service Account" field is tucked away under "Advanced Options".
Ensure the Service Account has the correct permissions; it will need the "Storage Object Admin" and "Vertex AI User" roles.
Ensure the instance used for the Batch Prediction is powerful enough; if it isn't, the prediction may fail. The instance type used by the tutorial notebook (n1-standard-8 with an NVIDIA_TESLA_T4 accelerator) should be powerful enough and is probably a good benchmark. When I downgraded the instance type to n1-standard-2 as the only change to an ImageBind prediction that had worked on a more powerful instance, it failed. Less powerful instances can still work, but if the predictions are not working at all, it would be wise to stick to a more powerful one and only tinker with the instance type after getting it to work.
Use an appropriate input file format. According to the Batch Prediction documentation, Batch Predictions should be able to accept inputs other than JSONL files; the full set is JSONL (JSON Lines), TFRecord (TensorFlow Record), CSV (Comma-Separated Values), File list, and BigQuery. Even for the models that did work, the only input format that worked for me was JSONL. The other formats may be possible, but I couldn't get them to work, and for most of them it doesn't seem like it would make much sense for an image classification/object detection model anyway. One of the simplest formats is File list; per the documentation, "Create a text file where each row is the Cloud Storage URI to a file. Vertex AI reads the contents of each file as binary, then base64-encodes the instance as JSON object with a single key named b64", which omits the prediction label needed to tell the model what to look for. I tried performing a Batch Prediction using a File list text file containing links to images stored in Cloud Storage, e.g.
gs://path/to/image/image1.jpg
but this input was refused by all models. The errors aren't very specific about what was wrong, but it is reasonable to assume the models cannot make a prediction without the text label telling them what to look for. This probably reflects other use cases whose inputs do fit the other formats (e.g. passing a File list of CSV files containing columns of tabular data to a model that detects patterns in temperature data), just not this one.
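A sketch of what the documented File-list conversion would produce per file makes the problem visible: the only key the model receives is "b64", so there is nowhere for a text label to go. This mimics the behavior described in the documentation, not actual Vertex AI code:

```python
import base64
import json

def file_list_instance(file_bytes):
    # Per the docs: the file is read as binary and base64-encoded
    # into a JSON object with a single key named "b64"
    return json.dumps({"b64": base64.b64encode(file_bytes).decode("utf-8")})
```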