google-cloud-platform, tensorflow-serving, google-cloud-ai

Issues scaling a TF Serving model on GCP AI Platform


I have deployed an MNIST model on GCP AI Platform with TF Serving and am facing issues scaling it. I would like to know if someone else has run into a similar issue and how they resolved it.

Behavior

  1. If I send 3 requests per second, the model returns predictions correctly on a single core.
  2. If I increase the rate to 1000 requests per second, I get either "code": 403, "message": "Request had insufficient authentication scopes." or javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake

On another model version, I am certain it was working with the Java client, and it still works from the GCP "Test & Use" UI, but it has stopped working since I tried scaling to 1000 requests/sec. This is on an n1-highmem-2 machine. It returns this error:

 "{
  "error": {
    "code": 400,
    "message": "Request contains an invalid argument.",
    "errors": [
      {
        "message": "Request contains an invalid argument.",
        "domain": "global",
        "reason": "badRequest"
      }
    ],
    "status": "INVALID_ARGUMENT"
  }"

One further question: is there any advantage to serving a model with TF Serving on GCP AI Platform versus deploying it on a VM? Thanks for the help.


Solution

  • There is a limit on how many online prediction requests per minute you can send. My hypothesis is that you are surpassing the default limit of 6000 requests per minute: at 1000 requests per second you are attempting 60000 per minute, ten times the quota. Although the error messages are not self-explanatory, they most likely stem from this limit.

    You can confirm this by checking the quotas page in your GCP console and looking for 'Online prediction requests per minute' under the AI Platform Training & Prediction API service. Fortunately, you can request an increase for some of these limits if you need more scaling headroom.

    Regarding the advantages of serving your models through AI Platform, the main one is that you don't have to manage the infrastructure around your VM: it scales automatically as more requests arrive (provided you have set the quota limits your use case needs).
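
    If you need to stay under the current quota while a raise is pending, simple client-side throttling helps. A minimal sketch, assuming a predict_fn callable (hypothetical, e.g. wrapping the predict call shown in the question) and the default 6000 requests/minute limit:

        import time

        # Default online prediction quota (requests per minute);
        # adjust if you have been granted an increase.
        REQUESTS_PER_MINUTE = 6000
        MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE  # ~10 ms between calls

        def throttled_calls(predict_fn, payloads):
            """Yield predictions while pacing calls to stay under the quota."""
            for payload in payloads:
                start = time.monotonic()
                yield predict_fn(payload)
                # Sleep off whatever remains of this request's time slot.
                elapsed = time.monotonic() - start
                if elapsed < MIN_INTERVAL:
                    time.sleep(MIN_INTERVAL - elapsed)

    Also worth noting: since this quota counts requests rather than instances, packing several instances into a single request's "instances" list reduces the request rate without losing throughput.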