google-cloud-vertex-ai, llama

Sample request JSON for Vertex AI endpoint?


I've deployed the Llama 3 model using the Deploy button on the Vertex AI Model Garden Llama 3 card: https://pantheon.corp.google.com/vertex-ai/publishers/meta/model-garden/llama3

I can make a request using the "Try out Llama 3" side panel on that page, and it seems to be working with my deployed model and endpoint. I'd like to try making a request using curl or Python next. The endpoint UI page also has a "sample request" feature, but it's generic rather than customized to the model, so it isn't much help.

So does anyone have an example request (for this model or another)?

Specifically, I'm looking for the JSON instances and parameters. The parameters I can probably figure out, but I have no idea what an instance is in this context. This seems like the closest related question: Sending http request Google Vertex AI end point

Google Cloud loves naming something generically, giving few details on what it is, and then expecting something very specific as a value.

Edit: I found the docs for this GCP method: https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/predict

which gives some description, but "The instances that are the input to the prediction call." is not really that helpful.


Solution

  • Apologies for the poor experience. For now, the best reference is the notebook.

    Here's the relevant snippet:

    prompt = "What is a car?"  # @param {type: "string"}
    max_tokens = 50  # @param {type:"integer"}
    temperature = 1.0  # @param {type:"number"}
    top_p = 1.0  # @param {type:"number"}
    top_k = 1.0  # @param {type:"number"}
    raw_response = False  # @param {type:"boolean"}
    
    # Overrides parameters for inference.
    # If you encounter an error like `ServiceUnavailable: 503 Took too long to respond when processing`,
    # reduce the max length, e.g. set max_tokens to 20.
    instances = [
        {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "raw_response": raw_response
        }
    ]
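
    To actually call the endpoint from Python, here is a minimal sketch using the google-cloud-aiplatform SDK; the project ID, region, and endpoint ID below are placeholders for your own deployment:

    from google.cloud import aiplatform

    # Placeholders: substitute your own project, region, and endpoint ID
    # (shown on the endpoint's page in the Vertex AI console).
    aiplatform.init(project="your-project-id", location="us-central1")
    endpoint = aiplatform.Endpoint("your-endpoint-id")

    # `instances` is the list built above; predict() wraps it into the
    # {"instances": [...]} request body for you.
    response = endpoint.predict(instances=instances)
    print(response.predictions)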
    

    But please note that the full JSON (e.g. to send using curl) is:

    {
      "instances": [
        {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "raw_response": raw_response
        }
      ]
    }
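
    And a curl sketch of the same predict call, following the REST method linked in the question; PROJECT_ID, REGION, and ENDPOINT_ID are placeholders, and the values mirror the defaults from the snippet above:

    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      "https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID:predict" \
      -d '{
        "instances": [
          {
            "prompt": "What is a car?",
            "max_tokens": 50,
            "temperature": 1.0,
            "top_p": 1.0,
            "top_k": 1.0,
            "raw_response": false
          }
        ]
      }'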