I've deployed the Llama 3 model using the Deploy button on the Vertex AI model garden Llama 3 card: https://pantheon.corp.google.com/vertex-ai/publishers/meta/model-garden/llama3
I can make a request using the "Try out Llama 3" side panel on that page, and it seems to be working with my deployed model and endpoint. Next I'd like to try making a request with curl or Python. The endpoint UI page also has a "sample request" feature, but it's generic rather than customized to this model, so it isn't much help.
So does anyone have an example request (for this model or another)?
Specifically for the JSON instances and parameters fields. I can probably figure out the parameters, but I have absolutely no idea what an "instance" is in this context. This seems like the closest related question: Sending http request Google Vertex AI end point
Google Cloud loves naming things generically, not giving many details on what they are, and then expecting something very specific as a value.
edit: Found the docs on this GCP method: https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.endpoints/predict
which gives some description, but "The instances that are the input to the prediction call." is not really that helpful.
Apologies for the poor experience. For now, the best reference is the notebook.
Here's the relevant snippet:
prompt = "What is a car?" # @param {type: "string"}
max_tokens = 50 # @param {type:"integer"}
temperature = 1.0 # @param {type:"number"}
top_p = 1.0 # @param {type:"number"}
top_k = 1.0 # @param {type:"number"}
raw_response = False # @param {type:"boolean"}
# Overrides parameters for inference.
# If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`,
# you can reduce the max length, e.g. set max_tokens to 20.
instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    }
]
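In the notebook, those instances are then sent to the deployed endpoint via the Vertex AI Python SDK. A minimal sketch of that call (the project ID, region, and endpoint ID below are placeholders; substitute the values from your own deployment):

from google.cloud import aiplatform

# Placeholders: your own project ID, region, and the numeric endpoint ID
# shown on the endpoint's page in the console.
aiplatform.init(project="YOUR_PROJECT_ID", location="us-central1")
endpoint = aiplatform.Endpoint("YOUR_ENDPOINT_ID")

# Sends the instances defined above to the deployed model.
response = endpoint.predict(instances=instances)
print(response.predictions)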
But please note that the full JSON body (e.g. to send using curl) is:
{
  "instances": [
    {
      "prompt": prompt,
      "max_tokens": max_tokens,
      "temperature": temperature,
      "top_p": top_p,
      "top_k": top_k,
      "raw_response": raw_response
    }
  ]
}
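Putting that together, a complete curl request against the predict REST method linked above would look roughly like this (project ID, region, and endpoint ID are placeholders for your deployment; the field values match the notebook defaults):

# Placeholders: YOUR_PROJECT_ID, YOUR_ENDPOINT_ID, and the region (us-central1 here).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/endpoints/YOUR_ENDPOINT_ID:predict" \
  -d '{
    "instances": [
      {
        "prompt": "What is a car?",
        "max_tokens": 50,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1.0,
        "raw_response": false
      }
    ]
  }'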