I'm using Vertex AI's TextEmbeddingModel to calculate embeddings, and the first call shows significantly higher latency than the rest, presumably because something gets cached after it. This isn't context caching, though, and sdk_encode is called one text at a time. How can I warm up the system to reduce the initial delay?
import time
import google.generativeai as genai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

base_model_name = "text-embedding-004"
EMBED_TASK_TYPE = "RETRIEVAL_QUERY"
text_embedding_model = TextEmbeddingModel.from_pretrained(base_model_name)
model = TextEmbeddingModel.from_pretrained(base_model_name)

def sdk_encode(text):
    inputs = [TextEmbeddingInput(text.lower(), EMBED_TASK_TYPE)]
    kwargs = {}
    embeddings = model.get_embeddings(inputs, **kwargs)
    text_embeddings = [embedding.values for embedding in embeddings]
    return text_embeddings[0] if len(text_embeddings) == 1 else text_embeddings

queries = ["I want to take pto Monday", "I want to take pto Tuesday", "I want to take pto Friday"]

for i in range(3):
    query = queries[i]
    start_time = time.time()
    sdk_encode(query)
    end_time = time.time()
    sdk_delay = end_time - start_time
    print(f"Vertex SDK Latency for {query}: {sdk_delay}")
Output:
Vertex SDK Latency for I want to take pto Monday: 1.3506088256835938
Vertex SDK Latency for I want to take pto Tuesday: 0.12767696380615234
Vertex SDK Latency for I want to take pto Friday: 0.12481999397277832
I'm afraid I have not been able to replicate your finding (that the first call to the embedding model takes longer); for reference, I tried it in a Colab notebook.
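That said, if you do consistently see a slow first call in your environment, the simplest mitigation is to issue a single throwaway embedding request at startup, before any real queries are timed. A minimal sketch, reusing the sdk_encode function from your snippet:

import time

def warm_up():
    # Warm-up sketch: pay the first-call overhead with a dummy request whose result we discard.
    start = time.time()
    sdk_encode("warm up")
    print(f"Warm-up call took {time.time() - start:.3f}s")

warm_up()  # call once at application startup, before serving real queries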
However, I have noticed that there doesn't seem to be any reason to loop over the queries the way you are doing. When embedding just three strings, almost all of the latency comes from the repeated calls rather than from the embedding work itself.
From experimentation, nothing stops you from sending all your queries to Vertex AI in a single get_embeddings call. Judging from the fact that the response comes back no slower than any of your individual function calls, the library and service are handling the whole batch together in one request.
Try this:
import time
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-004")

queries = ["I want to take pto Monday", "I want to take pto Tuesday", "I want to take pto Friday"]
inputs = [TextEmbeddingInput(text.lower(), "RETRIEVAL_QUERY") for text in queries]

st = time.time()
embeddings = [e.values for e in model.get_embeddings(inputs)]
et = time.time()
print(f"Vertex SDK Latency: {et - st}")
Presumably, as you add more strings to the batch you will see slowdowns (as latency starts to become dominated by processing time, rather than I/O).
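To see where that crossover happens for your workload, you can time increasingly large batches. Here is a minimal sketch that assumes the model object from the snippet above; note that the service enforces a per-request limit on the number of texts, so very large batches would need to be split into chunks.

import time
from vertexai.language_models import TextEmbeddingInput

def time_batch(texts):
    # Embed a list of texts in one request and return the elapsed wall-clock time.
    inputs = [TextEmbeddingInput(t.lower(), "RETRIEVAL_QUERY") for t in texts]
    start = time.time()
    model.get_embeddings(inputs)
    return time.time() - start

for n in (1, 8, 32, 128):  # batch sizes to probe; adjust to your workload
    texts = [f"I want to take pto on day {i}" for i in range(n)]
    print(f"batch of {n}: {time_batch(texts):.3f}s")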
If this becomes a problem, it may be worth experimenting with different Vertex AI pretrained embedding models - some may be faster at generating embeddings than others.
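As a rough comparison, you could run the same batch of inputs through a couple of candidate models and time each one. The model names below are only examples; which ones are available depends on your project and region.

import time
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

queries = ["I want to take pto Monday", "I want to take pto Tuesday", "I want to take pto Friday"]
inputs = [TextEmbeddingInput(q.lower(), "RETRIEVAL_QUERY") for q in queries]

# Example model names; swap in whichever pretrained embedding models you want to evaluate.
for name in ("text-embedding-004", "text-multilingual-embedding-002"):
    model = TextEmbeddingModel.from_pretrained(name)
    start = time.time()
    model.get_embeddings(inputs)
    print(f"{name}: {time.time() - start:.3f}s")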