embedding, google-cloud-vertex-ai, google-gemini, vertex-ai-pipeline, google-gemini-context-caching

Latency issue using TextEmbeddingModel


I'm using Vertex AI's TextEmbeddingModel to calculate embeddings, and the first call shows significantly higher latency than the subsequent ones, presumably because something is cached after it. This isn't context caching, though, and sdk_encode is called on one query at a time. How can I warm up the system to reduce the initial delay?

    import time
    import google.generativeai as genai
    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
    base_model_name = "text-embedding-004"
    EMBED_TASK_TYPE = "RETRIEVAL_QUERY"
    
    text_embedding_model = TextEmbeddingModel.from_pretrained(base_model_name)
    model = TextEmbeddingModel.from_pretrained(base_model_name)
    
    def sdk_encode(text):
        inputs = [TextEmbeddingInput(text.lower(), EMBED_TASK_TYPE)]
        kwargs = {}
        embeddings = model.get_embeddings(inputs, **kwargs)
        text_embeddings = [embedding.values for embedding in embeddings]
        return text_embeddings[0] if len(text_embeddings) == 1 else text_embeddings
    
    queries = ["I want to take pto Monday", "I want to take pto Tuesday", "I want to take pto Friday"]
    for i in range(3): 
        query = queries[i]
        start_time = time.time()
        sdk_encode(query)
        end_time = time.time()
        sdk_delay = end_time - start_time
        print(f"Vertext SDK Latency for {query}: {sdk_delay}")

Output:

    Vertex SDK Latency for I want to take pto Monday: 1.3506088256835938
    Vertex SDK Latency for I want to take pto Tuesday: 0.12767696380615234
    Vertex SDK Latency for I want to take pto Friday: 0.12481999397277832

Solution

  • I'm afraid I have not been able to replicate your finding that the first call to the embedding model takes longer. For reference, I tried it in a Colab notebook.

    However, I have noticed that there is no need to loop over the queries the way you are doing. When you are embedding just three strings, almost all of the latency comes from making a separate call (and a separate network round trip) for each one.

    Nothing stops you from sending all of your queries to Vertex AI in a single statement: get_embeddings accepts a list of inputs. Judging from the fact that the batched response comes back no slower than any of your individual calls, the service is handling the inputs together rather than one at a time.

    Try this:

    import time
    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
    
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    
    queries = ["I want to take pto Monday", "I want to take pto Tuesday", "I want to take pto Friday"]
    
    # Build one input per query and embed them all in a single request
    inputs = [TextEmbeddingInput(text.lower(), "RETRIEVAL_QUERY") for text in queries]
    
    st = time.time()
    embeddings = [e.values for e in model.get_embeddings(inputs)]
    et = time.time()
    
    print(f"Vertex SDK Latency: {et - st}")
    

    Presumably, as you add more strings to the batch you will see slowdowns (as latency starts to be dominated by processing time rather than I/O), and a single request can only carry so many inputs, so for large workloads you would split the texts into chunks and send one request per chunk, as sketched below.
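
    A minimal sketch of that chunking, assuming a per-request limit of 250 inputs (embed_in_chunks and that figure are my own illustration; check the quota documented for the model you actually use):

    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
    
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    
    # Assumed per-request cap; verify against the documented limit for your model.
    BATCH_SIZE = 250
    
    def embed_in_chunks(texts, task_type="RETRIEVAL_QUERY", batch_size=BATCH_SIZE):
        # Embed any number of texts, sending at most batch_size inputs per request.
        vectors = []
        for start in range(0, len(texts), batch_size):
            chunk = texts[start:start + batch_size]
            inputs = [TextEmbeddingInput(text.lower(), task_type) for text in chunk]
            vectors.extend(e.values for e in model.get_embeddings(inputs))
        return vectors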

    If this becomes a problem, it may also be worth experimenting with different Vertex AI pretrained embedding models; some may be faster at generating embeddings than others. A rough way to compare them is sketched below.
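
    The comparison below simply times the same batch against each candidate. The model names are only examples (availability varies by project and region), so substitute whatever embedding models your project actually offers:

    import time
    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
    
    queries = ["I want to take pto Monday", "I want to take pto Tuesday", "I want to take pto Friday"]
    inputs = [TextEmbeddingInput(text.lower(), "RETRIEVAL_QUERY") for text in queries]
    
    # Example model names only; check which embedding models are available to your project.
    for name in ["text-embedding-004", "text-multilingual-embedding-002"]:
        model = TextEmbeddingModel.from_pretrained(name)
        model.get_embeddings(inputs)  # warm-up call, excluded from the timing
        st = time.time()
        model.get_embeddings(inputs)
        print(f"{name}: {time.time() - st:.3f}s")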