In production deployment of Hugging Face LLMs, the first inference call often has very high latency ("cold start"), even on a machine where the model is already loaded into memory.
Subsequent calls are much faster.
I want to implement a model warm-up strategy that works for both pipeline()-based and model.generate()-based inference. Here is my current setup:

from transformers import pipeline

generator = pipeline('text-generation', model="tiiuae/falcon-7b-instruct", device=0)

def generate_text(prompt):
    return generator(prompt, max_new_tokens=50)[0]['generated_text']
My Question:
What is the best way to warm up a HuggingFace Transformers model after loading, to minimize first-token latency in production inference?
Run a dummy inference immediately after loading the model. The first call typically pays one-time costs (CUDA context creation, kernel selection/compilation, and allocator growth), so triggering them with a throwaway prompt keeps that latency off the first real request.
For pipeline:
# Warmup
_ = generator("Warm up prompt", max_new_tokens=1)
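If the pipeline is served behind a web framework, it is usually better to run this warm-up once at process startup rather than on the first request. A minimal sketch, assuming FastAPI and the module-level generator defined above (the lifespan hook and the /generate route are illustrative, not part of transformers):

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Run the dummy generation before the server starts accepting traffic.
    _ = generator("Warm up prompt", max_new_tokens=1)
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/generate")
def generate(prompt: str):
    # Normal inference path; by now the expensive one-time work is done.
    return {"text": generator(prompt, max_new_tokens=50)[0]["generated_text"]}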
For raw model.generate():
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", device_map="auto", torch_dtype="auto")
# Warmup
inputs = tokenizer("Warm up prompt", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=1)
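A single one-token generation covers CUDA context creation and moving weights onto the device, but many kernels are specialized per input shape, so warming up with prompt lengths close to your production traffic can shave additional latency off the first real calls. A sketch under that assumption; the warm_up helper, the prompt lengths, and the max_new_tokens value are placeholders to tune for your workload:

import torch

def warm_up(model, tokenizer, lengths=(16, 128, 512), max_new_tokens=8):
    # Exercise shape-dependent kernels and cache allocations up front by
    # generating a few tokens at several representative prompt lengths.
    model.eval()
    with torch.inference_mode():
        for n in lengths:
            prompt = "warm " * n  # placeholder text, roughly n tokens long
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            _ = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        # Ensure all queued GPU work has finished before serving requests.
        torch.cuda.synchronize()

warm_up(model, tokenizer)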