In production deployment of Hugging Face LLMs, the first inference call often has very high latency ("cold start"), even on a machine where the model is already loaded into memory.
Subsequent calls are much faster.
I want to implement a model warm-up strategy that works for both pipeline()-based and model.generate()-based inference. Here is my current setup:

from transformers import pipeline

generator = pipeline('text-generation', model="tiiuae/falcon-7b-instruct", device=0)

def generate_text(prompt):
    return generator(prompt, max_new_tokens=50)[0]['generated_text']
My Question:
What is the best way to warm up a HuggingFace Transformers model after loading, to minimize first-token latency in production inference?
Run a dummy inference immediately after loading the model. The first call typically pays one-time costs (CUDA context creation, kernel selection/compilation, and allocator growth), so triggering them with a throwaway prompt keeps that latency off the first real request.
For pipeline:
# Warmup
_ = generator("Warm up prompt", max_new_tokens=1)
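If the pipeline is served behind a web framework, it is usually better to run this warm-up once at process startup rather than on the first request. A minimal sketch, assuming FastAPI and the module-level generator defined above (the lifespan hook and the /generate route are illustrative, not part of transformers):

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Run the dummy generation before the server starts accepting traffic.
    _ = generator("Warm up prompt", max_new_tokens=1)
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/generate")
def generate(prompt: str):
    # Normal inference path; by now the expensive one-time work is done.
    return {"text": generator(prompt, max_new_tokens=50)[0]["generated_text"]}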
For raw model.generate():
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", device_map="auto", torch_dtype="auto")
# Warmup
inputs = tokenizer("Warm up prompt", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=1)
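A single one-token generation covers CUDA context creation and moving weights onto the device, but many kernels are specialized per input shape, so warming up with prompt lengths close to your production traffic can shave additional latency off the first real calls. A sketch under that assumption; the warm_up helper, the prompt lengths, and the max_new_tokens value are placeholders to tune for your workload:

import torch

def warm_up(model, tokenizer, lengths=(16, 128, 512), max_new_tokens=8):
    # Exercise shape-dependent kernels and cache allocations up front by
    # generating a few tokens at several representative prompt lengths.
    model.eval()
    with torch.inference_mode():
        for n in lengths:
            prompt = "warm " * n  # placeholder text, roughly n tokens long
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            _ = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        # Ensure all queued GPU work has finished before serving requests.
        torch.cuda.synchronize()

warm_up(model, tokenizer)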