Currently, I am working with a PyTorch model locally using the following code:
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # first GPU if available, otherwise CPU
classify_model = pipeline("zero-shot-classification", model='models/zero_shot_4.7.0', device=device)
result = classify_model(text, [label], hypothesis_template=hypothesis)
score = result["scores"][0]  # the pipeline returns a dict, so "scores" is accessed by key
I have decided to try deploying this model with TorchServe on Vertex AI, following Google's documentation, but I have some concerns. For example, the MAR archive essentially just contains my model and tokenizer, and it is unpacked every time the container starts, creating a new folder each time and taking up more space. By default, TorchServe starts 5 workers, each of which loads a 2 GB copy of the model into memory, i.e. 10 GB of RAM in total. I haven't dug too deeply into it yet, but I believe load balancing is the responsibility of Vertex AI; please correct me if I'm wrong.

Would it be better to build a simple Flask + PyTorch + Transformers container based on an NVIDIA/CUDA image and use that in production, or do I still need TorchServe? In the future, the system should scale automatically and have the tools to handle high load. Perhaps there are other approaches in my case that do not involve a container at all.
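For reference, here is a rough sketch of what I imagine the Flask alternative could look like. It assumes the same local model path as above; the AIP_* environment variables are the ones Vertex AI sets for custom serving containers (port, health route, predict route), and the instance field names (text, labels, hypothesis) are just placeholders for this example.

import os

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup so each container instance holds a single copy in memory.
classify_model = pipeline(
    "zero-shot-classification",
    model="models/zero_shot_4.7.0",
    device=0,  # GPU 0; use -1 for CPU
)

# Vertex AI passes the serving port and routes through AIP_* environment variables.
AIP_HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
AIP_PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")


@app.route(AIP_HEALTH_ROUTE, methods=["GET"])
def health():
    return jsonify({"status": "ok"})


@app.route(AIP_PREDICT_ROUTE, methods=["POST"])
def predict():
    # Vertex AI sends {"instances": [...]} and expects {"predictions": [...]} back.
    instances = request.get_json()["instances"]
    predictions = []
    for instance in instances:
        result = classify_model(
            instance["text"],
            instance["labels"],
            hypothesis_template=instance.get("hypothesis", "This example is {}."),
        )
        predictions.append({"labels": result["labels"], "scores": result["scores"]})
    return jsonify({"predictions": predictions})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", 8080)))

In production this would presumably run behind gunicorn with a fixed number of workers (e.g. one per GPU) rather than Flask's development server, which would also give direct control over how many copies of the model end up in memory.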
I had been struggling with TorchServe for a long time. There was a lot I wasn't satisfied with: first of all, it's Java-based; secondly, the wait time for unpacking the MAR archive. All the workers loaded their models simultaneously, causing some of them to fail, and I couldn't see the resource usage of each worker. The last straw was that I couldn't deploy multiple models in one container in a way that would let me use that container on Vertex AI. After that, I decided to write my own version of TorchServe in Golang, which is significantly more agile and lightweight and avoids the drawbacks of TorchServe listed above. Now I use its base image as the model hub for all my Vertex AI models, and I no longer have any of these problems. I'm happy to share my work.