SageMaker seems to give examples of using two different serving stacks for serving custom docker images:
Could someone explain to me at a very high level (I have very little knowledge of network engineering) what responsibilities these different components have? And since the second stack has only two components instead of one, can I rightly assume that TensorFlow Serving does the job (whatever that may be) of both Gunicorn and Flask?
Lastly, I've read that it's possible to use Flask and TensorFlow serving at the same time. Would this then be NGINX -> Gunicorn -> Flask -> TensorFlow Serving? And what are there advantages of this?
I'll try to answer your question on a high level. Disclaimer: I'm not at an expert across the full stack of what you describe, and I would welcome corrections or additions from people who are.
I'll go over the different components from bottom to top:
TensorFlow Serving is a library for deploying and hosting TensorFlow models as model servers that accept requests with input data and return model predictions. The idea is to train models with TensorFlow, export them to the SavedModel format and serve them with TF Serving. You can set up a TF Server to accept requests via HTTP and/or RPC. One advantage of RPC is that the request message is compressed, which can be useful when sending large payloads, for instance with image data.
Flask is a python framework for writing web applications. It's much more general-purpose than TF Serving and is widely used to build web services, for instance in microservice architectures.
Now, the combination of Flask and TensorFlow serving should make sense. You could write a Flask web application that exposes an API to the user and calls a TF model hosted with TF Serving under the hood. The user uses the API to transmit some data (1), the Flask app perhaps transform the data (for example, wrap it in numpy arrays), calls the TF Server to get a model prediction (2)(3), perhaps transforms the prediction (for example convert a predicted probability that is larger than 0.5 to a class label of 1), and returns the prediction to the user (4). You could visualize this as follows:
Gunicorn is a Web Server Gateway Interface (WSGI) that is commonly used to host Flask applications in production systems. As the name says, it's the interface between a web server and a web application. When you are developing a Flask app, you can run it locally to test it. In production, gunicorn will run the app for you.
TF Serving will host your model as a functional application. Therefore, you do not need gunicorn to run the TF Server application for you.
Nginx is the actual web server, which will host your application, handle requests from the outside and pass them to the application server (gunicorn). Nginx cannot talk directly to Flask applications, which is why gunicorn is there.
This answer might be helpful as well.
Finally, if you are working on a cloud platform, the web server part will probably be handled for you, so you will either need to write the Flask app and host it with gunicorn, or setup the TF Serving server.