[SOLVED] What is the best way to determine the maximum concurrent requests per instance for your App Engine or Cloud Run app?

On Google Cloud App Engine and Google Cloud Run, what is the best way to determine the value for maximum concurrent requests per instance? The default for App Engine max_concurrent_requests is 10 and the default for Cloud Run is 80.

Ideally, you would want as large a value as possible so that fewer additional instances are spun up.

Is there a tool you can use to see how many requests you are averaging per instance and base it on that?

Or alternatively, would it be better to set maximum concurrent requests to the maximum value of 1000 and then manage scaling based solely on CPU usage? On App Engine this would be setting the target_cpu_utilization value.

Solution

Personally, I start with a concurrency value chosen by feel. Then I observe the golden metrics (CPU usage, latency, memory) to make sure that one instance is strong enough for typical traffic.
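If you want a starting number rather than a pure guess, one option is Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below is only an illustration. The function name, the `headroom` factor, and the sample figures are my own assumptions, not something App Engine or Cloud Run provides, and the inputs should come from a load test in which a single instance still looked healthy on the golden metrics.

```python
import math


def estimate_concurrency(requests_per_second: float,
                         avg_latency_seconds: float,
                         headroom: float = 1.5) -> int:
    """Estimate a starting concurrency setting for one instance.

    Little's law: average in-flight requests = arrival rate * average latency.
    `headroom` is a hypothetical safety multiplier for bursts (assumption).
    """
    if requests_per_second < 0 or avg_latency_seconds < 0:
        raise ValueError("rate and latency must be non-negative")
    if headroom < 1:
        raise ValueError("headroom must be at least 1")
    in_flight = requests_per_second * avg_latency_seconds
    # Never return less than 1: an instance must accept at least one request.
    return max(1, math.ceil(in_flight * headroom))


if __name__ == "__main__":
    # Hypothetical measurement: one instance serving 40 req/s at 250 ms average latency.
    print(estimate_concurrency(40, 0.25))
```

Treat the result as a first value to try, then adjust it while watching CPU, memory, and latency under real traffic.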

Another very important metric is the cold start time.

If your app starts very fast, you can afford to let the platform create many (small) instances, up to 1000.
If your instances start slowly and latency is critical, it's better to maximise the traffic handled by each instance: give it the maximum CPU/memory and a concurrency value consistent with that configuration.

From a financial perspective, small instances are more attractive because each additional instance incurs only a small increment in resources (and cost). If you have a BIG instance, each scale-out increment will cost a lot!
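To make the granularity argument concrete, here is a rough sketch comparing how much capacity gets provisioned for the same demand with small and big instances. All figures are hypothetical, and the model assumes the autoscaler adds instances purely on concurrency, ignoring CPU-based scaling and minimum-instance settings.

```python
import math


def provisioned_capacity(peak_concurrent_requests: int,
                         concurrency_per_instance: int,
                         vcpus_per_instance: int) -> tuple[int, int]:
    """Return (instances, total vCPUs) needed to absorb the peak.

    Simplified model (assumption): scaling is driven only by concurrency,
    so instances = ceil(peak / concurrency per instance).
    """
    if concurrency_per_instance < 1 or vcpus_per_instance < 1:
        raise ValueError("concurrency and vCPUs must be at least 1")
    instances = math.ceil(peak_concurrent_requests / concurrency_per_instance)
    return instances, instances * vcpus_per_instance


if __name__ == "__main__":
    peak = 170  # hypothetical peak of simultaneous requests
    # Same requests-per-vCPU ratio in both cases: 20 per vCPU.
    small = provisioned_capacity(peak, concurrency_per_instance=20, vcpus_per_instance=1)
    big = provisioned_capacity(peak, concurrency_per_instance=160, vcpus_per_instance=8)
    print("small instances:", small)
    print("big instances:", big)
```

With these numbers, the small configuration ends up with 9 instances and 9 vCPUs, while the big one needs 2 instances and 16 vCPUs, because the second big instance is mostly idle. The price is more cold starts along the way, which is why startup time matters so much in this trade-off.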

It's an optimization under constraints, and the best way is to test and experiment.