On Google Cloud App Engine and Google Cloud Run, what is the best way to determine the value for maximum concurrent requests per instance? The default for App Engine max_concurrent_requests is 10 and the default for Cloud Run is 80.
Ideally, you would want as large a value as possible to minimize the number of additional instances being spun up.
Is there a tool you can use to see how many requests you are averaging per instance and base it on that?
Or alternatively, would it be better to set maximum concurrent requests to the maximum value of 1000 and then manage scaling based solely on CPU usage? On App Engine this would be setting the target_cpu_utilization value.
Personally, I start the service with a value based on my own intuition. Then I observe the golden metrics (CPU usage, latency, memory) to be sure that one instance is strong enough for typical traffic.
Another very important metric is the cold start time.
From a financial perspective, it's more interesting to have small instances, because each additional instance adds only a small amount of resources (and cost). If you have a BIG instance, each scale-out increment will cost a lot!
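To make the cost argument concrete, here is a rough sketch of the arithmetic. It assumes that an instance's size (and cost) grows in proportion to the concurrency it is configured for, which is a simplification; the function names and the numbers are hypothetical, not measurements.

```python
import math


def instances_needed(in_flight_requests: int, concurrency: int) -> int:
    """Instances required to serve the load at a given per-instance concurrency."""
    return max(1, math.ceil(in_flight_requests / concurrency))


def idle_capacity(in_flight_requests: int, concurrency: int) -> int:
    """Request slots that are provisioned (and paid for) but not used."""
    provisioned = instances_needed(in_flight_requests, concurrency) * concurrency
    return provisioned - in_flight_requests


# Hypothetical load: 85 requests in flight at the same time.
load = 85
for concurrency in (10, 80):
    print(
        f"concurrency={concurrency}: "
        f"{instances_needed(load, concurrency)} instances, "
        f"{idle_capacity(load, concurrency)} idle slots"
    )
```

With small instances (10 concurrent requests each) the load fits in 9 instances with 5 idle slots. With big instances (80 each) it takes 2 instances and leaves 75 slots idle, because the second instance was started to absorb only 5 requests.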
It's an optimization under constraints, and the best way is to test and experiment.
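If you want a starting point for those experiments, one option is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average latency. The sketch below only illustrates that calculation; the function names, the headroom factor, and the sample numbers are my assumptions, and the real inputs would have to come from your own monitoring. The cap of 1000 is the maximum value mentioned in the question.

```python
import math


def estimate_in_flight(requests_per_second: float, avg_latency_seconds: float) -> float:
    """Little's Law: average concurrent requests = arrival rate * average latency."""
    return requests_per_second * avg_latency_seconds


def starting_concurrency(
    requests_per_second: float,
    avg_latency_seconds: float,
    instances: int = 1,
    headroom: float = 1.5,
    max_allowed: int = 1000,
) -> int:
    """Suggest an initial per-instance concurrency to test, not a final value.

    `headroom` is a hypothetical safety factor to absorb short bursts.
    """
    per_instance = estimate_in_flight(requests_per_second, avg_latency_seconds) / instances
    suggestion = math.ceil(per_instance * headroom)
    # Keep the result inside the range the platform accepts.
    return min(max(1, suggestion), max_allowed)


# Hypothetical traffic: 200 requests/s with 250 ms average latency on one instance.
print(estimate_in_flight(200, 0.25))
print(starting_concurrency(200, 0.25))
```

Treat the result as a first guess only: deploy with it, watch CPU, memory, and latency under real traffic, and adjust from there.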