Limit on the rate of inferences one can make for a SageMaker endpoint

Is there a limit on the rate of inferences one can make for a SageMaker endpoint?

Is it determined somehow by the instance type behind the endpoint or the number of instances?

I tried looking for this info as AWS Service Quotas for SageMaker but couldn't find it.

I am invoking the endpoint from a Spark job abd wondered if the number of concurrent tasks is a factor I should be taking care of when running inference (assuming each task runs one inference at a time)

Here's the throttling error I got:

com.amazonaws.services.sagemakerruntime.model.AmazonSageMakerRuntimeException: null (Service: AmazonSageMakerRuntime; Status Code: 400; Error Code: ThrottlingException; Request ID: b515121b-f3d5-4057-a8a4-6716f0708980)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.doInvoke(AmazonSageMakerRuntimeClient.java:236)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invoke(AmazonSageMakerRuntimeClient.java:212)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.executeInvokeEndpoint(AmazonSageMakerRuntimeClient.java:176)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invokeEndpoint(AmazonSageMakerRuntimeClient.java:151)
    at lineefd06a2d143b4016906a6138a6ffec15194.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$a5cddfc4633c5dd8aa603ddc4f9aad5$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Predictor.predict(command-2334973:41)
    at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
    at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:2000)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
    at org.apache.spark.scheduler.Task.run(Task.scala:113)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:533)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:539)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Solution

Amazon SageMaker is offering model hosting service (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html), which gives you a lot of flexibility based on your inference requirements.

As you noted, first you can choose the instance type to use for your model hosting. The large set of options is important to tune to your models. You can host the model on a GPU based machines (P2/P3/P4) or CPU ones. You can have instances with faster CPU (C4, for example), or more RAM (R4, for example). You can also choose instances with more cores (16xl, for example) or less (medium, for example). Here is a list of the full range of instances that you can choose: https://aws.amazon.com/sagemaker/pricing/instance-types/ . It is important to balance your performance and costs. The selection of the instance type and the type and size of your model will determine the invocations-per-second that you can expect from your model in this single-node configuration. It is important to measure this number to avoid hitting the throttle errors that you saw.

The second important feature of the SageMaker hosting that you use is the ability to auto-scale your model to multiple instances. You can configure the endpoint of your model hosting to automatically add and remove instances based on the load on the endpoint. AWS is adding a load balancer in front of the multiple instances that are hosting your models and distributing the requests among them. Using the autoscaling functionality allows you to keep a smaller instance for low traffic hours, and to be able to scale up during peak traffic hours, and still keep your costs low and your throttle errors to the minimum. See here for documentation on the SageMaker autoscaling options: https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html