The concurrency setting of Applications and how this setting can be used to optimise latency and throughput
IBM Cloud Code Engine is a fully managed, serverless platform that runs your containerized workloads, including web apps, microservices, event-driven functions, and batch jobs with run-to-completion characteristics. IBM Cloud Code Engine provides two compute models: Jobs and Applications. You can learn about these models by reading "IBM Cloud Code Engine: When to Use Jobs and Applications."
In this blog post, we will learn about the concurrency setting of Applications and how this setting can be used to optimise latency and throughput.
Concurrency determines the number of simultaneous requests that can be processed by each instance of an application at any given time (see the official Knative documentation for more information).
To control the concurrency per application revision, the user can set the concurrency value in the runtime section of the application details page. In the CLI, the user can configure the concurrency using the --concurrency flag when creating or updating applications with ic ce app create/update. The API specification allows you to set the containerConcurrency on the revision template (see the revision specification documentation).
Setting the container concurrency (cc) enforces an upper bound on the number of requests that are processed concurrently within an application instance. If concurrency reaches this limit, subsequent requests are buffered and must wait until enough capacity is free to execute them. Capacity is freed up when in-flight requests complete or when additional application instances are scaled up.
How the scaling of Applications works behind the scenes
The autoscaler (powered by Knative) observes the number of requests in the system and scales the application instances up and down in order to meet the user's concurrency setting. In particular, the autoscaler can scale applications to zero when no requests are reaching the application. In this case, no instance is running and no costs are incurred. If a request is routed to an application that is scaled to zero, the autoscaler scales the application up from zero and routes the request to the newly created application instance. To make this possible, the system has an internal buffer that queues requests until the application instance is ready to serve them.
Internally, the autoscaler uses a 60s sliding window and scales the application to meet the concurrency on average over that sliding window. Since the request rate can be very dynamic and can change significantly (e.g., a burst of requests), the autoscaler does not wait until the container concurrency is fully reached. It already scales up when 70% of the container concurrency is observed (internal configuration), and it scales down correspondingly when the observed concurrency drops below that target. In other words, if the user specifies a container concurrency of 10, the autoscaler will add an additional application instance when 7 concurrent requests on average are observed over the stable window period of 60s.
In case of a very significant increase in the request rate, the autoscaler enters panic mode. During panic mode, the autoscaler's feedback loop is much shorter (6s sliding window) and the scaling policy is much more aggressive (i.e., it scales up more quickly in order to meet the 70% target of the container concurrency within the 6s panic window). The autoscaler enters panic mode when 200% of the container concurrency is observed (internal configuration). In other words, if the user configured a container concurrency of 10, panic mode will be entered when 20 requests are observed in the system.
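The thresholds described above can be written down as simple arithmetic. The following is a minimal sketch, not the autoscaler's actual implementation: the 70% and 200% factors and the window lengths are taken from the text, while the function names and the three mode labels are hypothetical and only serve the illustration.

```python
# Internal configuration values as described in the text above.
TARGET_UTILIZATION = 0.70  # scale up at 70% of the container concurrency
PANIC_FACTOR = 2.0         # enter panic mode at 200% of the container concurrency
STABLE_WINDOW_S = 60       # stable sliding window, in seconds
PANIC_WINDOW_S = 6         # panic sliding window, in seconds


def scale_up_threshold(container_concurrency: int) -> float:
    """Average concurrency at which an additional instance is added."""
    return container_concurrency * TARGET_UTILIZATION


def panic_threshold(container_concurrency: int) -> float:
    """Observed concurrency at which panic mode is entered."""
    return container_concurrency * PANIC_FACTOR


def autoscaler_mode(observed: float, container_concurrency: int) -> str:
    """Classify an observed concurrency (hypothetical labels, simplified model)."""
    if observed >= panic_threshold(container_concurrency):
        return "panic"
    if observed >= scale_up_threshold(container_concurrency):
        return "scale-up"
    return "steady"


if __name__ == "__main__":
    for observed in (3, 7, 20):
        print(observed, autoscaler_mode(observed, container_concurrency=10))
```

With a container concurrency of 10, this reproduces the two examples from the text: 7 concurrent requests cross the scale-up threshold, and 20 cross the panic threshold.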
It is possible to configure scaling boundaries for the autoscaler by using the --min-scale and --max-scale flags when creating or updating applications with ic ce app create/update. These flags correspond to the following two annotations:
- autoscaling.knative.dev/minScale (--min-scale): The minimum number of application instances to keep running. When set to 0 (default), the autoscaler will remove all instances when no traffic is reaching the application.
- autoscaling.knative.dev/maxScale (--max-scale): The maximum number of application instances that will be running. The autoscaler does not scale up beyond that value.
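To illustrate how the two boundaries interact with the concurrency setting, here is a simplified model of the instance count the autoscaler would aim for. It assumes the desired count is the observed concurrency divided by the per-instance target (70% of the container concurrency, as described earlier) and then clamped to the configured boundaries. The function and parameter names are hypothetical, and the real autoscaler takes more factors into account.

```python
import math


def desired_instances(
    observed_concurrency: float,
    container_concurrency: int,
    *,
    min_scale: int,
    max_scale: int,
    target_utilization: float = 0.70,  # assumption: internal value from the text
) -> int:
    """Estimate the instance count, clamped to [min_scale, max_scale]."""
    if observed_concurrency <= 0:
        wanted = 0  # no traffic: candidate for scale to zero
    else:
        per_instance_target = container_concurrency * target_utilization
        wanted = math.ceil(observed_concurrency / per_instance_target)
    # The autoscaler never goes below min_scale or above max_scale.
    return max(min_scale, min(wanted, max_scale))


if __name__ == "__main__":
    # No traffic, minScale 0: the application is scaled to zero.
    print(desired_instances(0, 10, min_scale=0, max_scale=5))
    # No traffic, minScale 1: one instance is kept running.
    print(desired_instances(0, 10, min_scale=1, max_scale=5))
    # Heavy traffic: the demand exceeds maxScale, so the result is capped.
    print(desired_instances(70, 10, min_scale=0, max_scale=5))
```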
How to optimise latency and throughput
The following sections explain a few examples and best practices to configure the container concurrency (cc):
- Single-concurrency, cc=1: A developer should choose the single-concurrency model when the application serves a memory- or CPU-intensive workload, because only one request enters the application instance at a time and therefore gets the full amount of CPU and memory configured for the instance. Requests do not compete for resources at the same point in time. A drawback of the single-concurrency model is that the application scales out more quickly. The scale-out might introduce additional latency and lower throughput, because it is more expensive to create a new application instance than to reuse an existing one. Therefore, a developer should NOT choose this model if requests can be processed concurrently and latency is a critical aspect of the application.
- High-concurrency, cc=100 (default) or higher: A developer should choose this configuration when the application serves a high volume of HTTP request/response workloads, where requests are not CPU- or memory-intensive and mostly wait for I/O. An example is an API backend that reads and writes data to a remote database for CRUD operations. While some requests wait on I/O, other requests can be processed without impacting overall latency and throughput. This setting is NOT optimal when concurrent requests would compete for CPU, memory, or I/O, because that would delay execution and impact latency and throughput negatively.
- Optimal-concurrency, cc=N: Some application developers have a very good understanding of the resource requirements of their applications and, therefore, know the amount of resources required for a single request to meet the desired response time for the application. A typical example is a natural language translation application, where the machine learning model for the language translation is 32 GB, and a single translation computation needs about 0.7 vCPU per request. The developer could choose a configuration of 9 vCPUs and 32 GB of memory per instance. The optimal container concurrency would be about 13 (9 vCPU / 0.7 vCPU ≈ 12.9). Be careful about setting arbitrary values when the behaviour is not exactly known and understood. The wrong container concurrency can lead to either too-aggressive or too-lazy scaling, which may impact the latency, error rate, and costs of your application. Use the steps below to determine an optimal container concurrency value.
- Infinite-concurrency, cc=0 (disabled): This is only documented for the sake of completeness, since Knative supports this setting and users might expect it to be supported in IBM Cloud Code Engine as well, which is not the case. The setting tries to forward as many requests as possible to a single application instance, which delays the scaling of additional application instances. In various tests and analyses, we have seen a higher error rate and higher latency. We, therefore, disabled this setting in IBM Cloud Code Engine to protect users from unexpected behaviour.
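The arithmetic behind the optimal-concurrency example in the list above can be sketched as follows. This is only an illustration of the CPU-bound estimate from the translation example. The function name is hypothetical, and it assumes CPU is the limiting resource and that memory (the 32 GB model) is shared across requests.

```python
def concurrency_from_cpu(instance_vcpu: float, vcpu_per_request: float) -> float:
    """Estimate how many requests fit into one instance's CPU budget."""
    if vcpu_per_request <= 0:
        raise ValueError("vcpu_per_request must be positive")
    return instance_vcpu / vcpu_per_request


if __name__ == "__main__":
    estimate = concurrency_from_cpu(instance_vcpu=9, vcpu_per_request=0.7)
    print(f"{estimate:.2f}")  # 12.86
    print(round(estimate))    # 13
```

Rounding to 13 slightly oversubscribes the CPU (13 × 0.7 = 9.1 vCPU). Rounding down to 12 is the more conservative choice if latency matters more than density.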
How to determine the container concurrency for your container
The container concurrency (cc) has a direct impact on the success rate, latency, and throughput of the application (as we have seen above). When the container concurrency value is too high for the application to handle, the client will see a negative impact on latency and throughput and might even observe 502 and 503 error responses, temporarily.
The same can happen when the container concurrency value is too low for the application, because that causes the system to scale out the application more quickly and spread the requests across many application instances. This might introduce additional costs and latency overhead as well. During a burst of load, this could also lead to temporary 502 responses when the internal buffers of the system overflow.
The optimal container concurrency value is determined by the maximum number of concurrent requests the application can handle with an acceptable request latency.
The following procedure can be used to approximate a good container concurrency value for the application:
1. Create an application and set its cc=1000 (max) and both minScale and maxScale to 1.
2. Use a load-generation tool like vegeta or wrk to generate load against the application. At first, send requests at a high rate. If there are 502 errors, decrease the rate until the result shows a 100% success rate.
3. Now, take the request latency from the output of Step 2. If the request latency is not acceptable, further decrease the request rate until the request latency looks acceptable. Note that the request duration plays an important role (e.g., it makes a big difference if the computation of the request takes 2s or 100ms).
4. To calculate the container concurrency value for your application, take the RATE from Step 2 (in req/s) and multiply it by the LATENCY from Step 3 (in s): CC = RATE × LATENCY. This follows Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. For example, if the rate is 80 req/s and the latency is 2s, the resulting concurrency is CC = 80 req/s × 2s = 160.
5. Now update the application to set the container concurrency to the value from the previous step (160), and rerun the workload to check if the success rate and latency are acceptable.
6. Experiment with the application by setting the container concurrency to a slightly larger value and see if it still achieves an acceptable success rate and latency.
7. Finally, once the optimal container concurrency value is found, remove the minScale and maxScale boundaries to allow the application to scale automatically.
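Several steps of this procedure ask whether a load-test run shows an acceptable success rate and latency. A small helper can make that check repeatable. The sketch below assumes you have already exported the per-request results of your load tool into a list of (status code, latency) samples. The Sample type, the function names, and the choice of a nearest-rank percentile are all hypothetical and not part of vegeta, wrk, or Code Engine.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    status: int       # HTTP status code of the response
    latency_s: float  # request latency in seconds


def success_rate(samples: Sequence[Sample]) -> float:
    """Fraction of requests that returned a 2xx status."""
    ok = sum(1 for s in samples if 200 <= s.status < 300)
    return ok / len(samples)


def percentile_latency(samples: Sequence[Sample], pct: int) -> float:
    """Nearest-rank percentile of the request latency (pct in 1..100)."""
    ordered = sorted(s.latency_s for s in samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_is_acceptable(
    samples: Sequence[Sample],
    max_latency_s: float,
    min_success_rate: float = 1.0,  # the procedure asks for a 100% success rate
    pct: int = 99,
) -> bool:
    """True if the run meets both the success-rate and the latency target."""
    return (
        success_rate(samples) >= min_success_rate
        and percentile_latency(samples, pct) <= max_latency_s
    )


if __name__ == "__main__":
    run = [Sample(200, 0.4), Sample(200, 0.6), Sample(200, 1.8), Sample(502, 2.5)]
    print(success_rate(run))                          # 0.75
    print(run_is_acceptable(run, max_latency_s=2.0))  # False
```

If a run is not acceptable, lower the request rate (Steps 2 and 3) or the container concurrency (Steps 5 and 6) and measure again.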
Learn more and try it out
Ready to give it a try? Head on over to the “Getting started” section of our documentation.
Be sure to try IBM Cloud Code Engine out today (it’s completely free while in beta).