Managing the model gateway (preview)
Manage existing connections and models, enable load balancing to distribute traffic across multiple models, create access policies to define which groups can access specific resources, and set rate limits to prevent request overload.
Requirements
- To manage the model gateway, you must have one of the following permissions:
  - Administrator platform
  - Manage configurations
Managing existing connections and models
To manage existing connections and models:
- Select the Model provider tab. The list of connections opens.
- Use the search input field to find one of the connections and then:
  - Click the Edit icon and then select either Edit credentials or Edit models.
  - Click the Delete icon.
If a provider was created by using the API, the secret that is associated with the provider is not populated automatically. You must manually select the secret that was used when the provider was created. Secrets that were created by using the API have this format: mg-<connection name>-<six random characters>.
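When you select the secret manually, the naming format can help narrow the candidates. The following is a minimal sketch, not part of the gateway itself: candidate_secrets is a hypothetical helper, and the list of secret names is assumed to come from wherever you store secrets. It relies only on the format stated above and makes no assumption about which characters the six-character suffix uses.

```python
def candidate_secrets(connection_name, secret_names):
    """Return names that match mg-<connection name>-<six characters>."""
    prefix = f"mg-{connection_name}-"
    return [
        name for name in secret_names
        # The suffix must be exactly six characters long.
        if name.startswith(prefix) and len(name) == len(prefix) + 6
    ]


# Hypothetical secret names, for illustration only.
names = ["mg-my-conn-a1b2c3", "mg-my-conn-extra-a1b2c3", "unrelated-secret"]
matches = candidate_secrets("my-conn", names)
```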
Managing access policies
By default, the model gateway can communicate with all model providers. Restrict access to only the providers that you want to use. Use policies to keep providers, models, and load balancers scoped to intended user groups and prevent unintended exposure.
Adding access policies from the model gateway UI
To add an access policy from the model gateway UI:
- Select the Access control tab and then click Assign access.
- Click Select access group and then select one of the available access groups. If you don't have any access groups set up, see Setting up IAM access groups.
- From the Resource type menu, select Model or Load balancer.
- Click Resource and select the resource that the access policy applies to.
- Select the action and permission type and then click Create.
The added access policies are listed in the Access control tab. Use the search input field to find them and the Sort menu to sort them by the available criteria.
You can also edit existing access policies.
Some configuration options for the model gateway might be accessible only programmatically and have effects that apply only to programmatic access.
Adding access policies programmatically
Create an access policy that grants the group that is specified in the subject parameter permission to access a resource. Resources can be the following:
- model:<uuid>
- provider:<uuid>
- loadbalancer:<uuid>
To manage policies, use policy with the action read or write. For example, the following command grants read access to the model that you specify by its UUID:
curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/policies" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"subject": "<group_id>",
"resource": "model:<uuid>",
"action": "read",
"effect": "allow"
}'
To view created policies, run the following command:
curl -sS -H "Authorization: Bearer ${TOKEN}" "https://<region>.<cloud-provider-domain>/ml/gateway/v1/policies"
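If you generate policy requests from a script, it can help to build the request body in one place. The following is a minimal sketch under stated assumptions: build_policy is a hypothetical helper, not part of the gateway API, and it accepts only the resource prefixes and actions named in this section. The field names mirror the curl example above.

```python
import json

# Resource prefixes and actions listed in this section.
RESOURCE_TYPES = ("model", "provider", "loadbalancer")
ACTIONS = ("read", "write")


def build_policy(subject, resource_type, resource_uuid, action="read"):
    """Return a policy body shaped like the curl example (hypothetical helper)."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type}")
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return {
        "subject": subject,
        "resource": f"{resource_type}:{resource_uuid}",
        "action": action,
        "effect": "allow",
    }


# Placeholder group ID and UUID, for illustration only.
body = build_policy("<group_id>", "model", "11111111-1111-1111-1111-111111111111")
payload = json.dumps(body)  # pass this string to curl -d or an HTTP client
```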
Managing load balancers
Enable load balancing to distribute inference requests across multiple backend instances. Use a single stable alias while scaling backend capacity to handle high traffic.
Creating a load balancer from the model gateway UI
To create a load balancer from the model gateway UI:
- Select the Rules tab.
- Inside the Rules tab, select the Load balancer tab and then click Create load balancer.
- Type the name for your new load balancer and then select the balancer type:
- Round Robin: Distributes requests evenly across all models in sequence
- Least Connections: Routes requests to the model with the fewest active connections
- Weighted Round Robin: Distributes requests based on assigned weights
- Quota Priority: Routes based on quota limits and priority levels
- Click Select models to select the models that the new load balancer applies to.
- Optional: Set additional parameters if required by the selected balancer type.
- Click Create to create a new load balancer.
The added load balancers are listed in the Load balancer tab. Use the search input field to find them and the Sort menu to sort them by the available criteria.
Modifying an existing load balancer
To modify an existing load balancer, in the Load balancer tab, use the search input field to find the load balancer entry and then:
- Click Edit, make your updates, and then click Update.
- Click Delete.
Some configuration options for the model gateway might be accessible only programmatically and have effects that apply only to programmatic access.
Creating a load balancer programmatically
Get the model UUIDs that you want to route to by using this command:
curl -sS -H "Authorization: Bearer ${TOKEN}" "https://<region>.<cloud-provider-domain>/ml/gateway/v1/models" | jq '.data[] | {id, uuid}'
Available load balancer types:
- Round Robin (round_robin): Distributes requests evenly across all models in sequence
- Least Connections (least_connections): Routes requests to the model with the fewest active connections
- Weighted Round Robin (weighted_round_robin): Distributes requests based on assigned weights
- Quota Priority (quota_priority): Routes based on quota limits and priority levels
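To make the first three strategies concrete, the following sketch shows one simple way each could select a model. It is an illustration of the general techniques only, not the gateway's implementation: the function names are hypothetical, and a real weighted round robin may interleave backends differently than this naive version, which repeats each backend once per unit of weight.

```python
from itertools import cycle


def round_robin(model_ids):
    """Cycle through the models in a fixed, repeating sequence."""
    return cycle(model_ids)


def weighted_round_robin(backends):
    """Repeat each backend `weight` times per cycle (weight defaults to 1)."""
    expanded = []
    for backend in backends:
        expanded.extend([backend["model_uuid"]] * backend.get("weight", 1))
    return cycle(expanded)


def least_connections(active):
    """Pick the model with the fewest active connections.

    `active` maps a model ID to its current connection count.
    """
    return min(active, key=active.get)


rr = round_robin(["a", "b"])
rr_order = [next(rr) for _ in range(4)]

wrr = weighted_round_robin([{"model_uuid": "a", "weight": 2}, {"model_uuid": "b"}])
wrr_order = [next(wrr) for _ in range(6)]
```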
Some load balancer types allow additional parameters:
- For weighted_round_robin, set weight in each backend (positive integer, defaults to 1 for backends without an explicit weight). Example:
  {"algorithm": "weighted_round_robin", "backends": [{"model_uuid": "...", "weight": 5}]}
- For quota_priority, set quota (maximum concurrent connections, ≥0) and priority (order, lower = first) for each backend. Providers are tried in priority order (0→1→2), and the first one that is under its quota is selected. Example:
  {"algorithm": "quota_priority", "backends": [{"model_uuid": "...", "quota": 10, "priority": 0}]}
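The quota_priority rule described above can be sketched as follows. This is a minimal illustration, assuming that "under quota" means the number of active connections is strictly less than the quota; pick_quota_priority and the active mapping are hypothetical names, and the gateway's actual behavior when every backend is at quota is not described here.

```python
def pick_quota_priority(backends, active):
    """Return the first backend, in ascending priority order, that is under quota.

    `active` maps a model UUID to its current number of connections.
    Returns None when every backend has reached its quota.
    """
    for backend in sorted(backends, key=lambda b: b["priority"]):
        if active.get(backend["model_uuid"], 0) < backend["quota"]:
            return backend["model_uuid"]
    return None


# Placeholder IDs, for illustration only.
backends = [
    {"model_uuid": "model-b", "quota": 5, "priority": 1},
    {"model_uuid": "model-a", "quota": 10, "priority": 0},
]
```

With no active connections, model-a is chosen because it has the lowest priority value. When model-a already holds 10 connections, the selection falls through to model-b.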
Example: Creating a load balancer with round_robin type:
curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/load-balancers" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"name": "primary-router",
"alias": "chat-balanced",
"algorithm": "round_robin",
"backends": [
{"model_uuid": "11111111-1111-1111-1111-111111111111"},
{"model_uuid": "22222222-2222-2222-2222-222222222222"}
]
}'
Example: Creating a load balancer with quota_priority type:
curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/load-balancers" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"name": "quota-priority-router",
"alias": "qp-balanced",
"algorithm": "quota_priority",
"backends": [
{
"model_uuid": "11111111-1111-1111-1111-111111111111",
"quota": 10,
"priority": 0,
"weight": 1
},
{
"model_uuid": "22222222-2222-2222-2222-222222222222",
"quota": 5,
"priority": 1,
"weight": 1
}
]
}'
Managing rate limits
Set request-based rate limits to control the number of API requests that can be made within a specific time frame. With rate limits, you can prevent excessive workloads from exhausting shared capacity and ensure fair use of capacity across your providers.
Managing rate limits from the model gateway UI
To manage rate limits from the model gateway UI:
- Select the Rules tab.
- Inside the Rules tab, select the Rate limit tab and then click Create rate limit.
- Select the scope for the new rate limit. Choose one of:
  - Model
  - Provider
  - Tenant
- Click Add model, select a model, and click Apply. Then click Next.
- Set rate limits and then click Create to create a new rate limit.
Available settings for rate limits
When creating a rate limit, you can configure these types of limits to control API usage:
- Request Limits
- Token Limits
Enable Request Limit controls the number of API requests that are allowed within a specified time period.
Configuration fields:
- Request rate (required): Number of requests that are allowed per duration
- Capacity: Maximum bucket size in the token bucket algorithm. This setting defines the total number of requests that can accumulate at any given time. The capacity determines:
  - Burst Allowance: How many requests can be made immediately when the bucket is full
  - Accumulation Limit: Tokens refill up to capacity but never exceed it
  - Works with the request rate (tokens added per interval) and duration (refill interval)
- Duration: Time window for the rate limit
Enable Token Limit controls the number of tokens that are consumed within a specified period.
Configuration fields:
- Request rate (required): Number of tokens that are allowed per duration
- Capacity: Maximum bucket size in the token bucket algorithm. This setting defines the total number of tokens that can accumulate at any given time. The capacity determines:
  - Burst Allowance: How many tokens can be consumed immediately when the bucket is full
  - Accumulation Limit: Tokens refill up to capacity but never exceed it
  - Works with the token rate (tokens added per interval) and duration (refill interval)
- Duration: Time window for the token limit
Example configurations:
- Request Limit: 1000 requests per 1 hour with burst capacity of 100
- Token Limit: 50000 tokens per 30 minutes with burst capacity of 5000
Understanding capacity with an example:
Consider a rate limit configuration with:
- Capacity: 100
- Request rate: 10
- Duration: 1 minute
This configuration means that:
- You can make a burst of 100 requests immediately when the bucket is full
- The bucket refills at a rate of 10 tokens per minute
- After using all 100 tokens, it takes 10 minutes to fully recover (100 tokens ÷ 10 tokens/minute = 10 minutes)
- When the bucket reaches 100 tokens, it stops accumulating (the capacity limit)
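The arithmetic in this example can be checked with a small token bucket simulation. This is a sketch of the general algorithm, not the gateway's implementation: the TokenBucket class is hypothetical, time is passed in explicitly as seconds, and the refill is assumed to be continuous (proportional to elapsed time) rather than applied once per interval.

```python
class TokenBucket:
    """Token bucket with a capacity and a refill of `amount` tokens per `duration_s`."""

    def __init__(self, capacity, amount, duration_s, now=0):
        self.capacity = capacity
        self.amount = amount
        self.duration_s = duration_s
        self.tokens = capacity  # assumption: the bucket starts full
        self.last = now

    def allow(self, now, cost=1):
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = now - self.last
        refill = elapsed * self.amount / self.duration_s
        self.tokens = min(self.capacity, self.tokens + refill)  # never exceed capacity
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Capacity 100, request rate 10, duration 1 minute, as in the example above.
bucket = TokenBucket(capacity=100, amount=10, duration_s=60)
burst = sum(bucket.allow(now=0) for _ in range(101))          # 101 attempts at t = 0
after_minute = sum(bucket.allow(now=60) for _ in range(11))   # 11 attempts at t = 1 min
after_refill = sum(bucket.allow(now=660) for _ in range(101)) # 10 idle minutes later
```

In this simulation, burst is 100 (the 101st request is rejected), after_minute is 10 (one minute of refill), and after_refill is 100 again, which matches the 10-minute recovery time calculated above.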
Modifying an existing rate limit
To modify an existing rate limit, in the Rate limit tab, use the search input field to find the rate limit entry and then:
- Click Edit, make your updates, and then click Save.
- Click Delete.
Some configuration options for the model gateway might be accessible only programmatically and have effects that apply only to programmatic access.
Managing rate limits programmatically
To add a tenant-wide limit, run a command like the following:
curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/rate-limits" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"type": "tenant",
"request": {"capacity": 60, "amount": 10, "duration": "1m"}
}'
To add a provider-scoped limit, run a command like the following:
curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/rate-limits" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"type": "provider",
"provider_uuid": "8c7c5cb7-6b20-4d2e-b7c5-3c0e888b2e2b",
"request": {"capacity": 30, "amount": 5, "duration": "1m"}
}'
To add a model-scoped limit, run a command like the following:
curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/rate-limits" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
"type": "model",
"model_uuid": "7fa969e7-d315-4777-b2f7-5ea5500bf211",
"request": {"capacity": 15, "amount": 3, "duration": "1m"}
}'
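The three request bodies above differ only in the scope and in which UUID field they carry. The following sketch builds them from one function; rate_limit_body is a hypothetical helper, and the field names are copied from the curl examples above rather than from a complete API schema.

```python
# Maps each scope to the UUID field it requires, based on the examples above.
SCOPE_ID_FIELD = {"tenant": None, "provider": "provider_uuid", "model": "model_uuid"}


def rate_limit_body(scope, capacity, amount, duration, target_uuid=None):
    """Return a rate-limit request body for the given scope (hypothetical helper)."""
    if scope not in SCOPE_ID_FIELD:
        raise ValueError(f"unknown scope: {scope}")
    id_field = SCOPE_ID_FIELD[scope]
    if id_field and not target_uuid:
        raise ValueError(f"scope {scope!r} requires a UUID")
    body = {"type": scope}
    if id_field:
        body[id_field] = target_uuid
    body["request"] = {"capacity": capacity, "amount": amount, "duration": duration}
    return body


tenant_limit = rate_limit_body("tenant", 60, 10, "1m")
model_limit = rate_limit_body(
    "model", 15, 3, "1m", target_uuid="7fa969e7-d315-4777-b2f7-5ea5500bf211"
)
```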
What to do next
You can now send requests to models through the model gateway. For details, see Inferencing gateway models.
Learn more
- For more details on managing the model gateway, see the Model gateway API documentation.