Managing the model gateway (preview)

Manage existing connections and models, enable load balancing to distribute traffic across multiple models, create access policies to define which groups can access specific resources, and set rate limits to control request volume and prevent overload.

Requirements

To manage the model gateway, you must have one of the following permissions:

Administer platform
Manage configurations

Managing existing connections and models

To manage existing connections and models:

Select the Model provider tab. The list of connections opens.
Use the search input field to find a connection and then do one of the following:
- To update the connection, click the Edit icon and then select either Edit credentials or Edit models.
- To remove the connection, click the Delete icon.

Note:

If a provider was created by using the API, the secret that is associated with this provider is not populated automatically. You must manually select the secret that was used when the provider was created. Secrets that were created by using the API have this format: mg-<connection name>-<six random characters>.
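
When you pick the secret by hand, the naming format above can be matched with a shell pattern. The following is a minimal sketch only: the function name secret_matches_connection is hypothetical, and it assumes that the six trailing characters can be any characters, because the character set is not stated here.

```shell
# Hypothetical helper: succeeds when SECRET follows the documented format
# mg-<connection name>-<six characters> for the given CONNECTION.
# Assumption: the six trailing characters may be any characters.
secret_matches_connection() {
  secret=$1
  connection=$2
  case "$secret" in
    # The quoted part is matched literally; each ? matches one character.
    "mg-${connection}-"??????) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, secret_matches_connection "mg-openai-prod-a1b2c3" "openai-prod" succeeds, while a name with only four trailing characters does not. Both names are illustrative.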

Managing access policies

By default, the model gateway can communicate with all model providers. Restrict access to only the providers that you want to use. Use policies to keep providers, models, and load balancers scoped to intended user groups and prevent unintended exposure.

Adding access policies from the model gateway UI

To add an access policy from the model gateway UI:

Select the Access control tab and then click Assign access.
Click Select access group and then select one of the available access groups. If you don't have any access groups set up, see Setting up IAM access groups.
From the Resource type menu, select Model or Load balancer.
Click Resource and select the resource that the access policy applies to.
Select the action and permission type and then click Create.

The added access policies are listed in the Access control tab. Use the search input field to find them and the Sort menu to sort them by the available criteria.

You can also edit existing access policies.

Note:

Some configuration options for the model gateway might be accessible only programmatically and have effects that apply only to programmatic access.

Adding access policies programmatically

An access policy grants the group that is specified in the subject parameter permission to access a resource. A resource takes one of the following forms:

model:<uuid>
provider:<uuid>
loadbalancer:<uuid>

To manage policies, use policy with action read or write. For example, the following command grants read access to the model that you specify by using the uuid:

curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/policies" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
    "subject": "<group_id>",
    "resource": "model:<uuid>",
    "action": "read",
    "effect": "allow"
}'

To view created policies, run the following command:

curl -sS -H "Authorization: Bearer ${TOKEN}" "https://<region>.<cloud-provider-domain>/ml/gateway/v1/policies"
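
The resource forms and actions described in this section can be checked locally before a policy is posted. The following is a minimal sketch, not part of the gateway tooling: policy_payload is a hypothetical helper, it always sets effect to allow because that is the only value shown here, and it does not escape special characters in its arguments.

```shell
# Hypothetical helper: print a policy payload after checking that the
# resource uses one of the documented prefixes and the action is read or write.
# Assumption: arguments contain no characters that need JSON escaping.
policy_payload() {
  subject=$1 resource=$2 action=$3
  case "$resource" in
    model:?*|provider:?*|loadbalancer:?*) ;;
    *) echo "unsupported resource: $resource" >&2; return 1 ;;
  esac
  case "$action" in
    read|write) ;;
    *) echo "unsupported action: $action" >&2; return 1 ;;
  esac
  printf '{"subject":"%s","resource":"%s","action":"%s","effect":"allow"}\n' \
    "$subject" "$resource" "$action"
}

# Usage (not run here): pipe the payload to the documented endpoint.
# policy_payload "<group_id>" "model:<uuid>" read |
#   curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/policies" \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer ${TOKEN}" \
#     -d @-
```

Rejecting an unknown prefix or action locally gives a clearer error than a failed request, but the gateway remains the authority on what it accepts.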

Managing load balancers

Enable load balancing to distribute inference requests across multiple backend instances. Use a single stable alias while scaling backend capacity to handle high traffic.

Creating a load balancer from the model gateway UI

To create a load balancer from the model gateway UI:

Select the Rules tab.
Inside the Rules tab, select the Load balancer tab and then click Create load balancer.
Type the name for your new load balancer and then select the balancer type:
- Round Robin: Distributes requests evenly across all models in sequence
- Least Connections: Routes requests to the model with the fewest active connections
- Weighted Round Robin: Distributes requests based on assigned weights
- Quota Priority: Routes based on quota limits and priority levels
Click Select models to select the models that the new load balancer applies to.
Optional: Set additional parameters if required by the selected balancer type.
Click Create to create a new load balancer.

The added load balancers are listed in the Load balancers tab. Use the search input field to find them and the Sort menu to sort them by the available criteria.

Modifying an existing load balancer

To modify an existing load balancer, in the Load balancer tab, use the search input field to find the load balancer entry and then do one of the following:

To change the load balancer, click Edit, make your updates, and then click Update.
To remove the load balancer, click Delete.

Note:

Some configuration options for the model gateway might be accessible only programmatically and have effects that apply only to programmatic access.

Creating a load balancer programmatically

To get the UUIDs of the models that you want to route to, run the following command:

curl -sS -H "Authorization: Bearer ${TOKEN}" "https://<region>.<cloud-provider-domain>/ml/gateway/v1/models" | jq '.data[] | {id, uuid}'

Available load balancer types:

Round Robin (round_robin): Distributes requests evenly across all models in sequence
Least Connections (least_connections): Routes requests to the model with the fewest active connections
Weighted Round Robin (weighted_round_robin): Distributes requests based on assigned weights
Quota Priority (quota_priority): Routes based on quota limits and priority levels

Some load balancer types allow additional parameters:

For weighted_round_robin, set weight in each backend. The weight must be a positive integer and defaults to 1 for backends without an explicit weight.

Example: {"algorithm": "weighted_round_robin", "backends": [{"model_uuid": "...", "weight": 5}]}

For quota_priority, set quota (the maximum number of concurrent connections, 0 or greater) and priority (the order in which backends are tried, lowest value first) for each backend. Backends are tried in priority order (0, then 1, then 2), and the first backend that is under its quota is selected.

Example: {"algorithm": "quota_priority", "backends": [{"model_uuid": "...", "quota": 10, "priority": 0}]}
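
To get a feel for how weights shift traffic, the following sketch simulates a naive weighted round robin in the shell. It is an illustration under stated assumptions, not the gateway's scheduler: wrr_picks is a hypothetical function, backend names must not contain spaces, weights must be positive integers, and the gateway might interleave picks differently while keeping the same overall proportions.

```shell
# Hypothetical simulation: print the first COUNT picks of a naive weighted
# round robin. Each remaining argument is NAME:WEIGHT; a bare NAME gets
# weight 1, mirroring the documented default.
wrr_picks() {
  count=$1
  shift
  # Expand every backend into WEIGHT consecutive slots.
  slots=""
  for backend in "$@"; do
    name=${backend%%:*}
    case "$backend" in
      *:*) weight=${backend#*:} ;;
      *) weight=1 ;;
    esac
    i=0
    while [ "$i" -lt "$weight" ]; do
      slots="$slots $name"
      i=$((i + 1))
    done
  done
  # Without any slots there is nothing to cycle through.
  [ -n "$slots" ] || return 1
  # Cycle through the slots until COUNT picks have been printed.
  picked=0
  while [ "$picked" -lt "$count" ]; do
    for name in $slots; do
      [ "$picked" -lt "$count" ] || break
      printf '%s\n' "$name"
      picked=$((picked + 1))
    done
  done
}
```

With a weight of 5 on one backend and the default weight of 1 on another, wrr_picks 12 model-a:5 model-b sends 10 of the 12 picks to model-a.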

Example: Creating a load balancer with round_robin type:

curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/load-balancers" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
  "name": "primary-router",
  "alias": "chat-balanced",
  "algorithm": "round_robin",
  "backends": [
    {"model_uuid": "11111111-1111-1111-1111-111111111111"},
    {"model_uuid": "22222222-2222-2222-2222-222222222222"}
  ]
}'

Example: Creating a load balancer with quota_priority type:

curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/load-balancers" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
  "name": "quota-priority-router",
  "alias": "qp-balanced",
  "algorithm": "quota_priority",
  "backends": [
    {
      "model_uuid": "11111111-1111-1111-1111-111111111111",
      "quota": 10,
      "priority": 0,
      "weight": 1
    },
    {
      "model_uuid": "22222222-2222-2222-2222-222222222222",
      "quota": 5,
      "priority": 1,
      "weight": 1
    }
  ]
}'
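
The selection rule for quota_priority (try backends in priority order and take the first one that is under its quota) can be sketched as follows. This is a simulation under assumptions, not the gateway's implementation: pick_by_quota_priority is a hypothetical function, the current connection count is passed in by the caller, and quota is treated as a strict cap, so a quota of 0 never matches. How the gateway itself treats a quota of 0 is not described here.

```shell
# Hypothetical simulation: print the backend that the quota_priority rule
# would select. Each argument is NAME:QUOTA:PRIORITY:ACTIVE, where ACTIVE is
# the current number of connections to that backend.
pick_by_quota_priority() {
  best_name=""
  best_priority=""
  for backend in "$@"; do
    # Split the argument into its four fields.
    name=${backend%%:*};   rest=${backend#*:}
    quota=${rest%%:*};     rest=${rest#*:}
    priority=${rest%%:*};  active=${rest#*:}
    # Skip backends that are already at their quota.
    [ "$active" -lt "$quota" ] || continue
    # Keep the eligible backend with the lowest priority value.
    if [ -z "$best_name" ] || [ "$priority" -lt "$best_priority" ]; then
      best_name=$name
      best_priority=$priority
    fi
  done
  # Fail when every backend is at its quota.
  [ -n "$best_name" ] || return 1
  printf '%s\n' "$best_name"
}
```

With the quotas and priorities from the example above, pick_by_quota_priority model-a:10:0:10 model-b:5:1:4 prints model-b, because model-a has reached its quota of 10 and model-b is next in priority order.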

Managing rate limits

Set request-based rate limits to control the number of API requests that can be made within a specific time frame. With rate limits, you can prevent excessive workloads from exhausting shared capacity and ensure fair use of capacity across your providers.

Managing rate limits from the model gateway UI

To manage rate limits from the model gateway UI:

Select the Rules tab.
Inside the Rules tab, select the Rate limit tab and then click Create rate limit.
Select the scope for the new rate limit. Choose one of:
- Model
- Provider
- Tenant
Click Add model, select a model, and click Apply. Then click Next.
Set the rate limits and then click Create to create a new rate limit.

Available settings for rate limits

When creating a rate limit, you can configure these types of limits to control API usage:

Request Limits
Token Limits

Note: At least one limit type must be enabled with a rate value greater than 0. You must specify a duration for every enabled limit.

The Enable Request Limit setting controls the number of API requests that are allowed within a specified time period.

Configuration fields:

Request rate (required): Number of requests that are allowed per duration
Capacity: Maximum bucket size in the token bucket algorithm. This setting defines the total number of requests that can accumulate at any given time. The capacity determines:
- Burst Allowance: How many requests can be made immediately when the bucket is full
- Accumulation Limit: Tokens refill up to capacity but never exceed it
- Works with the request rate (tokens added per interval) and duration (refill interval)
Duration: Time window for the rate limit

The Enable Token Limit setting controls the number of tokens that are consumed within a specified time period.

Configuration fields:

Token rate (required): Number of tokens that are allowed per duration
Capacity: Maximum bucket size in the token bucket algorithm. This setting defines the total number of tokens that can accumulate at any given time. The capacity determines:
- Burst Allowance: How many tokens can be consumed immediately when the bucket is full
- Accumulation Limit: Tokens refill up to capacity but never exceed it
- Works with the token rate (tokens added per interval) and duration (refill interval)
Duration: Time window for the token limit

Example configurations:

Request Limit: 1000 requests per 1 hour with burst capacity of 100
Token Limit: 50000 tokens per 30 minutes with burst capacity of 5000

Understanding capacity with an example:

Consider a rate limit configuration with:

Capacity: 100
Request rate: 10
Duration: 1 minute

This configuration means that:

You can make a burst of 100 requests immediately when the bucket is full
The bucket refills at a rate of 10 tokens per minute
After using all 100 tokens, it takes 10 minutes to fully recover (100 tokens ÷ 10 tokens/minute = 10 minutes)
When the bucket reaches 100 tokens, it stops accumulating (the capacity limit)
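
The arithmetic in this example can be reproduced with a few shell functions. This is a minimal sketch of a token bucket, not the gateway's implementation: the function names are hypothetical, and it assumes that the bucket refills in whole-interval steps, whereas a real limiter might refill continuously.

```shell
# Tokens in the bucket after INTERVALS refill intervals, capped at CAPACITY.
bucket_refill() {
  capacity=$1 rate=$2 tokens=$3 intervals=$4
  tokens=$((tokens + rate * intervals))
  if [ "$tokens" -gt "$capacity" ]; then
    tokens=$capacity
  fi
  printf '%s\n' "$tokens"
}

# Tokens left after spending COST; fails when the bucket holds too few tokens.
bucket_take() {
  tokens=$1 cost=$2
  [ "$cost" -le "$tokens" ] || return 1
  printf '%s\n' $((tokens - cost))
}

# Refill intervals needed to get from TOKENS back to CAPACITY, rounded up.
intervals_to_full() {
  capacity=$1 rate=$2 tokens=$3
  printf '%s\n' $(( (capacity - tokens + rate - 1) / rate ))
}
```

With a capacity of 100, a rate of 10, and a 1-minute duration, bucket_take 100 100 leaves 0 tokens, bucket_refill 100 10 0 3 shows 30 tokens after three minutes, and intervals_to_full 100 10 0 returns 10, which matches the 10-minute recovery described above.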

Modifying an existing rate limit

To modify an existing rate limit, in the Rate limit tab, use the search input field to find the rate limit entry and then do one of the following:

To change the rate limit, click Edit, make your updates, and then click Save.
To remove the rate limit, click Delete.

Note:

Some configuration options for the model gateway might be accessible only programmatically and have effects that apply only to programmatic access.

Managing rate limits programmatically

To add a tenant-wide limit, run a command like the following:

curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/rate-limits" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
  "type": "tenant",
  "request": {"capacity": 60, "amount": 10, "duration": "1m"}
}'

To add a provider-scoped limit, run a command like the following:

curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/rate-limits" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
  "type": "provider",
  "provider_uuid": "8c7c5cb7-6b20-4d2e-b7c5-3c0e888b2e2b",
  "request": {"capacity": 30, "amount": 5, "duration": "1m"}
}'

To add a model-scoped limit, run a command like the following:

curl -sS -X POST "https://<region>.<cloud-provider-domain>/ml/gateway/v1/rate-limits" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TOKEN}" \
-d '{
  "type": "model",
  "model_uuid": "7fa969e7-d315-4777-b2f7-5ea5500bf211",
  "request": {"capacity": 15, "amount": 3, "duration": "1m"}
}'
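
The three commands above differ only in the type field and in which UUID field, if any, goes with it. The following sketch builds the matching payload for a given scope. It is an illustration under assumptions: rate_limit_payload is a hypothetical helper, the check that amount is greater than 0 mirrors the UI note about rate values and might not be how the API validates input, and treating amount as the counterpart of the UI request rate is inferred from the examples rather than stated.

```shell
# Hypothetical helper: print a request-limit payload for a scope.
# Arguments: SCOPE (tenant, provider, or model), UUID (empty for tenant),
# CAPACITY, AMOUNT, DURATION (for example 1m).
rate_limit_payload() {
  scope=$1 uuid=$2 capacity=$3 amount=$4 duration=$5
  # Assumption: the rate must be greater than 0 and a duration is required.
  [ "$amount" -gt 0 ] || { echo "amount must be greater than 0" >&2; return 1; }
  [ -n "$duration" ] || { echo "duration is required" >&2; return 1; }
  request=$(printf '{"capacity":%s,"amount":%s,"duration":"%s"}' \
    "$capacity" "$amount" "$duration")
  case "$scope" in
    tenant)
      printf '{"type":"tenant","request":%s}\n' "$request" ;;
    provider|model)
      # Provider and model scopes carry provider_uuid or model_uuid.
      [ -n "$uuid" ] || { echo "$scope scope needs a UUID" >&2; return 1; }
      printf '{"type":"%s","%s_uuid":"%s","request":%s}\n' \
        "$scope" "$scope" "$uuid" "$request" ;;
    *)
      echo "unsupported scope: $scope" >&2; return 1 ;;
  esac
}
```

The output can be piped to curl with -d @- in place of the inline JSON body, for example rate_limit_payload tenant "" 60 10 1m for the tenant-wide limit shown above.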

What to do next

You can now send requests to models through the model gateway. For details, see Inferencing gateway models.

Learn more

For more details on managing the model gateway, see the Model gateway API documentation.