Configuring event-driven scaling for models

If the custom metrics autoscaler is configured for this instance of IBM Software Hub, you can create scaled objects to enable event-driven scaling of model replicas. Event-driven scaling enables the cluster to automatically scale model replicas on existing GPU nodes in response to inferencing requests.

Permissions that you need for this task

You must be either:

A cluster administrator
An instance administrator

When you need to complete this task

This task is optional. Complete this task only if you want to allow the cluster to scale model replicas in response to inferencing requests.

Before you begin

A cluster administrator must complete the following tasks:

Best practice: You can run many of the commands in this task exactly as written if you set up environment variables for your installation. For instructions, see Setting up installation environment variables.

Ensure that you source the environment variables before you run the commands in this task.

About this task

If you use Inference foundation models to start and host models, you can configure event-driven automatic scaling for models that support business-critical tasks.

Important: Ensure that your cluster has sufficient GPU resources to support automatic scaling of model replicas.

If you host multiple models, ensure that your scaled objects do not prevent other models from creating pods. If the cluster does not have sufficient GPU resources to support the maximum number of replicas, some pods remain pending until GPU resources become available.
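As a rough planning aid, you can estimate the GPU demand at full scale by multiplying each model's maximum replica count by the number of GPUs that one replica requests, and then adding the results. The following sketch assumes a hypothetical helper named total_required_gpus; the per-replica GPU counts are values that you supply yourself, for example from GPU requirements for models.

```shell
# Hypothetical helper (not part of the product): estimate the total number of
# GPUs needed when every model is scaled to its maximum replica count.
# Each argument has the form <max-replicas>:<gpus-per-replica>.
total_required_gpus() {
  local total=0 pair replicas gpus
  for pair in "$@"; do
    replicas="${pair%%:*}"   # text before the colon
    gpus="${pair##*:}"       # text after the colon
    total=$(( total + replicas * gpus ))
  done
  echo "${total}"
}
```

For example, `total_required_gpus 3:2 2:1` prints 8, which you can compare with the number of GPUs that are available on the cluster.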

To scale a model, you must know the model name. For a complete list of models, see GPU requirements for models.

Procedure

Set the following environment variables:
1. Set the MODEL_NAME environment variable to the name of the model for which you want to configure event-driven scaling:
```
export MODEL_NAME=<model-name>
```
2. Set the DEPLOYMENT_NAME environment variable:
```
export DEPLOYMENT_NAME="${MODEL_NAME}-predictor"
```
3. Set the MIN_REPLICAS environment variable to the minimum number of replicas that should be available at all times:
```
export MIN_REPLICAS=<integer>
```
4. Set the MAX_REPLICAS environment variable to the maximum number of replicas that will be allowed at any given time:
```
export MAX_REPLICAS=<integer>
```
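Before you continue, it can help to confirm that the replica values are usable. The following is a minimal sketch, assuming hypothetical helper names (is_integer and validate_replicas); it checks only that both values are non-negative integers and that the minimum does not exceed the maximum.

```shell
# Hypothetical helpers: sanity-check MIN_REPLICAS and MAX_REPLICAS.
is_integer() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
    *) return 0 ;;
  esac
}

validate_replicas() {
  local min="$1" max="$2"
  if ! is_integer "${min}" || ! is_integer "${max}"; then
    echo "MIN_REPLICAS and MAX_REPLICAS must be non-negative integers" >&2
    return 1
  fi
  if [ "${min}" -gt "${max}" ]; then
    echo "MIN_REPLICAS must not be greater than MAX_REPLICAS" >&2
    return 1
  fi
  return 0
}
```

You might call it as `validate_replicas "${MIN_REPLICAS}" "${MAX_REPLICAS}"` and continue only if it returns successfully.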

Verify that the deployment exists:

```
oc get deployment ${DEPLOYMENT_NAME} \
--namespace=${PROJECT_CPD_INST_OPERANDS}
```
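If you want later steps in a script of your own to stop when the deployment is missing, you can wrap the same check in a function. This is a minimal sketch; require_deployment is a hypothetical helper name, and it relies only on the exit code of the oc get command.

```shell
# Hypothetical helper: fail early if the deployment cannot be retrieved.
# $1 = deployment name, $2 = project (namespace).
require_deployment() {
  local name="$1" namespace="$2"
  if ! oc get deployment "${name}" --namespace="${namespace}" >/dev/null 2>&1; then
    echo "Deployment ${name} not found in ${namespace}" >&2
    return 1
  fi
  echo "Deployment ${name} found in ${namespace}"
}
```

A typical call would be `require_deployment "${DEPLOYMENT_NAME}" "${PROJECT_CPD_INST_OPERANDS}"`.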

Save the following script on the client workstation as a file named check-metrics-port.sh:

```
#!/bin/bash

# Check if metrics port exists
EXISTS=$(oc get deployment "${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" \
  -o jsonpath="{.spec.template.spec.containers[?(@.name=='watsonx-router-container')].ports[?(@.containerPort==19092)].containerPort}")

# Patch if not exists
if [[ -z "${EXISTS}" ]]; then
  oc patch deployment "${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" --type='strategic' --patch "
spec:
  template:
    spec:
      containers:
      - name: watsonx-router-container
        ports:
        - name: metrics
          containerPort: 19092
          protocol: TCP
"
  oc rollout status deployment/"${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" --timeout=300s
fi
```

Run the check-metrics-port.sh script to update the deployment if the required port does not exist:
```
./check-metrics-port.sh
```
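The scripts in this task are run as ./<script-name>, which on most systems requires the file to be executable. The following sketch assumes that you saved the scripts in the current directory; make_executable is a hypothetical helper name.

```shell
# Hypothetical helper: mark saved scripts as executable so that they can be
# run as ./<script-name>. Files that do not exist are skipped with a message.
make_executable() {
  local script
  for script in "$@"; do
    if [ -f "${script}" ]; then
      chmod +x "${script}"
    else
      echo "Skipping ${script}: file not found" >&2
    fi
  done
}
```

For example, `make_executable check-metrics-port.sh` prepares the first script; you can pass the other script names in the same call after you save them.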

Create a pod monitor to add a scraping job to Prometheus:

```
cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ${MODEL_NAME}-router-monitor
  namespace: ${PROJECT_CPD_INST_OPERANDS}
spec:
  podMetricsEndpoints:
  - interval: 5s
    path: /metrics
    port: metrics
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: ${MODEL_NAME}
EOF
```
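The selector in the pod monitor matches pods by the serving.kserve.io/inferenceservice label. To see which pods the scraping job is expected to cover, you can list the pods that carry the same label. The list_monitored_pods function is a hypothetical helper, and the sketch assumes that the label value is the same model name that you used in the pod monitor.

```shell
# Hypothetical helper: list the pods that the pod monitor's label selector matches.
# $1 = model name (label value), $2 = project (namespace).
list_monitored_pods() {
  local model="$1" namespace="$2"
  oc get pods --namespace="${namespace}" \
    --selector="serving.kserve.io/inferenceservice=${model}"
}
```

A typical call would be `list_monitored_pods "${MODEL_NAME}" "${PROJECT_CPD_INST_OPERANDS}"`. If the command returns no pods, the pod monitor has nothing to scrape.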

Save the following script on the client workstation as a file named manage-autoscalers.sh:

```
#!/bin/bash

# Delete existing HPA if present
HPA_NAME=$(oc get hpa -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers 2>/dev/null | \
  awk -v dep="${DEPLOYMENT_NAME}" '$1 ~ dep {print $1}')

if [[ -n "${HPA_NAME}" ]]; then
  oc delete hpa "${HPA_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}"
fi

# Get InferenceService name
ISVC_NAME=$(oc get deployment "${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" \
  -o jsonpath='{.metadata.labels.serving\.kserve\.io/inferenceservice}')

# Patch InferenceService for external autoscaling
if [[ -n "${ISVC_NAME}" ]] && oc get inferenceservice "${ISVC_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" &>/dev/null; then
  CURRENT_ANNOTATION=$(oc get inferenceservice "${ISVC_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" \
    -o jsonpath='{.metadata.annotations.serving\.kserve\.io/autoscalerClass}' 2>/dev/null)

  if [[ "${CURRENT_ANNOTATION}" != "external" ]]; then
    oc patch inferenceservice "${ISVC_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" --type='json' -p='[
      {
        "op": "add",
        "path": "/metadata/annotations/serving.kserve.io~1autoscalerClass",
        "value": "external"
      }
    ]'
    sleep 10
  fi
fi
```

Run the manage-autoscalers.sh script to delete any existing horizontal pod autoscaling configurations for the deployment and to configure the inference service to allow external autoscaling:
```
./manage-autoscalers.sh
```
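The JSON patch in the script writes the annotation key as serving.kserve.io~1autoscalerClass because JSON Pointer paths (RFC 6901) use / as a separator, so a literal / in a key is escaped as ~1 and a literal ~ is escaped as ~0. The following sketch illustrates that escaping; json_pointer_escape is a hypothetical helper and is not part of the script.

```shell
# Hypothetical helper: escape one key for use as a JSON Pointer path segment.
# The tilde must be escaped first so that the ~1 sequences that are produced
# for slashes are not escaped a second time.
json_pointer_escape() {
  printf '%s\n' "$1" | sed -e 's/~/~0/g' -e 's|/|~1|g'
}
```

For example, `json_pointer_escape 'serving.kserve.io/autoscalerClass'` prints serving.kserve.io~1autoscalerClass, which is the last segment of the path that the script uses.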

Save the following script on the client workstation as a file named get-model-id.sh:

```
#!/bin/bash

# Check IAM status
IAM_ENABLED=$(oc get zenservice lite-cr -n "${PROJECT_CPD_INST_OPERANDS}" \
  -o jsonpath='{.spec.iamIntegration}' 2>/dev/null || echo "false")

# Get CPD route
CPD_ROUTE=$(oc get route cpd -n "${PROJECT_CPD_INST_OPERANDS}" --template='{{ .spec.host }}')
CURL_IT_HOST="https://${CPD_ROUTE}"

# Authenticate based on IAM status
if [[ "${IAM_ENABLED}" == "true" ]]; then
  # IAM authentication
  CURL_IT_USER=$(oc get secret platform-auth-idp-credentials -n "${PROJECT_CPD_INST_OPERANDS}" \
    -o jsonpath='{.data.admin_username}' | base64 --decode)
  CURL_IT_PASS=$(oc get secret platform-auth-idp-credentials -n "${PROJECT_CPD_INST_OPERANDS}" \
    -o jsonpath='{.data.admin_password}' | base64 --decode)

  IAM_TOKEN=$(curl -ks -X POST \
    -H "Content-Type: application/x-www-form-urlencoded;charset=UTF-8" \
    -d "grant_type=password&username=${CURL_IT_USER}&password=${CURL_IT_PASS}&scope=openid" \
    "${CURL_IT_HOST}/idprovider/v1/auth/identitytoken" \
    | jq -r '.access_token')

  CPD_TOKEN="Bearer $(curl -ks -X GET \
    "${CURL_IT_HOST}/v1/preauth/validateAuth" \
    -H "username: ${CURL_IT_USER}" \
    -H "iam-token: ${IAM_TOKEN}" \
    | jq -r '.accessToken')"
else
  # Non-IAM authentication
  CURL_IT_USER="admin"
  CURL_IT_PASS=$(oc extract secret/admin-user-details -n "${PROJECT_CPD_INST_OPERANDS}" \
    --keys=initial_admin_password --to=-)

  CPD_TOKEN="Bearer $(curl -k -s -X GET \
    "${CURL_IT_HOST}/v1/preauth/validateAuth" \
    -u "${CURL_IT_USER}:${CURL_IT_PASS}" \
    | jq -r '.accessToken')"
fi

# Get model_id
export ACTUAL_MODEL_ID=$(curl -k -s "${CURL_IT_HOST}/ml/v1/foundation_model_specs?version=2023-10-25" \
  --max-time 30 \
  --header "Authorization: ${CPD_TOKEN}" \
  --header 'Content-Type: application/json' \
  | jq -r ".resources[] | select(.label == \"${MODEL_NAME}\") | .model_id")
```

Source the get-model-id.sh script to get the model ID. Source the script rather than running it as ./get-model-id.sh, because a script that runs in a child shell cannot export the ACTUAL_MODEL_ID environment variable to your current shell:
```
source ./get-model-id.sh
```
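The scaled object queries depend on the ACTUAL_MODEL_ID value, so it is worth confirming that the lookup returned something before you continue. The following is a minimal sketch; require_model_id is a hypothetical helper name, and the check for the string null assumes that jq prints null when the model_id field is missing.

```shell
# Hypothetical helper: confirm that a model ID was returned.
# $1 = the value of ACTUAL_MODEL_ID.
require_model_id() {
  local model_id="${1:-}"
  if [ -z "${model_id}" ] || [ "${model_id}" = "null" ]; then
    echo "ACTUAL_MODEL_ID is not set; check MODEL_NAME and the authentication steps" >&2
    return 1
  fi
  echo "Model ID: ${model_id}"
}
```

You might call it as `require_model_id "${ACTUAL_MODEL_ID}"` in the same shell in which you sourced the script.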

Save the following script on the client workstation as a file named create-scaled-object.sh:

```
#!/bin/bash

# Set scaling parameters
MIN_REPLICAS=${MIN_REPLICAS:-1}
MAX_REPLICAS=${MAX_REPLICAS:-3}

# Get Thanos URL
THANOS_URL=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')

# Create ScaledObject
cat << EOF | oc apply -f -
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ${DEPLOYMENT_NAME}-scaler
  namespace: ${PROJECT_CPD_INST_OPERANDS}
spec:
  scaleTargetRef:
    kind: Deployment
    name: ${DEPLOYMENT_NAME}
  pollingInterval: 10
  cooldownPeriod: 60
  minReplicaCount: ${MIN_REPLICAS}
  maxReplicaCount: ${MAX_REPLICAS}
  advanced:
    restoreToOriginalReplicaCount: true
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 900
          policies:
            - type: Percent
              periodSeconds: 60
              value: 50
            - type: Pods
              periodSeconds: 60
              value: 5
          selectPolicy: Min
        scaleUp:
          policies:
            - type: Percent
              periodSeconds: 300
              value: 100
            - type: Pods
              periodSeconds: 300
              value: 5
          selectPolicy: Min
    scalingModifiers:
      formula: (((p95InQueueValue / 5000 ) >= 1 || (avgLoadPercentValue / 0.9 ) < 1 ) ? max ((p95InQueueValue / 5000 ), (avgLoadPercentValue / 0.9 ) ):1)
      target: "1.0"
      metricType: "Value"
  triggers:
  - type: prometheus
    name: p95InQueueValue
    useCachedMetrics: true
    metadata:
      authModes: bearer
      namespace: ${PROJECT_CPD_INST_OPERANDS}
      serverAddress: https://${THANOS_URL}
      metricName: "p95_in_queue_ms_1m"
      query: "max(quantile_over_time(0.95, wx_router_in_queue_duration_msec{model_id=~\"${ACTUAL_MODEL_ID}\", sla_value!=\"na\"}[1m])) or on() vector(0)"
      threshold: "5000"
      activationThreshold: "0.1"
      unsafeSsl: "true"
    authenticationRef:
      name: keda-thanos-auth
  - type: prometheus
    name: avgLoadPercentValue
    useCachedMetrics: true
    metadata:
      authModes: bearer
      namespace: ${PROJECT_CPD_INST_OPERANDS}
      serverAddress: https://${THANOS_URL}
      metricName: "avg_load_percent_1m"
      query: "avg(avg_over_time(wx_router_receiver_load{model_id=~\"${ACTUAL_MODEL_ID}\"}[1m])) or on() vector(0)"
      threshold: "0.9"
      activationThreshold: "0.01"
      unsafeSsl: "true"
    authenticationRef:
      name: keda-thanos-auth
EOF
```

Run the create-scaled-object.sh script to create the scaled object:
```
./create-scaled-object.sh
```
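The scalingModifiers formula in the script combines the two triggers into a single value that is compared with the target of 1.0. To get a feel for how the formula behaves, the following sketch reproduces the same arithmetic with awk. The scaling_metric function is a hypothetical helper for experimentation only. It takes a p95 in-queue time in milliseconds and an average load value as arguments, and it approximates how the formula is evaluated; it is not the autoscaler's implementation.

```shell
# Hypothetical helper: evaluate the scalingModifiers formula for sample inputs.
# $1 = p95 in-queue duration in milliseconds, $2 = average receiver load.
scaling_metric() {
  awk -v queue_ms="$1" -v load="$2" 'BEGIN {
    queue_ratio = queue_ms / 5000   # 1.0 or more means the queue threshold is reached
    load_ratio  = load / 0.9        # 1.0 or more means the load threshold is reached
    if (queue_ratio >= 1 || load_ratio < 1) {
      metric = (queue_ratio > load_ratio) ? queue_ratio : load_ratio
    } else {
      metric = 1
    }
    printf "%.2f\n", metric
  }'
}
```

For example, `scaling_metric 10000 0.5` prints 2.00 because the queue time is twice the 5000 ms threshold, and `scaling_metric 1000 0.95` prints 1.00 because neither condition in the formula is met.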
Optional: To confirm that the scaled object is working correctly, use the following watch command to monitor the resources created by the configuration as users submit inference requests:
```
watch -n 3 'oc get scaledobject,hpa,deployment -n ${PROJECT_CPD_INST_OPERANDS}'
```
You can exit the command by pressing Ctrl+C.
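For a one-time check instead of a continuous watch, you can read the Ready condition that KEDA reports in the status of the scaled object. The following is a minimal sketch; scaled_object_ready is a hypothetical helper name, and the sketch assumes that the scaled object exposes a condition of type Ready in its status.

```shell
# Hypothetical helper: succeed only if the scaled object reports Ready=True.
# $1 = scaled object name, $2 = project (namespace).
scaled_object_ready() {
  local name="$1" namespace="$2" status
  status=$(oc get scaledobject "${name}" --namespace="${namespace}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "${status}" = "True" ]
}
```

A typical call would be `scaled_object_ready "${DEPLOYMENT_NAME}-scaler" "${PROJECT_CPD_INST_OPERANDS}" && echo "Scaled object is ready"`.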