Configuring event-driven scaling for models

If the custom metrics autoscaler is configured for this instance of IBM Software Hub, you can create scaled objects to enable event-driven scaling of model replicas. Event-driven scaling enables the cluster to automatically scale model replicas on existing GPU nodes in response to inferencing requests.

Permissions that you need for this task
You must be either:
  • A cluster administrator
  • An instance administrator
When you need to complete this task
This task is optional. Complete this task only if you want to allow the cluster to scale model replicas in response to inferencing requests.

Before you begin

Best practice: You can run many of the commands in this task exactly as written if you set up environment variables for your installation. For instructions, see Setting up installation environment variables.

Ensure that you source the environment variables before you run the commands in this task.

About this task

If you use Inference foundation models to start and host models, you can configure event-driven automatic scaling for models that support business-critical tasks.

Important: Ensure that you have sufficient GPU resources on your cluster to support automatic scaling of model replicas.

If you host multiple models, ensure that your scaled objects do not prevent other models from creating pods. If the cluster does not have enough GPU resources to support the maximum number of replicas, some pods remain pending until GPU resources become available.
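
As a rough illustration of this capacity check, the following sketch adds up the worst-case GPU demand when every hosted model scales to its maximum replica count. The GPU count, the per-model figures, and the gpu_demand helper are hypothetical and are not part of IBM Software Hub; substitute the values for your own cluster and models.

```shell
#!/bin/bash
# Worst-case GPU demand if every model scales to its maximum replica count.
# All figures below are hypothetical placeholders.

# gpu_demand MAX_REPLICAS GPUS_PER_REPLICA: prints the product of the two values
gpu_demand() {
  echo $(( $1 * $2 ))
}

AVAILABLE_GPUS=8               # hypothetical: allocatable GPUs in the cluster
DEMAND_A=$(gpu_demand 3 2)     # hypothetical model A: 3 replicas x 2 GPUs each
DEMAND_B=$(gpu_demand 2 1)     # hypothetical model B: 2 replicas x 1 GPU each
TOTAL_DEMAND=$(( DEMAND_A + DEMAND_B ))

if [ "${TOTAL_DEMAND}" -gt "${AVAILABLE_GPUS}" ]; then
  echo "Worst case needs ${TOTAL_DEMAND} GPUs but only ${AVAILABLE_GPUS} are available; some pods would stay pending."
else
  echo "Worst case needs ${TOTAL_DEMAND} of ${AVAILABLE_GPUS} GPUs."
fi
```

If the total exceeds the available GPUs, consider lowering MAX_REPLICAS for one or more models before you create the scaled objects.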

To scale a model, you must know the model name. For a complete list of models, see GPU requirements for models.

Procedure

  1. Set the following environment variables:
    1. Set the MODEL_NAME environment variable to the name of the model for which you want to configure event-driven scaling:
      export MODEL_NAME=<model-name>
    2. Set the DEPLOYMENT_NAME environment variable:
      export DEPLOYMENT_NAME="${MODEL_NAME}-predictor"
    3. Set the MIN_REPLICAS environment variable to the minimum number of replicas that should be available at all times:
      export MIN_REPLICAS=<integer>
    4. Set the MAX_REPLICAS environment variable to the maximum number of replicas that will be allowed at any given time:
      export MAX_REPLICAS=<integer>
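
Before you continue, it can help to confirm that the two replica values are consistent. The following sketch is one possible check; the validate_replicas helper is a hypothetical name, not a command that IBM Software Hub provides. The fallback values of 1 and 3 mirror the defaults that the create-scaled-object.sh script in this task uses.

```shell
#!/bin/bash
# validate_replicas MIN MAX
# Returns 0 when both values are non-negative integers and MIN <= MAX.
validate_replicas() {
  local min="$1" max="$2" value
  for value in "${min}" "${max}"; do
    case "${value}" in
      ''|*[!0-9]*)
        echo "MIN_REPLICAS and MAX_REPLICAS must be non-negative integers" >&2
        return 1
        ;;
    esac
  done
  if [ "${min}" -gt "${max}" ]; then
    echo "MIN_REPLICAS (${min}) must not exceed MAX_REPLICAS (${max})" >&2
    return 1
  fi
  return 0
}

# Check the values that are set in the current shell, if any.
validate_replicas "${MIN_REPLICAS:-1}" "${MAX_REPLICAS:-3}" && echo "Replica bounds look consistent."
```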
  2. Verify that the deployment exists:
    oc get deployment ${DEPLOYMENT_NAME} \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
  3. Save the following script on the client workstation as a file named check-metrics-port.sh:
    #!/bin/bash
    
    # Check if metrics port exists
    EXISTS=$(oc get deployment "${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" \
      -o jsonpath="{.spec.template.spec.containers[?(@.name=='watsonx-router-container')].ports[?(@.containerPort==19092)].containerPort}")
    
    # Patch if not exists
    if [[ -z "${EXISTS}" ]]; then
      oc patch deployment "${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" --type='strategic' --patch "
    spec:
      template:
        spec:
          containers:
          - name: watsonx-router-container
            ports:
            - name: metrics
              containerPort: 19092
              protocol: TCP
    "
      oc rollout status deployment/"${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" --timeout=300s
    fi
  4. Make the check-metrics-port.sh script executable, and then run it to update the deployment if the required port does not exist:
    chmod +x check-metrics-port.sh
    ./check-metrics-port.sh
  5. Create a pod monitor to add a scraping job to Prometheus:
    cat << EOF | oc apply -f -
    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: ${MODEL_NAME}-router-monitor
      namespace: ${PROJECT_CPD_INST_OPERANDS}
    spec:
      podMetricsEndpoints:
      - interval: 5s
        path: /metrics
        port: metrics
      selector:
        matchLabels:
          serving.kserve.io/inferenceservice: ${MODEL_NAME}
    EOF
  6. Save the following script on the client workstation as a file named manage-autoscalers.sh:
    #!/bin/bash
    
    # Delete existing HPA if present
    HPA_NAME=$(oc get hpa -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers 2>/dev/null | \
      awk -v dep="${DEPLOYMENT_NAME}" '$1 ~ dep {print $1}')
    
    if [[ -n "${HPA_NAME}" ]]; then
      oc delete hpa "${HPA_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}"
    fi
    
    # Get InferenceService name
    ISVC_NAME=$(oc get deployment "${DEPLOYMENT_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" \
      -o jsonpath='{.metadata.labels.serving\.kserve\.io/inferenceservice}')
    
    # Patch InferenceService for external autoscaling
    if [[ -n "${ISVC_NAME}" ]] && oc get inferenceservice "${ISVC_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" &>/dev/null; then
      CURRENT_ANNOTATION=$(oc get inferenceservice "${ISVC_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" \
        -o jsonpath='{.metadata.annotations.serving\.kserve\.io/autoscalerClass}' 2>/dev/null)
      
      if [[ "${CURRENT_ANNOTATION}" != "external" ]]; then
        oc patch inferenceservice "${ISVC_NAME}" -n "${PROJECT_CPD_INST_OPERANDS}" --type='json' -p='[
          {
            "op": "add",
            "path": "/metadata/annotations/serving.kserve.io~1autoscalerClass",
            "value": "external"
          }
        ]'
        sleep 10
      fi
    fi
  7. Make the manage-autoscalers.sh script executable, and then run it to delete any existing horizontal pod autoscaler for the deployment and to configure the inference service to allow external autoscaling:
    chmod +x manage-autoscalers.sh
    ./manage-autoscalers.sh
  8. Save the following script on the client workstation as a file named get-model-id.sh:
    #!/bin/bash
    
    # Check IAM status
    IAM_ENABLED=$(oc get zenservice lite-cr -n "${PROJECT_CPD_INST_OPERANDS}" \
      -o jsonpath='{.spec.iamIntegration}' 2>/dev/null || echo "false")
    
    # Get CPD route
    CPD_ROUTE=$(oc get route cpd -n "${PROJECT_CPD_INST_OPERANDS}" --template='{{ .spec.host }}')
    CURL_IT_HOST="https://${CPD_ROUTE}"
    
    # Authenticate based on IAM status
    if [[ "${IAM_ENABLED}" == "true" ]]; then
      # IAM authentication
      CURL_IT_USER=$(oc get secret platform-auth-idp-credentials -n "${PROJECT_CPD_INST_OPERANDS}" \
        -o jsonpath='{.data.admin_username}' | base64 --decode)
      CURL_IT_PASS=$(oc get secret platform-auth-idp-credentials -n "${PROJECT_CPD_INST_OPERANDS}" \
        -o jsonpath='{.data.admin_password}' | base64 --decode)
      
      IAM_TOKEN=$(curl -ks -X POST \
        -H "Content-Type: application/x-www-form-urlencoded;charset=UTF-8" \
        -d "grant_type=password&username=${CURL_IT_USER}&password=${CURL_IT_PASS}&scope=openid" \
        "${CURL_IT_HOST}/idprovider/v1/auth/identitytoken" \
        | jq -r '.access_token')
      
      CPD_TOKEN="Bearer $(curl -ks -X GET \
        "${CURL_IT_HOST}/v1/preauth/validateAuth" \
        -H "username: ${CURL_IT_USER}" \
        -H "iam-token: ${IAM_TOKEN}" \
        | jq -r '.accessToken')"
    else
      # Non-IAM authentication
      CURL_IT_USER="admin"
      CURL_IT_PASS=$(oc extract secret/admin-user-details -n "${PROJECT_CPD_INST_OPERANDS}" \
        --keys=initial_admin_password --to=-)
      
      CPD_TOKEN="Bearer $(curl -k -s -X GET \
        "${CURL_IT_HOST}/v1/preauth/validateAuth" \
        -u "${CURL_IT_USER}:${CURL_IT_PASS}" \
        | jq -r '.accessToken')"
    fi
    
    # Get model_id
    export ACTUAL_MODEL_ID=$(curl -k -s "${CURL_IT_HOST}/ml/v1/foundation_model_specs?version=2023-10-25" \
      --max-time 30 \
      --header "Authorization: ${CPD_TOKEN}" \
      --header 'Content-Type: application/json' \
      | jq -r ".resources[] | select(.label == \"${MODEL_NAME}\") | .model_id")
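  9. Source the get-model-id.sh script to get the model ID. Sourcing runs the script in the current shell, so the ACTUAL_MODEL_ID environment variable that the script exports is available to the remaining steps:
    source ./get-model-id.sh
  10. Confirm that the ACTUAL_MODEL_ID environment variable is set:
    echo "${ACTUAL_MODEL_ID}"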
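
The script that creates the scaled object expands several environment variables into YAML, so an empty variable might produce a scaled object that applies cleanly but never matches any metrics. The following sketch is one way to check for that before you continue. The missing_vars helper is a hypothetical name. Treating the string null as missing is an assumption based on how jq -r prints a null field.

```shell
#!/bin/bash
# missing_vars NAME...
# Prints the name of each listed variable that is unset, empty, or "null".
missing_vars() {
  local name value
  for name in "$@"; do
    # Look up the variable whose name is stored in ${name}.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ] || [ "${value}" = "null" ]; then
      echo "${name}"
    fi
  done
}

MISSING=$(missing_vars PROJECT_CPD_INST_OPERANDS DEPLOYMENT_NAME ACTUAL_MODEL_ID)
if [ -n "${MISSING}" ]; then
  echo "Set the following variables before you continue:"
  echo "${MISSING}"
fi
```

If ACTUAL_MODEL_ID is reported as missing, check that MODEL_NAME matches a model label that the foundation model specs API returns.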
  11. Save the following script on the client workstation as a file named create-scaled-object.sh:
    #!/bin/bash
    
    # Set scaling parameters
    MIN_REPLICAS=${MIN_REPLICAS:-1}
    MAX_REPLICAS=${MAX_REPLICAS:-3}
    
    # Get Thanos URL
    THANOS_URL=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
    
    # Create ScaledObject
    cat << EOF | oc apply -f -
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: ${DEPLOYMENT_NAME}-scaler
      namespace: ${PROJECT_CPD_INST_OPERANDS}
    spec:
      scaleTargetRef:
        kind: Deployment
        name: ${DEPLOYMENT_NAME}
      pollingInterval: 10
      cooldownPeriod: 60
      minReplicaCount: ${MIN_REPLICAS}
      maxReplicaCount: ${MAX_REPLICAS}
      advanced:
        restoreToOriginalReplicaCount: true
        horizontalPodAutoscalerConfig:
          behavior:
            scaleDown:
              stabilizationWindowSeconds: 900
              policies:
                - type: Percent
                  periodSeconds: 60
                  value: 50
                - type: Pods
                  periodSeconds: 60
                  value: 5
              selectPolicy: Min
            scaleUp:
              policies:
                - type: Percent
                  periodSeconds: 300
                  value: 100
                - type: Pods
                  periodSeconds: 300
                  value: 5
              selectPolicy: Min
        scalingModifiers:
          formula: (((p95InQueueValue / 5000 ) >= 1 || (avgLoadPercentValue / 0.9 ) < 1 ) ? max ((p95InQueueValue / 5000 ), (avgLoadPercentValue / 0.9 ) ):1)
          target: "1.0"
          metricType: "Value"
      triggers:
      - type: prometheus
        name: p95InQueueValue
        useCachedMetrics: true
        metadata:
          authModes: bearer
          namespace: ${PROJECT_CPD_INST_OPERANDS}
          serverAddress: https://${THANOS_URL}
          metricName: "p95_in_queue_ms_1m"
          query: "max(quantile_over_time(0.95, wx_router_in_queue_duration_msec{model_id=~\"${ACTUAL_MODEL_ID}\", sla_value!=\"na\"}[1m])) or on() vector(0)"
          threshold: "5000"
          activationThreshold: "0.1"
          unsafeSsl: "true"
        authenticationRef:
          name: keda-thanos-auth
      - type: prometheus
        name: avgLoadPercentValue
        useCachedMetrics: true
        metadata:
          authModes: bearer
          namespace: ${PROJECT_CPD_INST_OPERANDS}
          serverAddress: https://${THANOS_URL}
          metricName: "avg_load_percent_1m"
          query: "avg(avg_over_time(wx_router_receiver_load{model_id=~\"${ACTUAL_MODEL_ID}\"}[1m])) or on() vector(0)"
          threshold: "0.9"
          activationThreshold: "0.01"
          unsafeSsl: "true"
        authenticationRef:
          name: keda-thanos-auth
    EOF
  12. Make the create-scaled-object.sh script executable, and then run it to create the scaled object:
    chmod +x create-scaled-object.sh
    ./create-scaled-object.sh
  13. Optional: To confirm that the scaled object is working correctly, use the following watch command to monitor the resources created by the configuration as users submit inference requests:
    watch -n 3 'oc get scaledobject,hpa,deployment -n ${PROJECT_CPD_INST_OPERANDS}'

    You can exit the command by pressing Ctrl+C.
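
The scalingModifiers formula in the create-scaled-object.sh script combines the two Prometheus triggers into one composite value, which is compared with the target of 1.0. The following sketch reproduces that formula with awk so that you can see the value it would produce for sample metric readings. The composite_metric helper and the sample readings are illustrative assumptions. The sketch does not query Prometheus and does not model the stabilization window or the scale-up and scale-down policies.

```shell
#!/bin/bash
# composite_metric P95_IN_QUEUE_MS AVG_LOAD
# Evaluates the formula from the scaled object:
#   ((p95/5000 >= 1 || load/0.9 < 1) ? max(p95/5000, load/0.9) : 1)
composite_metric() {
  awk -v q="$1" -v l="$2" 'BEGIN {
    queue_ratio = q / 5000    # 1 or more means the p95 queue time reached 5000 ms
    load_ratio  = l / 0.9     # less than 1 means the average load is below 0.9
    if (queue_ratio >= 1 || load_ratio < 1) {
      m = (queue_ratio > load_ratio) ? queue_ratio : load_ratio
    } else {
      m = 1
    }
    printf "%.3f\n", m
  }'
}

# Hypothetical readings: p95 queue time in milliseconds, then average load.
echo "long queue, moderate load: $(composite_metric 10000 0.5)"
echo "short queue, high load:    $(composite_metric 1000 0.95)"
echo "short queue, low load:     $(composite_metric 1000 0.45)"
```

Read this way, the formula appears to request more replicas when the 95th percentile queue time reaches 5000 ms, to allow fewer replicas when the average load is below 0.9, and to hold the replica count steady (a value of 1) when the load is high but requests are not queuing.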