Run with Kubernetes and KServe ModelMesh Serving

IBM recommends KServe ModelMesh Serving for serving Watson NLP models. KServe is a Kubernetes-based platform for ML model inference. It supports several standard ML model formats, including TensorFlow, PyTorch ScriptModule, ONNX, scikit-learn, XGBoost, LightGBM, and OpenVINO IR.

KServe can also be extended to support custom runtimes with arbitrary model formats, such as the Watson NLP runtime. KServe ModelMesh Serving is a recently added feature intended to increase KServe's scalability. It is designed for deployments with large numbers of models, where the set of deployed models changes frequently. It loads and unloads models to balance responsiveness to users against computational footprint.

The basic setup steps to get running are to:

  1. Provision external etcd and S3 resources
  2. Install KServe ModelMesh Serving onto your cluster
  3. Deploy your Watson NLP ServingRuntime
  4. Deploy a model upload job for each pretrained model you want to serve
  5. Deploy an InferenceService for each model

This document walks through a basic tutorial for getting up and running with KServe. The resulting deployments are intended for experimentation and demos and are not suitable for a production environment.

Installing KServe ModelMesh Serving

See the KServe ModelMesh Serving installation instructions for details on how to install KServe with ModelMesh onto your cluster.

Installation requires a Kubernetes cluster. You will need cluster-admin authority to complete all the prescribed steps. You should also be familiar with the concept of custom resources. A standard installation also assumes you have access to etcd and S3 storage.

Create an image pull secret

You will need to create a pull secret with your entitlement key to pull images from the entitled registry. See Accessing the files and Pull an image from a private registry.

Create an image pull secret named ibm-entitlement-key, and then add a new ServiceAccount that references the pull secret.

apiVersion: v1
imagePullSecrets:
  - name: ibm-entitlement-key
kind: ServiceAccount
metadata:
  name: pull-secret-sa
  namespace: modelmesh-serving
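To illustrate what the pull secret itself holds, the following sketch builds the `.dockerconfigjson` payload stored in a Kubernetes secret of type `kubernetes.io/dockerconfigjson`. The registry host `cp.icr.io`, the user name `cp`, and the function name are assumptions made for illustration only; use the values given in the entitled registry instructions.

```python
import base64
import json


def docker_config_json(registry: str, username: str, password: str) -> str:
    """Return the JSON document stored under `.dockerconfigjson` in a pull secret."""
    # The `auth` field is base64("username:password"), the form Docker clients expect.
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Assumed values: replace with your registry host and entitlement key.
    print(docker_config_json("cp.icr.io", "cp", "<your-entitlement-key>"))
```

In practice, `kubectl create secret docker-registry ibm-entitlement-key` with the `--docker-server`, `--docker-username`, and `--docker-password` flags generates an equivalent payload, so hand-building it is rarely necessary.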

Update the Model Serving config

The modelmesh-serving controller has a number of configuration options, described in the ModelMesh Serving configuration documentation.

For this tutorial, we will disable the KServe REST Proxy, which is not currently compatible with the Watson NLP Runtime. We will also configure the controller to use the new pull-secret-sa service account so that our pods can access the entitled registry.

apiVersion: v1
kind: ConfigMap
metadata:
  name: model-serving-config
data:
  config.yaml: |
    #Sample config overrides
    serviceAccountName: pull-secret-sa
    restProxy:
      enabled: false

Create a serving runtime

A serving runtime is a template for a pod that can serve one or more particular model formats. The following sample will create a simple serving runtime for Watson NLP models.

Notice a few important overrides here:

  • The metrics port is changed so that it does not conflict with ModelMesh's puller container
  • The command line arguments are overridden to start the gRPC server directly, instead of booting both the gRPC and REST servers

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: watson-nlp-runtime
spec:
  containers:
  - env:
      - name: ACCEPT_LICENSE
        value: "true"
      - name: LOG_LEVEL
        value: info
      - name: CAPACITY
        value: "6000000000"
      - name: DEFAULT_MODEL_SIZE
        value: "500000000"
      - name: METRICS_PORT
        value: "2113"
    args:
      - --
      - python3
      - -m
      - watson_runtime.grpc_server
    image: <watson-nlp-runtime-image>  # placeholder: use the runtime image from the entitled registry
    imagePullPolicy: IfNotPresent
    name: watson-nlp-runtime
    resources:
      limits:
        cpu: 2
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi
  grpcDataEndpoint: port:8085
  grpcEndpoint: port:8085
  multiModel: true
  storageHelper:
    disabled: false
  supportedModelFormats:
    - autoSelect: true
      name: watson-nlp

You should be able to see the status of the runtime with:

kubectl get servingruntimes

You will also be able to see the pods spin up for the runtime, and inspect them directly for any debugging and troubleshooting.

Upload pretrained models to S3

The pretrained model containers can run as S3 upload jobs that reference a KServe storage config secret. For each model that you want to serve, you can deploy an upload job like the following:

apiVersion: batch/v1
kind: Job
metadata:
  name: model-upload
spec:
  template:
    spec:
      containers:
        - name: syntax-izumo-en-stock
          image: <pretrained-model-image>  # placeholder: use the image of the pretrained model to upload
          env:
            - name: UPLOAD
              value: "true"
            - name: ACCEPT_LICENSE
              value: "true"
            - name: S3_CONFIG_FILE
              value: /storage-config/localMinIO
            - name: UPLOAD_PATH
              value: models
          volumeMounts:
            - mountPath: /storage-config
              name: storage-config
              readOnly: true
      volumes:
        - name: storage-config
          secret:
            defaultMode: 420
            secretName: storage-config
      restartPolicy: Never
      serviceAccountName: pull-secret-sa
  backoffLimit: 2

Note that this assumes your storage-config secret exists, and that the localMinIO key exists within it. This should have been created by the quickstart install of modelmesh-serving. You can configure other remote storage locations in that secret and reference them here.

Note also that UPLOAD_PATH is set to models, so each model is uploaded to the ${bucket}/models/${model_name} path.
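The relationship between UPLOAD_PATH and the final object location can be sketched as a small helper. The function names and the bucket name `my-bucket` are hypothetical; the path layout is the `${bucket}/${UPLOAD_PATH}/${model_name}` pattern described above, and the model name is the one used in the InferenceService `path` later in this document.

```python
def uploaded_model_path(upload_path: str, model_name: str) -> str:
    """Path of the model inside the bucket (the part after `${bucket}/`)."""
    # Strip stray slashes so "models/" and "models" produce the same result.
    return f"{upload_path.strip('/')}/{model_name}"


def full_object_prefix(bucket: str, upload_path: str, model_name: str) -> str:
    """Full `${bucket}/${upload_path}/${model_name}` prefix of the uploaded model."""
    return f"{bucket}/{uploaded_model_path(upload_path, model_name)}"


if __name__ == "__main__":
    # "my-bucket" is a made-up bucket name used only for this example.
    print(full_object_prefix("my-bucket", "models", "syntax_izumo_lang_en_stock"))
```

The value returned by `uploaded_model_path` is what an InferenceService would reference as its storage path, which is why the upload path and the predictor's storage settings have to agree.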

Create an InferenceService predictor for models

InferenceServices represent a logical endpoint for serving predictions using a particular model. Watson NLP models must be stored in an S3 compatible object store to be served by KServe ModelMesh Serving.

For each model that you want to serve, create an InferenceService like the following:

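apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: syntax-izumo-en
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: watson-nlp
      storage:
        path: models/syntax_izumo_lang_en_stock
        key: localMinIO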

Note that the storage config should match the location where the model was uploaded.
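When several models are served, writing one manifest per model by hand becomes repetitive. The sketch below generates an InferenceService manifest as a Python dictionary and prints it as JSON. The function name and the `models` mapping are hypothetical, and the field layout follows the KServe `v1beta1` InferenceService schema as understood here, so check it against your installed KServe version.

```python
import json


def inference_service(name: str, model_path: str, storage_key: str = "localMinIO") -> dict:
    """Build an InferenceService manifest for a Watson NLP model served via ModelMesh."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {
            "name": name,
            # Routes the predictor to ModelMesh rather than a dedicated deployment.
            "annotations": {"serving.kserve.io/deploymentMode": "ModelMesh"},
        },
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": "watson-nlp"},
                    # Must match the storage-config key and path used by the upload job.
                    "storage": {"path": model_path, "key": storage_key},
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical mapping of InferenceService name to uploaded model path.
    models = {"syntax-izumo-en": "models/syntax_izumo_lang_en_stock"}
    for name, path in models.items():
        print(json.dumps(inference_service(name, path), indent=2))
```

Each printed document is a JSON manifest, which `kubectl apply -f` accepts in place of YAML.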

Once the model has loaded successfully, its READY status is True when checked with the following command:

kubectl get inferenceservice
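For scripted checks, the same information can be read from the JSON output of that command. This is a minimal sketch that assumes each InferenceService reports a condition of type `Ready` under `status.conditions`, which is the usual KServe convention; the function names are hypothetical, and `fetch_status` additionally assumes `kubectl` is on the PATH and configured for your cluster.

```python
import json
import subprocess


def ready_by_name(kubectl_json: str) -> dict:
    """Map each InferenceService name to True/False based on its Ready condition."""
    result = {}
    for item in json.loads(kubectl_json).get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        # A service with no Ready condition yet is reported as not ready.
        result[item["metadata"]["name"]] = any(
            c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
        )
    return result


def fetch_status(namespace: str = "modelmesh-serving") -> dict:
    """Run kubectl and summarize readiness for every InferenceService in a namespace."""
    completed = subprocess.run(
        ["kubectl", "get", "inferenceservice", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return ready_by_name(completed.stdout)
```

Calling `fetch_status()` in a retry loop is one way to wait for a model to finish loading before sending requests.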

Querying your InferenceService

IMPORTANT: A key difference from other deployment modes is that you must query your models via ModelMesh instead of invoking the runtime API directly. This means that:

  • Only the gRPC API is supported
  • The mm-vmodel-id metadata key must be supplied with the name of the InferenceService to query, instead of the mm-model-id metadata header.

First, port-forward the model-mesh service:

kubectl port-forward service/modelmesh-serving 8033 -n modelmesh-serving

Then use the Python client library to query the new InferenceService. (See the instructions for installing the client library.)

import grpc
from watson_nlp_runtime_client import (
    common_service_pb2,
    common_service_pb2_grpc,
    syntax_types_pb2,
)

# No TLS
# Note the 8033 port to talk to model-mesh directly
channel = grpc.insecure_channel("localhost:8033")

stub = common_service_pb2_grpc.NlpServiceStub(channel)

request = common_service_pb2.SyntaxRequest(
    raw_document=syntax_types_pb2.RawDocument(text="This is a test"),
    parsers=("sentence", "token", "part_of_speech", "lemma", "dependency"),
)

# Note the `mm-vmodel-id` header with the name of the InferenceService
response = stub.SyntaxPredict(
    request, metadata=[("mm-vmodel-id", "syntax-izumo-en")]
)
print(response)
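The header rule described above is easy to get wrong when the same client code targets both ModelMesh and a standalone runtime. A small helper can make the choice explicit. The function name and its `via_modelmesh` parameter are hypothetical; the two header names are the ones named in this document.

```python
def routing_metadata(target: str, via_modelmesh: bool = True) -> list:
    """Build the gRPC metadata that selects which model handles a request.

    Through ModelMesh, `target` is the InferenceService name and is sent as
    `mm-vmodel-id`; when calling the runtime directly it is sent as `mm-model-id`.
    """
    key = "mm-vmodel-id" if via_modelmesh else "mm-model-id"
    return [(key, target)]


if __name__ == "__main__":
    print(routing_metadata("syntax-izumo-en"))
```

The returned list can be passed as the `metadata` argument of a stub call such as `SyntaxPredict`.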


Other Resources

To see a tutorial that takes you through the steps to deploy a Watson NLP model to the KServe ModelMesh Serving sandbox environment on IBM Technology Zone (TechZone), check out Deploy a Watson NLP Model to KServe ModelMesh Serving on GitHub.

Once you have your runtime server working, see Accessing client libraries and tools to continue.