Configuring a WML for z/OS base cluster for high availability

If your machine learning workload is large, mission critical, or both, you can configure your WMLz base core services for high availability by setting up a WMLz base cluster. Each cluster consists of two or more WMLz base instances that run either on a single LPAR or across different LPARs.

Before you begin

A WMLz base instance contains a set of core services for model training, deployment, batch scoring, ingestion, repository, and data connection management. The core services are supported by an active runtime environment, which can be provided by Spark, Python, or both. Making the core services in a WMLz base cluster highly available means keeping one runtime environment active at all times.

  • Decide the type of WMLz base cluster you want to configure. As shown below, you can configure a cluster with multiple WMLz base instances running on the same LPAR (Cluster type 1) or across different LPARs (Cluster type 2).
    Figure 1. WMLz base cluster
  • Set up the TCP SHAREPORT port or the sysplex distributor (SD) port to be used by the cluster. If your WMLz base cluster is type 1, enable SHAREPORT on the core services port. If your cluster is type 2, enable the sysplex distributor port. See the example TCP/IP profile statements after this list.
  • Provision and install additional system capacity to support your WMLz base cluster.

    It is recommended that you plan and start each WMLz base instance in a cluster, regardless of the cluster type, with the basic system capacity as described in Planning system capacity for WML for z/OS base. You can adjust the basic capacity in terms of CPU, memory, or DASD over time based on your machine learning workload.

    If your cluster is type 1 and your initial workload is small, you might be able to share the basic capacity of 1 GCP, 4 zIIPs, and 100 GB of memory across multiple WMLz base instances on the same LPAR because only one runtime environment is active at any given time. You can increase the CPU and memory allocation as your workload grows. However, you must still allocate 100 GB of DASD for each instance from the start.
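
    The following TCP/IP profile statements are a minimal sketch of the port setup for both cluster types. The port number 14150, the jobname prefix WMLZ*, and the distributed DVIPA 10.1.1.1 are placeholders, not product defaults; substitute the values that apply to your environment.

; Cluster type 1: reserve the core services port with SHAREPORT so that
; multiple WMLz base instances on the same LPAR can listen on it
PORT
   14150 TCP WMLZ* SHAREPORT

; Cluster type 2: define a distributed DVIPA and spread connections for
; the core services port across LPARs with the sysplex distributor
VIPADYNAMIC
   VIPADEFINE MOVEABLE IMMEDIATE 255.255.255.0 10.1.1.1
   VIPADISTRIBUTE DEFINE 10.1.1.1 PORT 14150 DESTIP ALL
ENDVIPADYNAMIC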

Procedure

  1. Install and configure the first WMLz base instance by completing all required tasks as described in Installation roadmap.
  2. Retrieve the following configuration information of your first WMLz base instance from the System Configuration page of the administration dashboard:
    • Keystore. All WMLz base instances in a cluster must use the same keystore for secure connections and user authentication. If a RACF® keyring-based keystore is used in the first WMLz base instance, make note of the keyring name, keyring owner, and certificate label. If a file-based keystore is used, retrieve the location of your SSL certificates and the password for the keystore.jks file.
    • Metadata schema. All WMLz base instances in a cluster must use the same metadata schema. Write down the schema name as well as the metadata database, storage group, and buffer pool information.
    • Core services port. All WMLz base instances in a cluster must use the same cluster host IP and WMLz core services port number. If the cluster is type 1 where all instances run on the same LPAR, write down the LPAR IP address and the SHAREPORT number. If the cluster is type 2 where the instances run across different LPARs, write down the sysplex distributor IP address and the SD port number.
  3. Install and configure the second WMLz base instance into the cluster.

    Follow instructions in Configuring WML for z/OS base to configure the new instance. When prompted, make sure that you specify the keystore, metadata schema, and core services port information that you collected in Step 2.

    • On the Authentication page, specify the keystore type and related information used in the first WMLz base instance. If the keystore is RACF keyring-based, specify the keyring name, keyring owner, and certificate label. If the keystore is file-based, specify the same set of certificate and key files. This ensures that all WMLz base instances in the cluster use the same keystore to secure connections and authenticate users.
    • On the Metadata repository page, specify the same metadata schema name, database, storage group, and buffer pool used in the first WMLz base instance. This ensures that all instances in the cluster use the same metadata objects.
    • On the UI and core services page, specify the cluster host IP address and WMLz core services port number. If the cluster is type 1 where all instances run on the same LPAR, specify the LPAR IP as the cluster host IP and the SHAREPORT number as the WMLz core services port. If the cluster is type 2 where the instances run across different LPARs, specify the sysplex IP address as the cluster host IP and the SD port number as the core services port.
  4. Repeat Step 3 to install and configure any additional instance into the cluster.
  5. Complete the cluster setup and start the cluster.
    1. Repeat Steps 3 - 4 to complete the cluster setup.
    2. Verify that all WMLz base instances in the cluster are started and running.
    3. Verify that the runtime environment of one WMLz base instance is active.
  6. Configure the REST API calls of your machine learning application to use the host IP and core services port of your WMLz base cluster.

    For cluster type 1, the cluster host IP is the LPAR IP address and the core services port is the SHAREPORT number. For cluster type 2, the cluster host IP is the sysplex distributor IP address and the core services port is the sysplex distributor (SD) port number.
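
    As a minimal sketch, the following Python code sends a request to the cluster endpoint rather than to an individual instance. The host name, port, path, payload, credentials, and CA file shown here are placeholders, not the documented WMLz API; substitute the endpoint and request format that your deployed model or service actually exposes.

import requests

# Cluster endpoint, not an individual instance:
#   type 1 -> LPAR IP address and SHAREPORT number
#   type 2 -> sysplex distributor IP address and SD port number
CLUSTER_HOST = "wmlz-cluster.example.com"   # placeholder host IP or DNS name
CORE_SERVICES_PORT = 14150                  # placeholder port number

# Placeholder path, payload, and credentials; use the endpoint and request
# format that your deployed model or service actually exposes.
url = f"https://{CLUSTER_HOST}:{CORE_SERVICES_PORT}/your/scoring/endpoint"

response = requests.post(
    url,
    json={"input": [[1.0, 2.0, 3.0]]},
    headers={"Authorization": "Bearer <your-token>"},
    verify="/path/to/cluster-ca.pem",   # CA certificate for the shared cluster keystore
    timeout=30,
)
response.raise_for_status()
print(response.json())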

  7. If the WMLz base instance with the active runtime environment goes down, activate the runtime environment of another instance by using the administration dashboard.
    1. Sign in to the administration dashboard.
    2. From the sidebar, navigate to the System Management - Runtime Environments page.
    3. Select the runtime environment that you want to activate, and then from the ACTIONS menu, click the Connection icon to connect and activate it.
    4. Verify that the new runtime environment is active and the cluster is up and running.
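
    As a minimal sketch of such a check (supplementing, not replacing, the dashboard verification), the following Python code confirms that the cluster endpoint accepts TCP connections again after the failover. The host and port values are placeholders.

import socket

CLUSTER_HOST = "wmlz-cluster.example.com"   # placeholder cluster host IP or DNS name
CORE_SERVICES_PORT = 14150                  # placeholder SHAREPORT or SD port number

def cluster_port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the cluster endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if cluster_port_is_reachable(CLUSTER_HOST, CORE_SERVICES_PORT):
    print("Cluster core services port is accepting connections.")
else:
    print("Cluster core services port is not reachable; check the WMLz base instances.")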