Planning system capacity for WML for z/OS

WML for z/OS performs best when adequate system capacity is available in terms of servers, processors, memory, and disk space. Allocate sufficient capacity for WMLz on your z/OS system to meet the demands of your enterprise machine learning workload.

Basic system capacity

WMLz requires the following basic system capacity:

Table 1. Minimum system capacity for WMLz

Required for | Hardware | Number of LPARs/Servers | CPU (per LPAR/server) | Memory (per LPAR/server) | DASD/Disk space (per LPAR/server)
Installing and running WMLz on z/OS¹ | Z system | 1 LPAR² | 4 zIIPs, 1 GCP | 100 GB | 100 GB³
Notes:
  • ¹ While you can use the basic system capacity to run any reasonable workload, the rule of thumb is that the heavier the workload, the more capacity you should allocate. To maximize the value of your WML for z/OS, allocate the right system and the right capacity based on the actual workload.
  • ² For best performance, consider dedicating the LPAR to WML for z/OS.
  • ³ Allocate more storage space if you plan to enable and configure the Db2® anomaly detection solution. The anomaly detection service runs on the same LPAR as your WMLz and shares the same system resources. The larger the SMF data set, the output CSV file, and the SMF in-memory data, the more resources these services and processes demand. In some cases, it's recommended that you allocate an additional 100 GB of disk storage per LPAR for the anomaly detection service.

Capacity considerations for training workloads on Z

Models that are created with the integrated Notebook Editor are trained on Z systems. If most of your models are built as notebooks, consider adjusting your Z system capacity to meet your training workload needs.

A training workload is typically defined by the following factors:

  • Number of concurrent training jobs that are run by the Jupyter notebook. Each job includes the tasks for data loading, data transformation, data visualization, feature transformation, model fitting, and model evaluation (see the sketch after this list).
  • Type of models. The WML for z/OS training service supports SparkML, Scikit-learn, ARIMA or Seasonal ARIMA, and XGBoost models.
  • Size of training data set (GB).
  • Number of data features.
  • Specification of hyperparameters for fine-tuning the defined algorithm or pipeline and improving the accuracy of a model.
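To make these factors concrete, the following is a minimal sketch of what one SparkML training job looks like as a Python notebook, covering the stages listed above. The file name, column names, and algorithm choice are illustrative assumptions, not part of the WML for z/OS interface:

    # A minimal sketch of one SparkML training job (Python notebook).
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("training-job-sketch").getOrCreate()

    # Data loading: the size of this data set (GB) is one workload factor.
    df = spark.read.csv("training_data.csv", header=True, inferSchema=True)  # hypothetical file

    # Feature transformation: the number of input columns is the
    # "number of data features" workload factor.
    feature_cols = [c for c in df.columns if c != "label"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    # Model fitting: hyperparameters such as maxIter are the fine-tuning
    # knobs referred to above.
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, lr]).fit(train)

    # Model evaluation: score the fitted model on held-out data.
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
    print(f"Test AUC: {auc:.3f}")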

As the following examples show, these factors can individually or collectively impact your training workload and thus affect the performance of WML for z/OS.

Example 1: Suppose that you allocate 4 zIIPs and 1 GCP to run concurrent jobs of training zIIP-eligible SparkML models. The models are Scala notebooks and trained with a 2 GB data set that contains 100 features. For the training workloads, you also configure z/OS Spark to run in the standalone cluster mode with spark.cores.max set to 2 and SPARK_WORKER_CORES to 200. The following table is a snapshot of results from internal tests with this system capacity and Spark configuration. The results show a clear correlation among the size of training workload, the availability of system capacity, and the performance of the WML for z/OS training service. The performance is measured by system response time in minutes.

Z system capacity (CPU) | Size of data set (GB) | Number of data features | 8 jobs | 16 jobs | 32 jobs | 64 jobs
4 zIIPs, 1 GCP | 2 | 100 | 13 | 27 | 56 | 135
4 zIIPs, 1 GCP | 4 | 200 | 21 | 46 | 94 | 238
8 zIIPs, 1 GCP | 2 | 100 | 9 | 15 | 31 | 68
8 zIIPs, 1 GCP | 4 | 200 | 14 | 21 | 44 | 101
Cell values under the job columns are system response times in minutes for the given number of concurrent training jobs.

For example, it takes the training service 13 minutes to complete 8 concurrent jobs. But that response time deteriorates significantly as the training workload increases, whether the increase is in the number of concurrent jobs, the size of the data set, or the number of data features. When the session concurrency doubles to 16, system response time rises at almost the same rate, to 27 minutes. That time increases even more, up to 46 minutes, when the data set grows to 4 GB with 200 features. However, system response time improves dramatically when more Z capacity is allocated to handle the same training workloads. For example, when 8 zIIPs are allocated to run the same 8 concurrent jobs, it takes the system only 9 minutes to train the same SparkML models.

While your actual system performance can vary, the example clearly indicates the positive correlation among training workload, system capacity, and system response time. After you determine the required Z system capacity, update your z/OS Spark configuration accordingly, including spark.cores.max, SPARK_WORKER_CORES, and other properties in the spark-defaults.conf and spark-env.sh files.
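For reference, a minimal version of the two settings named above might look like the following. The values shown are the ones used in the Example 1 tests; tune them to your own capacity:

    # spark-defaults.conf: maximum cores that one Spark application can claim
    spark.cores.max      2

    # spark-env.sh: total cores that the Spark worker offers on this LPAR
    SPARK_WORKER_CORES=200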

Example 2: While SparkML models are trained on zIIP processors, Scikit-learn models are generally processed on general-purpose processors (GCPs). If your training workload consists primarily of Scikit-learn models, allocate an appropriate number of GCPs for the LPAR where the training service runs.

Suppose that you allocate 8 GCPs and 1 zIIP to run concurrent jobs of training Scikit-learn models. The models are Python notebooks and trained with a 2 GB data set that contains 100 features. As shown in the following table, internal tests show the same positive correlation among the training workload, the capacity allocated for the workload, and system response time. System response time is directly impacted by the number of concurrent jobs and the system capacity allocated to handle those jobs.

Z system capacity (CPU) | Size of data set (GB) | Number of data features | 1 job | 4 jobs | 8 jobs | 16 jobs | 32 jobs
4 GCPs, 1 zIIP | 2 | 100 | 7 | 8 | 15 | 35 | 72
8 GCPs, 1 zIIP | 2 | 100 | 7 | 8 | 9 | 16 | 36
Cell values under the job columns are system response times in minutes for the given number of concurrent training jobs.

With 8 GCPs and 1 zIIP, it takes the system 9 minutes to process 8 concurrent jobs and 36 minutes for 32 concurrent jobs, indicating that workload and system response time grow at almost an identical rate. But when the number of GCPs is cut to 4, system response time increases to 15 and 72 minutes, respectively, for the same workloads of 8 and 32 concurrent jobs. Results from repeated tests indicate that optimal system response time and performance can be achieved by allocating one GCP per job when training Python Scikit-learn models. Share a GCP among multiple concurrent jobs only when this ideal allocation is not obtainable.

Example 3: The preceding examples clearly demonstrate the positive correlation among the training workload, the Z processor capacity used for the workload, and system response time. The same tests in those examples show a similar correspondence between the training workload and the Z memory allocated for handling the workload. For the tests, a total of 768 GB of memory is allocated along with the minimum CPU capacity required for training Scikit-learn or SparkML models. The models are trained with a 2 GB data set that contains 100 features.

For training SparkML models, about 42 of the 768 GB of memory is used for running 8 concurrent jobs, but 74 GB is consumed when the workload grows to 16 concurrent sessions. Test results also show that approximately 10 GB of memory is consistently used for preparing the system for each workload and 4 GB for processing each job in the workload. So, of the 42 GB of memory used for the workload of 8 concurrent jobs, roughly 10 GB goes to system preparation (overhead) and 32 GB (4 GB × 8) to data handling.

For training Scikit-learn models, roughly 120 GB and 240 GB of memory are used to complete the workloads of 8 and 16 concurrent jobs, respectively. On average, around 15 to 18 GB of memory is required to handle each job in a Scikit-learn workload.
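As a rough planning aid, these per-job figures can be folded into a simple estimate. The sketch below encodes the approximations from the tests above (about 10 GB of fixed overhead plus 4 GB per concurrent SparkML job, and 15 to 18 GB per concurrent Scikit-learn job); the constants come from internal tests with a 2 GB, 100-feature data set and are starting points, not product limits:

    def estimate_memory_gb(jobs, model_type="sparkml"):
        """Rough memory estimate (GB) for a concurrent training workload."""
        if model_type == "sparkml":
            # ~10 GB system preparation overhead + ~4 GB per concurrent job
            return 10 + 4 * jobs
        # Scikit-learn: roughly 15 to 18 GB per concurrent job (upper bound)
        return 18 * jobs

    print(estimate_memory_gb(8))              # 42, matches the 8-job SparkML test above
    print(estimate_memory_gb(16, "scikit"))   # 288, an upper bound for 16 Scikit-learn jobs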