Using IBM Spectrum LSF Predictor

This document outlines how to use the LSF Predictor service in a typical workflow.

About this task

The following list contains the basic concepts associated with the Predictor service:
Experiment
An experiment is an LSF simulation run that uses a selected LSF configuration and workload snapshot.
Cluster configuration
A cluster configuration is a full set of LSF cluster configurations and workload policies.
Workload snapshot
A workload snapshot is a set of job submissions and completion records that are imported from the LSF cluster events files (lsb.events*).
Prediction
An AI model training process that includes selecting data, cleaning data, starting and stopping training, viewing and publishing pipelines, testing models, optimizing workload snapshots, and more.

Procedure

The following examples outline typical Predictor service workflows, using the provided samples so that you can practice and learn.
Workflow example 1:
Start by optimizing the sample data and then rerunning a sample experiment. You can then compare the results of the original (baseline) and the new experiment. This example does not require IBM Cloud Pak for Data to be running.

  1. Select the Workload Snapshots tab.
  2. In the WL_clusterA_small row, click the Optimize action.
    Select the default sample_model_max_mem.model (local) prediction model. Click Optimize.
  3. Select the Experiments tab.
  4. In the Sample Experiment row, click the Rerun action.
    In the Modify and Rerun Experiment wizard, select the Workload Snapshot tab, then select the newly generated workload snapshot. Click Next to review your experiment selections, then click Rerun. You can track the progress of the Sample Experiment job; click Refresh to update the progress bar in the table. The experiment is complete when the progress bar shows that 100% of jobs are completed.
  5. To view the prediction results charts, click the Sample Experiment name and select the Prediction Results tab.

Workflow example 2:
Use a small job data set to practice creating a model with the LSF Predictor.
Train a model to predict the amount of memory required for a job.
Note: This example does require IBM Cloud Pak for Data to be running.

  1. To create a prediction regarding the amount of memory required for a job, select the Prediction tab.
    Click Create +. Enter a prediction name such as Job_memory. Leave the default selections of max-mem prediction target and Regression prediction type. You can add a description of the prediction if you want. Click Next.
  2. To select historical job data in lsb.acct files, enter a data source location and your LSF cluster name.
    Select the workload start and end times. Click Next.
  3. Select job attributes for the model training data under the Job Features tab.
    Previews of the data update dynamically as you make changes to the job attributes and filters.
  4. To start the model training, click Create.

    The model training time depends on the amount of training data. For example, training with 10,000 jobs and 6 attributes takes about 20 minutes.

    To display the pipeline details when the model training completes, click the prediction name, then select the Pipelines tab.

  5. To publish the pipeline as a model, locate the pipeline with rank 1 and click Publish.
  6. Test the model interactively by clicking Test.
    To discover the predicted job memory size, enter job attributes and click Predict.
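The Predictor service selects training data from a start and end time window (step 2 above). The following sketch illustrates that kind of time-window filtering in plain Python; the record fields (`job_id`, `submit_time`, `max_mem_mb`) are hypothetical simplifications, and real training data comes from LSF lsb.acct files with many more attributes.

```python
from datetime import datetime

# Hypothetical, simplified job records for illustration only.
jobs = [
    {"job_id": 101, "submit_time": datetime(2023, 5, 1, 9, 0),  "max_mem_mb": 2048},
    {"job_id": 102, "submit_time": datetime(2023, 5, 3, 14, 0), "max_mem_mb": 4096},
    {"job_id": 103, "submit_time": datetime(2023, 6, 1, 8, 0),  "max_mem_mb": 1024},
]

def select_window(records, start, end):
    """Keep only jobs that were submitted within the [start, end] window."""
    return [r for r in records if start <= r["submit_time"] <= end]

training = select_window(jobs, datetime(2023, 5, 1), datetime(2023, 5, 31))
print([r["job_id"] for r in training])  # jobs 101 and 102 fall in the May window
```

Narrowing the window is one way to keep the training set small while you experiment, which also keeps training time short.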

Workflow example 3:
Tune your model to improve prediction accuracy.

  1. Click the completed prediction name, then check the MAE (mean absolute error, for the Regression type) or Accuracy (for the Classification type) value for each pipeline of the prediction.
    If the MAE value is too large relative to the actual job memory, or the Accuracy value is too small, it is time to tune the model.
  2. To tune the model, the most important task is to find the most relevant job features.
    The feature importance column in the pipeline table shows the importance number that was determined from the last training run. Select the job features with the largest importance numbers for the next tuning run.
  3. Click the Tune action to start tuning the prediction.
    Carefully select the important job features. The following general rules can result in better predictions:
    1. Extract the relevant information as a tag from a long string instead of using the whole string.
    2. Remove the irrelevant job features.
    3. Use a smaller job data set to reduce the tuning-process time.
      It is advisable to select fewer than 100,000 jobs for tuning.
  4. After the tuning is done, recheck the MAE or Accuracy value.
    If the value is still not good enough, return to step 2 and tune the model again.
  5. If the MAE or Accuracy value is good enough, apply the feature selection to a large data set to improve the accuracy of the prediction.
    Create a new prediction or tune an existing prediction by selecting a large number of jobs. However, based on benchmark results, selecting too many jobs (for example, 10 million) does not improve model accuracy further.
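To make the tuning loop above concrete, the following sketch shows how MAE is computed for a regression prediction and how features might be ranked by importance. The memory values and importance numbers are invented for illustration; the Predictor UI reports these values for you.

```python
def mean_absolute_error(actual, predicted):
    """MAE: average absolute difference between predicted and actual values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical per-job maximum memory values, in MB.
actual_mem    = [2048, 4096, 1024, 8192]
predicted_mem = [1900, 4500, 1200, 7800]

mae = mean_absolute_error(actual_mem, predicted_mem)
print(f"MAE: {mae:.0f} MB")  # judge this against typical job memory sizes

# Hypothetical feature-importance numbers from a training run: keep the
# features with the largest importance for the next tuning run.
importance = {"queue": 0.05, "app_tag": 0.42, "num_cores": 0.31, "user": 0.02}
top_features = sorted(importance, key=importance.get, reverse=True)[:2]
print(top_features)  # ['app_tag', 'num_cores']
```

An MAE of a few hundred MB may be acceptable for jobs that use several GB of memory, but it would be poor for jobs that use only a few hundred MB, which is why the MAE must be judged relative to the actual job memory.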

Results

To deploy the model to your LSF production environment, see Deploying models to IBM Spectrum LSF production environment.