Tivoli Workload Scheduler for z/OS: Activating, configuring, and running the Dynamic Critical Path (Workload Service Assurance)
Activating and configuring Dynamic Critical Path
The Dynamic Critical Path feature is managed entirely by the Controller and Daily Plan Batch tasks; there are no other requirements. The only times that you must specify for a critical operation are its deadline and its estimated duration. The feature then calculates and adjusts the intermediate times automatically so that the defined times are met, with no further manual intervention.
To activate Dynamic Critical Path, leave the CRITJOB keyword of the JTOPTS initialization statement of the Controller set to YES. YES is the default value of this keyword, so you only need to make sure that it has not been changed. See Tivoli Workload Scheduler for z/OS: Customization and Tuning for information on all the parameters quoted in this document.
You must first decide which jobs are critical for the operation of your business. For these jobs, specify P in the CRITICAL field of the Automatic Options definition (panel EQQAMJBP) in ISPF; you can also do this from the Dynamic Workload Console. This action defines each job as the target of a critical path.
Next, extend your current plan so that your critical jobs are included in the plan. When the daily plan is processed, a critical path that includes the internal and external predecessors of each critical job is calculated, and a specific dependency network is created. While the plan is running, the scheduler monitors the critical paths that are consuming their slack time. When a predecessor in a critical path is close to a late condition (it has not started yet and its latest start time is approaching), its execution is accelerated so that the critical job is not delayed. Finally, you must configure WLM (Long Duration policy and class) so that operations in a critical path that have exceeded their estimated duration are also dynamically promoted to different WLM classes. This is discussed in the next section.
Setting up for promotion on WLM
Use the WLM keyword of the OPCOPTS initialization statement to specify the policy by which a job is judged to be running late and the WLM service class to which late jobs are promoted. For example, one of the policies, DURATION, determines that a job is running late when it exceeds its expected duration (as specified in the current plan). When this happens, the scheduler moves the job to the higher-performance service class. Note that the discussion of the OPCOPTS WLM keyword in Customization and Tuning uses the term critical to qualify all jobs in need of promotion. In WLM terminology, the term describes in general all jobs that are considered to be "important", not specifically jobs that are within a critical path. You can also specify a WLM class and policy for individual operations, as shown in this figure:
WLM class promotion can be applied to the following operations
- Operations that are defined with W in the CRITICAL field (regardless of critical path; the process takes place even if the Dynamic Critical Path feature is not active).
- Operations that are defined with P in the CRITICAL field.
- Operations that are in the critical path of a critical job. The critical path is calculated by the scheduler starting from the job that was defined as critical. All the operations that are in a critical path are automatically eligible for WLM class promotion.
WLM class to which an operation is promoted
- Operations defined with W or P in CRITICAL are promoted to the specified class. If no class was specified, they are promoted to the global class (defined in OPCOPTS as default class).
- Operations belonging to the critical path are promoted to the specified class. If no class was specified, they are promoted to the class specified for the critical job. If no class was specified for the critical job either, they are promoted to the global class (defined in OPCOPTS as default class).
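The fallback order described in the two lists above can be sketched as a small function. This is an illustration only, not product code: the function name, its parameters, and the class names in the usage lines are hypothetical.

```python
from typing import Optional


def promotion_class(operation_class: Optional[str],
                    critical_job_class: Optional[str],
                    global_class: str,
                    path_member_only: bool) -> str:
    """Return the WLM class an operation would be promoted to.

    operation_class:    class specified on the operation itself, if any
    critical_job_class: class specified on the critical job, if any
    global_class:       default class defined in OPCOPTS
    path_member_only:   True when the operation is eligible only because it
                        belongs to a critical path (it is not itself defined
                        with W or P in the CRITICAL field)
    """
    if operation_class:
        # A class specified on the operation always wins.
        return operation_class
    if path_member_only and critical_job_class:
        # Critical path members inherit the class of the critical job.
        return critical_job_class
    # Last resort for every eligible operation: the global default class.
    return global_class


# Hypothetical class names, for illustration only.
print(promotion_class("FASTOP", "FASTJOB", "GLOBAL", True))   # FASTOP
print(promotion_class(None, "FASTJOB", "GLOBAL", True))       # FASTJOB
print(promotion_class(None, None, "GLOBAL", True))            # GLOBAL
print(promotion_class(None, "FASTJOB", "GLOBAL", False))      # GLOBAL
```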
Choosing a criterion to configure the Long Duration
Long Duration generally corresponds to the condition where the runtime of a job exceeds its estimated duration. You can, however, choose other criteria to define this term. For example, you can define Long Duration as the condition where the runtime of the job extends beyond twice its estimated duration.
Use the ALEACTION keyword of the JTOPTS statement to set the definition of Long Duration. The keyword takes two values. The first value specifies the criterion by which the duration of an operation goes beyond its limit (becomes long); in the example that follows it acts as a percentage of the planned duration. The second value is the so-called tolerance time: the time (in seconds) that the scheduler must wait before it takes action on the job after the long duration condition is detected. For example, if you specify ALEACTION(500,060) and the planned duration of a job is 10 seconds, the job is considered to have a long duration when its actual duration reaches 50 seconds. Because of the second value, a further 60 seconds elapse before action is taken on the job.
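The arithmetic of the ALEACTION(500,060) example can be checked with a few lines of Python. This is a sketch that assumes the first value is a percentage of the planned duration, which is what the example implies; the function and parameter names are hypothetical.

```python
def long_duration_threshold(planned_seconds: float, limit_percent: int) -> float:
    """Runtime (in seconds) at which the job is considered to have a long duration.

    Assumption: the first ALEACTION value is a percentage of the planned duration.
    """
    return planned_seconds * limit_percent / 100


def earliest_action_time(planned_seconds: float, limit_percent: int,
                         tolerance_seconds: int) -> float:
    """Earliest runtime (in seconds) at which action can be taken on the job."""
    return long_duration_threshold(planned_seconds, limit_percent) + tolerance_seconds


# Values from the ALEACTION(500,060) example, planned duration of 10 seconds.
print(long_duration_threshold(10, 500))     # 50.0
print(earliest_action_time(10, 500, 60))    # 110.0
```

With these values the job becomes long at 50 seconds of runtime, and no action is taken on it before 110 seconds of runtime.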
Another factor in the time that elapses before action is taken on a late job is the workstation analyzer (the controller subtask in charge of job scheduling). When it is idle, it wakes up every 2 minutes to check for new work and for late jobs. A long duration condition might therefore not be detected immediately, but only when the workstation analyzer next wakes up.
Understanding how critical paths are monitored
When you qualify a job as critical, the entire network of its predecessors is added to a monitoring table of the Dynamic Critical Path feature.
The reason for monitoring the entire critical path network is that the timely execution of a critical job is threatened by anything that happens to its predecessors. When a potential delay takes place in this network, the feature detects it and triggers an alert.
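To make "the entire network of its predecessors" concrete, the following sketch collects every direct and indirect predecessor of a critical job from a dependency map. It illustrates the idea only and is not how the monitoring table is actually built; the function name, the data layout, and the job names are hypothetical.

```python
def predecessor_network(critical_job, predecessors):
    """Return the set of all direct and indirect predecessors of critical_job.

    predecessors maps each job name to the list of its direct predecessors.
    """
    seen = set()
    stack = [critical_job]
    while stack:
        job = stack.pop()
        for pred in predecessors.get(job, ()):
            if pred not in seen:
                # First visit: record the job and follow its own predecessors.
                seen.add(pred)
                stack.append(pred)
    return seen


# Hypothetical dependency map: PAYROLL is the critical job.
deps = {
    "PAYROLL": ["SORT", "MERGE"],
    "SORT": ["EXTRACT"],
    "MERGE": ["EXTRACT"],
    "ARCHIVE": ["PAYROLL"],   # a successor, so not part of the network
}
print(sorted(predecessor_network("PAYROLL", deps)))   # ['EXTRACT', 'MERGE', 'SORT']
```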
The following conditions are considered as potential threats for the completion of a critical job before its deadline:
- A late condition (a predecessor did not start running before its latest start time)
- A long duration condition
- An error condition (a predecessor terminated in error status)
When any of these conditions occurs, message EQQCP20 is issued in the controller MLOG and in the system log to communicate that the risk level of the critical job is switched to potential. Furthermore, the operation that caused this condition is included in a hot list for the critical job. The notification of this information enables an operator to step in (for example if a predecessor ended in error) and put things straight before the delay introduced in the network adds up to put the timely execution of the critical job at higher risk.
If the problem is solved and the delay is absorbed, the risk level returns to normal and this is logged in another message. If the delay continues to increase and becomes a threat to the completion of the critical job before its deadline, message EQQCP21 is issued in the controller MLOG and in the system log. In this case the risk level changes to high (H). The following figure shows the view of a critical job at high risk.
Configuring DCP for best performance
The decision to use Dynamic Critical Path in your environment is presumably based on the premise that you need some of your jobs to complete within their deadlines. This implies that you are concerned with job deadlines only when you have critical jobs. However, Tivoli Workload Scheduler for z/OS versions earlier than 9.1 require you to specify a deadline whenever you define an application. You must therefore specify a deadline even when it has no particular significance.
The deadline is essential to the calculation of the latest start time. An operation that fails to start by its latest start time is in a late condition. A badly or inadequately defined deadline can therefore produce a false late condition for an operation. If the operation happens to be a predecessor of a critical job, it can raise a false potential risk alarm and mislead the operator about the situation.
If you use Dynamic Critical Path, it is important that the late conditions on the networks be based on the deadlines of the critical jobs. In this case, a late condition in the network becomes a real potential risk and does not create confusion. For this reason, it is imperative that the IGNOREDEADL keyword of the BATCHOPT initialization statement be set to YES. This setup causes the deadlines of all operations that are neither critical nor suppress-if-late to be pushed forward beyond the end of the plan so that they do not affect the calculations of time and of late conditions.
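The effect of an insignificant deadline on the latest start time can be illustrated with a textbook backward pass over a two-job network. This is a sketch under stated assumptions, not the scheduler's documented algorithm: it assumes that the latest start time of an operation is its latest allowed end time minus its duration, and it models "pushed beyond the end of the plan" with a very large hypothetical value. All names and times (in minutes from the start of the plan) are invented for the example.

```python
# Hypothetical stand-in for a deadline pushed beyond the end of the plan.
BEYOND_PLAN_END = 10**9


def latest_start_times(durations, successors, deadlines):
    """Backward pass: latest start = latest allowed end - duration.

    The latest allowed end of an operation is the earlier of its own deadline
    and the latest start times of its successors. Times are in minutes.
    """
    memo = {}

    def latest_start(op):
        if op not in memo:
            latest_end = deadlines[op]
            for succ in successors.get(op, ()):
                latest_end = min(latest_end, latest_start(succ))
            memo[op] = latest_end - durations[op]
        return memo[op]

    return {op: latest_start(op) for op in durations}


durations = {"EXTRACT": 30, "REPORT": 60}    # REPORT is the critical job
successors = {"EXTRACT": ["REPORT"]}

# EXTRACT carries a deadline of its own that has no real significance.
own_deadlines = {"EXTRACT": 200, "REPORT": 600}
# The same network with the deadline of the non-critical job ignored.
ignored = {"EXTRACT": BEYOND_PLAN_END, "REPORT": 600}

print(latest_start_times(durations, successors, own_deadlines))
# {'EXTRACT': 170, 'REPORT': 540}
print(latest_start_times(durations, successors, ignored))
# {'EXTRACT': 510, 'REPORT': 540}
```

In the first case EXTRACT would be flagged as late at minute 170, although the critical job is not threatened until minute 510. In the second case the late condition depends only on the deadline of the critical job.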
Another important consideration is the accuracy of the times on which Dynamic Critical Path is based. To evaluate the risk level of a critical job and to calculate its critical path, the feature uses two automatically calculated parameters: the estimated start time and the estimated end time. They are an estimate of when an operation will start and when it will complete. For example, a high risk level corresponds to the condition where the estimated end time is later than the deadline, which means that the critical job will probably not complete by its deadline.
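The three risk levels described in this document can be summarized in a short decision function. This is a simplified sketch: it assumes that the potential level applies whenever the hot list of the critical job is not empty, and the names and time values (minutes from the start of the plan) are hypothetical.

```python
from enum import Enum


class Risk(Enum):
    NORMAL = "normal"
    POTENTIAL = "potential"
    HIGH = "high"


def risk_level(estimated_end: int, deadline: int, hot_list: list) -> Risk:
    """Classify the risk of a critical job.

    estimated_end and deadline are in minutes from the start of the plan.
    hot_list holds the predecessors that are late, long, or ended in error.
    """
    if estimated_end > deadline:
        # The critical job will probably not complete by its deadline.
        return Risk.HIGH
    if hot_list:
        # A threat exists in the network, but the delay can still be absorbed.
        return Risk.POTENTIAL
    return Risk.NORMAL


print(risk_level(580, 600, []))             # Risk.NORMAL
print(risk_level(590, 600, ["EXTRACT"]))    # Risk.POTENTIAL
print(risk_level(615, 600, ["EXTRACT"]))    # Risk.HIGH
```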
Every time the plan is created, a new monitoring table is produced with the critical jobs and their predecessors, and the relevant times are worked out. The estimated start and end times are derived from the planned start and end times. The planned start and end times are determined from the daily plan, as an outcome of a simulated scheduling that the batch program runs while creating the plan.
The planned times are the starting values for the estimated times. As the plan runs, these values are adjusted each time a delay occurs in the critical path, so that the risk that a critical job does not meet its deadline is constantly under watch.
The estimated times are also used to determine the critical path starting from the critical job. To help the daily plan compute these times as precisely as possible, set the TIMEDEPCHK keyword of the BATCHOPT initialization statement to YES. If you fail to do so, the daily plan in its simulated scheduling treats all the operations as time-dependent and determines the planned start time from the input arrival time. If you set TIMEDEPCHK to YES, the computation is more exact because, with no time dependence, it is not conditioned by the input arrival time. To sum up, to get the best out of the Dynamic Critical Path feature remember to set IGNOREDEADL and TIMEDEPCHK to YES.