Problems when you run multiple instances of a job from a job sequence, or from a script that uses dsjob.

A number of related problems can occur when you run multiple instances of a job from a job sequence, or from a script that sequences jobs with the dsjob command.

Symptoms

  • Multiple job instances run from a sequence or a script, and the sequence reports that status=99 for one or more of the job instances.
  • Multiple job instances run from a sequence or a script, and the job instances take a long time to start and to finish.
  • More than 25 job instances run from a sequence or a script, and the sequence reports that status=99 for one or more of the job instances.
  • The system does not have enough resources because of a heavy workload, and the sequence reports a code=-99 error for a parallel job.
  • On Intel RedHat and Suse systems, jobs can hang despite having successfully run the underlying OSH code.
  • Some jobs run with missing parameters, or parameters that are erroneously set to default values.

Resolving the problem

For release 8.0.x/8.1:

  • If your system runs RedHat or Suse on Intel, install JR30015v5 on the server.
  • For all other systems, install JR30015v6 on the server.
  • If auto-purge is enabled, and more than 25 instances of a job are run simultaneously, then the JR30015v3 client patch must be installed on all client systems.
  • Recompile all parallel jobs after you install the patch or fix pack.

For release 8.1:

  • Install fix pack 1.
  • Recompile all parallel jobs after you install the patch or fix pack.

The fix introduces the following optional capabilities:

Environment variable: DSWaitResetStartup

When multiple instances of a job are run from a sequence, and one or more of the job instances are set to reset, the sequence might report status=99. This can occur because the controlling sequence did not give the job instances enough time to reset before it polled their status. In that case, increase the startup time allowed for a job reset by setting the environment variable DSWaitResetStartup. The value of DSWaitResetStartup cannot exceed the value of DSWaitStartup, which is 60 by default. For example, if a value of 120 is required for DSWaitResetStartup, you must ensure that DSWaitStartup is also set to at least 120.
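The relationship between the two variables can be checked before a change is rolled out. The following Python sketch is illustrative only and is not part of the fix: the function name check_wait_settings and the idea of passing the settings as a dictionary are assumptions made for this example. It encodes only the rule stated above, that DSWaitResetStartup cannot exceed DSWaitStartup, with 60 as the default for DSWaitStartup.

```python
from typing import Dict, List

# Default value of DSWaitStartup, as stated in the text above.
DEFAULT_WAIT_STARTUP = 60


def check_wait_settings(env: Dict[str, str]) -> List[str]:
    """Return a list of problems found in the two wait-time settings.

    `env` is a hypothetical mapping of variable names to string values,
    for example a copy of the values you intend to configure.
    """
    problems: List[str] = []
    wait_startup = int(env.get("DSWaitStartup", DEFAULT_WAIT_STARTUP))

    reset_raw = env.get("DSWaitResetStartup")
    if reset_raw is None:
        # The variable is optional; nothing to check when it is not set.
        return problems

    wait_reset = int(reset_raw)
    if wait_reset > wait_startup:
        problems.append(
            f"DSWaitResetStartup={wait_reset} exceeds DSWaitStartup={wait_startup}; "
            f"set DSWaitStartup to at least {wait_reset}"
        )
    return problems
```

For the example in the text, check_wait_settings({"DSWaitResetStartup": "120"}) reports a problem because DSWaitStartup is still at its default of 60, and the problem disappears once DSWaitStartup is also set to 120.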

Environment variable: DS_NO_INSTANCE_PURGING

If the system is under extreme load, and status=99 errors still occur when you run many multi-instance jobs with auto-purge enabled, it might be necessary to set the DS_NO_INSTANCE_PURGING environment variable to 1. This setting stops auto-purge from deleting the status records for a job instance, so that the controlling job can read the instance status when system resources become available. (In other situations, you might want clean logs with no persistent instance entries, so the default behavior is to purge instance entries.)

Client change, and environment variable DSJobStartedMax

The number of recorded instance identifiers was increased from 25 to 100. The increase prevents status records from being purged when more than 25 instances run simultaneously. If you use N-instance auto-purging and run more than 25 simultaneous instances, then the N-instance auto-purge limit must be set to more than 25. The limit is set in the Director or Administrator clients. If you must run more than 100 instances simultaneously, then the environment variable DSJobStartedMax must be set to the required value. The maximum value is 9999. The APAR for this issue is JR30015.
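The thresholds above can be summarized as a small decision helper. The following Python sketch is illustrative only: the function name required_settings and its parameters are assumptions made for this example, and it encodes nothing beyond the limits stated in this section (25, 100, and 9999).

```python
from typing import List

AUTO_PURGE_THRESHOLD = 25      # above this, the N-instance auto-purge limit matters
DEFAULT_RECORDED_IDS = 100     # recorded instance identifiers after the fix
DS_JOB_STARTED_MAX_LIMIT = 9999  # maximum value of DSJobStartedMax


def required_settings(instances: int, n_instance_auto_purge: bool) -> List[str]:
    """List the settings this section calls for, given how many instances
    of a job run simultaneously and whether N-instance auto-purging is used."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if instances > DS_JOB_STARTED_MAX_LIMIT:
        raise ValueError(
            f"DSJobStartedMax cannot exceed {DS_JOB_STARTED_MAX_LIMIT}"
        )

    actions: List[str] = []
    if n_instance_auto_purge and instances > AUTO_PURGE_THRESHOLD:
        actions.append("set the N-instance auto-purge limit to more than 25")
    if instances > DEFAULT_RECORDED_IDS:
        actions.append(f"set DSJobStartedMax to at least {instances}")
    return actions
```

For example, required_settings(150, True) lists both actions, whereas required_settings(20, True) returns an empty list because no change is needed at that instance count.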