Start Jobs
The Start Jobs processor starts one or more Control Hub jobs in parallel upon receiving a record. The processor can also start job instances from a job template.
The Start Jobs processor is an orchestration stage that you use in orchestration pipelines. Orchestration stages perform tasks, such as scheduling and starting pipelines and Control Hub jobs, that you can use to create an orchestrated workflow across StreamSets. For example, an orchestration pipeline can use the Cron Scheduler origin to generate a record every weekday at 9 AM that triggers the Start Jobs processor, which starts a set of Control Hub jobs.
After performing its task, the Start Jobs processor updates the orchestration record, adding details about the jobs that it started. Then, it passes the record downstream. You can pass the record to an orchestration stage to trigger another task. Or, you can pass it to a non-orchestration stage to perform other processing.
When you configure the Start Jobs processor, you specify the Control Hub URL, and the jobs or job template to start. You can also specify runtime parameters for each job or job instance.
You can configure the processor to reset the origins in the jobs when possible, and to run the jobs in the background. When running jobs in the background, the processor immediately updates and passes the input record downstream instead of waiting for the jobs to finish.
You also configure the credentials used to run the job. You can optionally configure properties used to maintain the HTTP connection, to establish an HTTP proxy, and to enable SSL/TLS.
You can also use a connection to configure the processor.
Job Execution and Data Flow
- Run jobs in the foreground
- By default, the processor starts jobs that run in the foreground. When the jobs run in the foreground, the processor updates and passes the orchestration record downstream after all the started jobs complete.
- Run jobs in the background
- You can configure the processor to start jobs that run in the background. When jobs run in the background, the processor updates and passes the orchestration record downstream immediately after starting the jobs.
Generated Record
The Start Jobs processor updates the orchestration record that it receives with information about the jobs that it starts.
Field Name | Description |
---|---|
<unique task name> | List Map field within the orchestratorTasks field of the record. Contains subfields with details about the task and the jobs that it started. |
<job ID> | List Map field within the jobResults field that provides details about each started job. |
For example, the following preview shows information provided by a Start Jobs processor with the start load job task name:
Note that the job status and colors indicate that the jobs are running at the time that the processor creates the record. There is no finishedSuccessfully field because the jobs have not yet completed.
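As an illustrative sketch, not actual processor output, the relevant fragment of the updated record might resemble the following, using the orchestratorTasks, jobResults, and jobStatus names described above, the start load job task name, and a placeholder job ID:

```json
{
  "orchestratorTasks": {
    "start load job": {
      "jobResults": {
        "<job ID>": {
          "jobStatus": "ACTIVE"
        }
      }
    }
  }
}
```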
For an example of a full orchestration record, see ../Orchestration_Pipelines/OrchestrationPipelines_Title.html#concept_x43_wlc_zlb__section_qtk_mlq_zlb.
Suffix for Job Instance Names
For job instances created or started from a job template, Control Hub appends a suffix to uniquely name each job instance.
The suffix is added to the job template name after a hyphen, as follows:
<job template name> - <suffix>
- Counter
- Control Hub appends a number to the job template name. For example, job instances created from the Web Log Collection Job are named as follows:
Web Log Collection Job - 1
Web Log Collection Job - 2
- Timestamp
- Control Hub appends a timestamp, indicating when the job instance started, to the job template name. For example, job instances created from the Web Log Collection Job are named as follows:
Web Log Collection Job - 2021-10-22
Web Log Collection Job - 2021-10-23
- Parameter Value
- Control Hub appends the value of the specified parameter to the job template name. For example, job instances created from the Web Log Collection Job are named as follows:
Web Log Collection Job - /server1/logs
Web Log Collection Job - /server2/logs
Runtime Parameters for Jobs
When you configure the Start Jobs processor to start job instances from templates, you must specify the runtime parameters for each job instance that you want the processor to start. You can also specify runtime parameters when you configure the processor to start jobs.
You can use functions from the StreamSets expression language to define parameter values.
When you configure runtime parameters in the Start Jobs processor, you must enter the runtime parameters as a JSON object, specifying the parameter names and values as key-value pairs. The parameter names must match runtime parameters defined for the pipeline that the job runs.
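For example, when starting a job whose pipeline defines FileDir and ErrorDir runtime parameters (parameter names used here for illustration), you might enter the following JSON object:

```json
{
  "FileDir": "/server1/logs",
  "ErrorDir": "/server1/errors"
}
```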
The format that you use differs depending on whether you are specifying parameters for a job or job instance:
- Format for jobs
- When configuring runtime parameters for a job, you specify one JSON object with all of the parameters that you want to define.
- Format for job instances
- When configuring runtime parameters for a job template, you specify one JSON object for each job instance that you want the processor to run.
[
  {
    "FileDir": "/server1/logs",
    "ErrorDir": "/server1/errors"
  },
  {
    "FileDir": "/server2/logs",
    "ErrorDir": "/server2/errors"
  }
]
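When you start many job instances from a template, writing the parameter array by hand can be tedious. As a sketch, assuming the FileDir and ErrorDir parameter names from the example above and hypothetical server names, you might generate the JSON array programmatically and paste the output into the processor configuration:

```python
import json

# One runtime-parameter object per job instance to start.
# Server names and parameter names are illustrative.
servers = ["server1", "server2"]
params = [
    {"FileDir": f"/{s}/logs", "ErrorDir": f"/{s}/errors"}
    for s in servers
]

print(json.dumps(params, indent=2))
```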
Configuring a Start Jobs Processor
Configure a Start Jobs processor to start one or more Control Hub jobs upon receiving a record. The Start Jobs processor is an orchestration stage that you use in orchestration pipelines.