Running a Data Collector engine
After you create a StreamSets environment, run one or more Data Collector engines for the environment.
Each engine runs on a separate workstation in your corporate network. Set up each workstation to meet the engine prerequisites. Then use the command line to run an engine as a container on a container management system such as Docker or Podman.
Run multiple engines for an environment to increase processing capacity or to support high availability. For more information, see Running multiple engines for a single environment.
Prerequisites
Before you run an engine, complete the required prerequisites.
Complete account prerequisites once. You can use the same API keys for all environments and engines in your account.
Complete engine workstation prerequisites on each workstation where you plan to run an engine.
Account prerequisites
Before you run an engine, complete the following prerequisites for your IBM watsonx account. You complete these steps only once and can reuse the API keys for all environments and engines.
Create a user API key (task credentials)
Running StreamSets jobs requires task credentials for secure authorization. A task credential is a user-generated API key, securely stored in Vault, that authenticates long-running data integration tasks. For more information, see Creating task credentials for jobs.
Create a cloud account API key
Running a Data Collector engine requires a cloud account API key for secure authorization. You provide this key when you run the engine command.
IBM Cloud
To create an API key for watsonx.data integration on IBM Cloud:
- In your watsonx account on IBM Cloud, from the navigation menu, select .
- In the IBM Cloud console, select API keys.
- Click Create.
- Save or download the API key value.
AWS
To create an API key for watsonx.data integration on AWS:
- In your AWS account, from the navigation menu, select .
- Click Create.
- Save or download the API key value.
Engine workstation prerequisites
Complete the following prerequisites on every workstation where you plan to run an engine.
Verify minimum system requirements
Verify that the engine workstation meets the following minimum requirements:
| Component | Minimum requirement |
|---|---|
| Operating system | Any Linux distribution |
| Cores | 2 |
| RAM | 4 GB |
| Disk space | 6 GB. Note: Do not use NFS or NAS to store Data Collector files. |
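For example, you can check these values on a Linux workstation with standard utilities. This is a quick sketch; the path in the disk check is an assumption, so point it at wherever the engine files will live:

```bash
# Quick checks against the minimums above (standard GNU/Linux utilities).
nproc               # CPU cores; expect 2 or more
free -h             # total RAM; expect 4 GB or more
df -h /opt          # free disk space; /opt is an assumed location for
                    # engine files - check the volume you actually use
```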
Install a container management system
A Data Collector engine runs as a container on a container management system, such as Docker or Podman.
Install Docker or Podman on the engine workstation.
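For example, on common distributions (package names vary by distribution and version; the commands below are typical, not prescriptive):

```bash
# Debian/Ubuntu: install Docker from the distribution repositories
sudo apt-get update && sudo apt-get install -y docker.io
sudo systemctl enable --now docker

# RHEL/Fedora: install Podman
sudo dnf install -y podman

# Verify the installation
docker --version    # or: podman --version
```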
Configure firewall access
If the engine workstation is behind a firewall, configure outbound access to required systems.
For more information, see Firewall access for StreamSets.
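As a quick sanity check, you can test outbound connectivity from the workstation. The host below is a placeholder; use the endpoints listed in Firewall access for StreamSets:

```bash
# <control-plane-host> is a placeholder for an endpoint from
# "Firewall access for StreamSets"; this only tests outbound HTTPS.
curl -sS --connect-timeout 10 https://<control-plane-host> >/dev/null \
  && echo "outbound HTTPS reachable" \
  || echo "connection blocked - review firewall rules"
```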
Running the engine command
Use the command line to run a Data Collector engine as a container.
About this task
The procedure that you use to run the engine depends on your engine version and communication method. For more information about the communication methods, see Engine communication.
| Engine version | Communication method and procedure |
|---|---|
| 7.1.0-0115 and later | Tunneling (default) - Running the engine with tunneling communication<br>Direct - Running the engine with direct communication |
| 7.1.0 and earlier | Direct - Running the engine with direct communication |
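The exact command is generated for your StreamSets environment and includes any customizations (see Customizing the engine command). As an illustration only, a containerized engine command has roughly this shape, assuming Docker and hypothetical image and variable names:

```bash
# Hypothetical sketch only: the image name, tag, and environment
# variable names are placeholders, not the generated command.
docker run -d \
  --name datacollector \
  -e CLOUD_API_KEY="<your-cloud-account-API-key>" \
  <engine-image>:<engine-version>
```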
Running the engine with tunneling communication (default)
Use the following procedure to run an engine that uses tunneling communication.
About this task
An engine uses tunneling communication when the Data Collector engine version is 7.1.0-0115 or later and the default tunneling communication method is configured for your account.
Procedure
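The generated command from your environment is authoritative. Illustratively, a tunneling engine needs no published ports because it opens an outbound connection to the control plane (placeholder names; Podman shown, Docker is equivalent):

```bash
# Illustrative only (placeholder names). With tunneling, the engine
# connects outbound, so no -p port mappings are needed.
podman run -d \
  --name dc-tunneling \
  -e CLOUD_API_KEY="<your-cloud-account-API-key>" \
  <engine-image>:<engine-version>
```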
Running the engine with direct communication
Use the following procedure to run an engine that uses direct communication.
About this task
An engine uses direct communication in either of the following cases:
- The Data Collector engine version is 7.1.0 or earlier.
- The Data Collector engine version is 7.1.0-0115 or later and the direct engine communication method is configured for your account.
Procedure
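Again, use the command generated for your environment. Illustratively, direct communication means the control plane reaches the engine, so the engine's HTTP port must be published on the host (18630 is Data Collector's customary default; placeholder names, and confirm the port for your engine version):

```bash
# Illustrative only (placeholder names). With direct communication,
# the engine must be reachable, so its HTTP port is published.
docker run -d \
  --name dc-direct \
  -p 18630:18630 \
  -e CLOUD_API_KEY="<your-cloud-account-API-key>" \
  <engine-image>:<engine-version>
```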
Running multiple engines for a single environment
Run multiple engines for a single environment to increase processing capacity or to support job failover and high availability.
- Processing capacity increases with each additional engine.
- Jobs can start on any online engine within resource thresholds.
- Jobs are assigned arbitrarily when more than one engine is available.
- When an engine shuts down unexpectedly, jobs can fail over to another available engine, starting from the last-saved offset.
To run multiple engines, set up a separate workstation with the engine prerequisites and then run the engine command on the additional workstation.
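After starting an engine on each workstation, you can confirm locally that the container is running; the engine list in the StreamSets environment remains the authoritative view. The container name below is an assumption carried over from the sketches above:

```bash
# Confirm the engine container is up on this workstation.
docker ps --filter "name=datacollector" \
  --format "table {{.Names}}\t{{.Status}}\t{{.Image}}"
```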
Job failover guidelines
When an environment uses multiple engines, jobs can fail over to another engine if the active engine becomes unavailable. The job restarts on an available engine that is online and within the defined resource thresholds, continuing from the last-saved offset recorded by the previous engine. However, if the job stopped while processing a batch of data, some records might be processed again and duplicated. For more information, see Delivery guarantee.
Jobs can fail over when the environment uses Data Collector engine version 6.4 or later.
A job can fail over up to three times. After the job reaches the maximum number of failover retries, the job run fails.
Before you run multiple engines to support job failover and high availability, review the following guidelines:
- Verify that source stages maintain offsets
- Confirm that the source stages in your flows maintain offsets. Most source stages save the offset while processing data, so subsequent job runs continue from the last-saved offset. You can run these jobs on environments with multiple engines.
- Verify that all engines can access source and target systems
- Confirm that all engines in an environment can access the source and target systems that your flows use.
- Configure source system resiliency
- Job failover provides high availability for job processing but not for incoming data. Job failover might take several minutes. To prevent data loss during failover, configure the source system to support job failover.
Customizing the engine command
About this task
You can customize the engine run command to add environment variables or Docker or Podman command options. You customize the command by editing the StreamSets environment details; when you retrieve the command, it includes your customizations.
You can add the following information to customize the command:
- Environment variables
- Add environment variables that you want the engine container to use.
- Docker or Podman command options
- Add Docker or Podman run command options.
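For example, if you added a proxy environment variable and a Docker restart-policy option in the environment details, the retrieved command might look roughly like this (illustrative placeholder values only):

```bash
# Illustrative sketch (placeholder values): a retrieved command with
# one added environment variable and one added Docker run option.
docker run -d \
  --restart unless-stopped \
  -e HTTPS_PROXY="http://proxy.example.com:3128" \
  -e CLOUD_API_KEY="<your-cloud-account-API-key>" \
  <engine-image>:<engine-version>
# --restart unless-stopped is the added Docker command option;
# HTTPS_PROXY is the added environment variable.
```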