Running a Data Collector engine
After you create a StreamSets environment, run one or more Data Collector engines for the environment.
Each engine runs on a separate workstation in your corporate network. Set up each workstation to meet the engine prerequisites. Then use the command line to run an engine as a container on a container management system such as Docker or Podman.
Run multiple engines for an environment to increase processing capacity or to support high availability. For more information, see Running multiple engines for a single environment.
Prerequisites
Before you run an engine, complete the required prerequisites.
Complete account prerequisites once. You can use the same API key for all environments and engines in your account.
Complete engine workstation prerequisites on each workstation where you plan to run an engine.
Account prerequisites
Before you run an engine, complete the following prerequisites for your IBM watsonx account. You complete these steps only once and can reuse the API key for all environments and engines.
Create an API key
Running a Data Collector engine requires an API key for secure authorization. You provide this key when you run the engine command.
If you already have an active API key, use the existing key.
If you do not have an active API key, click your avatar and select Profile and settings to open your account profile. Then click .
Engine workstation prerequisites
Complete the following prerequisites on every workstation where you plan to run an engine.
Verify minimum system requirements
Verify that the engine workstation meets the following minimum requirements:
| Component | Minimum requirement |
|---|---|
| Operating system | Any Linux distribution |
| Cores | 2 |
| RAM | 4 GB |
| Disk space | 6 GB. Note: Do not use NFS or NAS to store Data Collector files. |
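As a rough aid, the minimums in the table can be checked with a short script before you run an engine. The following is a minimal sketch, not part of the product: the function names are hypothetical, "GB" is assumed to mean 10^9 bytes, and the disk requirement is assumed to refer to free space on the file system that holds the container data (the root path is used here as a placeholder).

```python
import os
import shutil

# Minimums taken from the table above; treating "GB" as 10**9 bytes is an assumption.
MIN_CORES = 2
MIN_RAM_BYTES = 4 * 10**9
MIN_DISK_BYTES = 6 * 10**9


def check_requirements(cores, ram_bytes, disk_bytes):
    """Return a list of problems; an empty list means all minimums are met."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"cores: found {cores}, need at least {MIN_CORES}")
    if ram_bytes < MIN_RAM_BYTES:
        problems.append(f"RAM: found {ram_bytes / 10**9:.1f} GB, need at least 4 GB")
    if disk_bytes < MIN_DISK_BYTES:
        problems.append(f"disk: found {disk_bytes / 10**9:.1f} GB free, need at least 6 GB")
    return problems


def read_local_resources(path="/"):
    """Read core count, physical RAM, and free disk space on a Linux host."""
    cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_bytes = shutil.disk_usage(path).free  # assumption: free space is what matters
    return cores, ram_bytes, disk_bytes


if __name__ == "__main__":
    issues = check_requirements(*read_local_resources())
    print("OK" if not issues else "\n".join(issues))
```

The reported RAM is usually slightly lower than the installed amount, so a workstation close to the 4 GB minimum might be flagged even though it is acceptable. The script does not detect whether the path is on NFS or NAS storage.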
Install a container management system
A Data Collector engine runs as a container on a container management system, such as Docker or Podman.
Install Docker or Podman on the engine workstation.
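To confirm that a container management system is available on the workstation, you can look for the `docker` or `podman` executable on the `PATH`. The following sketch uses a hypothetical helper name and checks only that the executable exists, not that the Docker daemon is running or that the current user is allowed to start containers.

```python
import shutil


def find_container_runtime(candidates=("docker", "podman"), which=shutil.which):
    """Return the name of the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    runtime = find_container_runtime()
    print(runtime or "Neither docker nor podman was found on PATH")
```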
Running the engine command
Use the command line to run a Data Collector engine as a container.
About this task
The procedure that you follow depends on the communication method that your account uses:
- Tunneling (default): see Running the engine with tunneling communication (default).
- Direct: see Running the engine with direct communication.
Running the engine with tunneling communication (default)
Use the following procedure to run engines with tunneling communication. Engines use tunneling communication when your account uses the default tunneling communication method.
About this task
Procedure
Running the engine with direct communication
Use the following procedure to run engines with direct communication. Engines use direct communication when an administrator for your account has switched to the direct communication method.
About this task
Procedure
Running multiple engines for a single environment
Run multiple engines for a single environment to increase processing capacity or to support job failover and high availability.
- Processing capacity increases with each additional engine.
- Jobs can start on any online engine within resource thresholds.
- Jobs are assigned arbitrarily when more than one engine is available.
- When an engine shuts down unexpectedly, jobs can fail over to another available engine, starting from the last-saved offset.
To run multiple engines, set up a separate workstation with the engine prerequisites and then run the engine command on the additional workstation.
Job failover guidelines
When an environment uses multiple engines, jobs can fail over to another engine if the active engine becomes unavailable. The job restarts on an available engine that is online and within the defined resource thresholds.
The new engine starts processing from the last-saved offset recorded by the previous engine. However, if the job stopped while processing a batch of data, some data might be processed again and duplicated. For more information, see Delivery guarantee.
A job can fail over up to three times. After the job reaches the maximum number of failover retries, the job run fails.
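The interaction between offsets, duplicates, and the failover limit can be illustrated with a toy model. The following sketch is not how the product is implemented. It assumes that the offset is saved only after a complete batch is written, and the function name and the `crash_points` parameter are hypothetical.

```python
MAX_FAILOVERS = 3  # from the text: a job can fail over up to three times


def run_with_failover(records, batch_size, crash_points):
    """Simulate a job that fails over between engines.

    crash_points: one entry per engine run; that engine crashes after writing
    the given number of records (a stand-in for an unexpected shutdown).
    Returns (status, delivered), where delivered lists every record written
    to the target, including duplicates.
    """
    crashes = list(crash_points)
    delivered = []
    saved_offset = 0  # index of the next record to read, saved per batch
    failovers = 0

    while True:
        crash_after = crashes.pop(0) if crashes else None
        written = 0
        offset = saved_offset  # a new engine resumes from the last-saved offset
        crashed = False
        while offset < len(records):
            batch = records[offset:offset + batch_size]
            for record in batch:
                if crash_after is not None and written == crash_after:
                    crashed = True
                    break
                delivered.append(record)
                written += 1
            if crashed:
                break
            offset += len(batch)
            saved_offset = offset  # assumption: saved only after a full batch
        if not crashed:
            return "finished", delivered
        if failovers == MAX_FAILOVERS:
            return "failed", delivered
        failovers += 1


if __name__ == "__main__":
    # The first engine crashes after writing four of six records (mid-batch).
    print(run_with_failover(list(range(6)), batch_size=3, crash_points=[4]))
```

In this run, record 3 is written twice: once by the engine that crashed in the middle of the second batch, and once by the engine that resumed from the last-saved offset.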
Before you run multiple engines to support job failover and high availability, review the following guidelines:
- Verify that source stages maintain offsets
  Confirm that the source stages in your flows maintain offsets. Most source stages save the offset while processing data, so subsequent job runs continue from the last-saved offset. You can run these jobs on environments with multiple engines.
- Verify that all engines can access source and target systems
  Confirm that all engines in an environment can access the source and target systems that your flows use.
- Configure source system resiliency
  Job failover provides high availability for job processing but not for incoming data. Job failover might take several minutes. To prevent data loss during failover, configure the source system to support job failover.
Customizing the engine command
About this task
You can customize the engine run command by adding environment variables or Docker or Podman command options. To customize the command, edit the StreamSets environment details. The environment then includes the customization when you retrieve the command.
You can add the following information to customize the command:
- Environment variables
  Add environment variables that you want the engine container to use.
- Docker or Podman command options
  Add Docker or Podman run command options.
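To show how these two kinds of customization generally appear in a container run command, the following sketch assembles a generic `docker run` or `podman run` argument list. It is an illustration only: the real engine command is generated by the StreamSets environment, and the helper name, image name, variable name, and option values here are hypothetical. Both Docker and Podman accept `-e NAME=value` for environment variables.

```python
import shlex


def build_run_args(runtime, image, env=None, options=None):
    """Assemble a container run command as an argument list.

    runtime: "docker" or "podman"
    image:   image reference (a hypothetical placeholder in the example below)
    env:     mapping of environment variable names to values, passed with -e
    options: extra run options, already split into separate arguments
    """
    args = [runtime, "run"]
    for name, value in (env or {}).items():
        args += ["-e", f"{name}={value}"]
    args += list(options or [])
    args.append(image)  # the image follows all run options
    return args


if __name__ == "__main__":
    cmd = build_run_args(
        "docker",
        "example.invalid/engine:latest",      # hypothetical image name
        env={"EXAMPLE_VAR": "value"},         # hypothetical variable
        options=["--restart", "on-failure"],  # example run option
    )
    print(shlex.join(cmd))
```

Building the command as a list and quoting it with `shlex.join` keeps values that contain spaces or shell metacharacters intact when the command is printed.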