Running a Data Collector engine

After you create a StreamSets environment, run one or more Data Collector engines for the environment.

Each engine runs on a separate workstation in your corporate network. Set up each workstation to meet the engine prerequisites. Then use the command line to run an engine as a container on a container management system such as Docker or Podman.

Run multiple engines for an environment to increase processing capacity or to support high availability. For more information, see Running multiple engines for a single environment.

To run a Data Collector engine, complete the prerequisites and then run the engine command, as described in the following sections.

Prerequisites

Before you run an engine, complete the required prerequisites.

Complete account prerequisites once. You can use the same API key for all environments and engines in your account.

Complete engine workstation prerequisites on each workstation where you plan to run an engine.

Account prerequisites

Before you run an engine, complete the following prerequisites for your IBM watsonx account. You complete these steps only once and can reuse the API key for all environments and engines.

Create an API key

Running a Data Collector engine requires an API key for secure authorization. You provide this key when you run the engine command.

If you already have an active API key, use the existing key.

If you do not have an active API key, click your avatar and select Profile and settings to open your account profile. Then click API key > Generate new key.

Engine workstation prerequisites

Complete the following prerequisites on every workstation where you plan to run an engine.

Verify minimum system requirements

Verify that the engine workstation meets the following minimum requirements:

Component	Minimum requirement
Operating system	Any Linux distribution
Cores	2
RAM	4 GB
Disk space	6 GB

Note: Do not use NFS or NAS to store Data Collector files.
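
The minimum requirements can be checked from the command line. The following is a minimal sketch, not an official verification tool: the meets_minimum helper is a hypothetical name, and the sketch assumes that the filesystem holding the current directory is the one that will store Data Collector files. RAM is rounded up to whole GB because the kernel reports slightly less than the installed amount.

```shell
# Hypothetical helper: succeeds when the values meet the documented
# minimums of 2 cores, 4 GB RAM, and 6 GB disk space.
meets_minimum() {
  local cores=$1 ram_gb=$2 disk_gb=$3
  [ "$cores" -ge 2 ] && [ "$ram_gb" -ge 4 ] && [ "$disk_gb" -ge 6 ]
}

# Number of available processing units
cores=$(nproc 2>/dev/null || echo 0)
# MemTotal is reported in kB; round up to whole GB
ram_gb=$(awk '/^MemTotal:/ {printf "%d", ($2 + 1048575) / 1048576}' /proc/meminfo 2>/dev/null || echo 0)
# Free space, in GB, on the filesystem that holds the current directory (assumption)
disk_gb=$(df -Pk . 2>/dev/null | awk 'NR==2 {printf "%d", $4 / 1048576}')
disk_gb=${disk_gb:-0}

if meets_minimum "$cores" "$ram_gb" "$disk_gb"; then
  echo "Workstation meets the minimum requirements."
else
  echo "Below minimum: ${cores} cores, ${ram_gb} GB RAM, ${disk_gb} GB free disk."
fi
```

The sketch does not detect NFS or NAS mounts; check the storage type separately.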

Install a container management system

A Data Collector engine runs as a container on a container management system, such as Docker or Podman.

Install Docker or Podman on the engine workstation.
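
After installation, you can confirm which container command is available. The following sketch assumes that the docker or podman executable is on the PATH; the detect_container_cli function name is hypothetical and the preference for docker over podman is arbitrary.

```shell
# Hypothetical helper: prints the first container CLI found on the PATH
detect_container_cli() {
  local cli
  for cli in docker podman; do
    if command -v "$cli" >/dev/null 2>&1; then
      echo "$cli"
      return 0
    fi
  done
  return 1
}

if CONTAINER_CLI=$(detect_container_cli); then
  # Print the installed version as a basic sanity check
  "$CONTAINER_CLI" --version
else
  echo "Neither docker nor podman was found on the PATH." >&2
fi
```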

Running the engine command

Use the command line to run a Data Collector engine as a container.

About this task

The procedure that you use to run the engine depends on the engine communication method:

Tunneling (default) - Running the engine with tunneling communication (default)
Direct - Running the engine with direct communication

For more information about the communication methods, see Engine communication.

Running the engine with tunneling communication (default)

Use the following procedure to run engines with tunneling communication. Engines use tunneling communication when your account uses the default tunneling communication method.

About this task

Important: Complete the required prerequisites before you run the engine command.

Procedure

In a UNIX shell such as Bash, export the API key that you created for the engine as a prerequisite:
```
export SSET_API_KEY=<api_key>
```
Retrieve the engine command from the StreamSets environment.
1. On the Manage tab of your project, click the StreamSets tool.
2. From the environment Options icon, click Get run command.
3. Click the Copy to Clipboard icon.
Paste the copied command into the command prompt.
If you use Podman instead of Docker, change docker to podman in the command.
Run the engine command.

When the engine successfully starts, the command prompt displays the engine container ID.
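
Before you paste the engine command, it can help to confirm that the API key is actually exported in the current shell. The following is a small sketch under that assumption; api_key_is_set is a hypothetical helper name, and it uses printenv so that a variable that is set but not exported is reported as missing.

```shell
# Hypothetical helper: succeeds only when SSET_API_KEY is exported and non-empty
api_key_is_set() {
  [ -n "$(printenv SSET_API_KEY)" ]
}

if api_key_is_set; then
  echo "SSET_API_KEY is exported; ready to run the engine command."
else
  echo "SSET_API_KEY is empty or not exported; export it first." >&2
fi

# After the engine command prints a container ID, you can confirm that the
# container is running. Replace docker with podman if needed:
#   docker ps --filter "id=<container_id>"
```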

Running the engine with direct communication

Use the following procedure to run engines with direct communication. Engines use direct communication when an administrator for your account has switched to the direct communication method.

About this task

Important: Complete the required prerequisites before you run the engine command.

Procedure

In a UNIX shell such as Bash, export the API key that you created for the engine as a prerequisite:
```
export SSET_API_KEY=<api_key>
```
Verify that the following command returns a valid hostname for the engine workstation:
```
echo $(hostname)
```
If the command does not return a valid result, then you must customize the copied engine command to specify the hostname.
Retrieve the engine command from the StreamSets environment.
1. On the Manage tab of your project, click the StreamSets tool.
2. From the environment Options icon, click Get run command.
3. Click the Copy to Clipboard icon.
Paste the copied command into the command prompt, making the following changes to the command as needed:
- If you use Podman instead of Docker, change docker to podman.
- If the echo $(hostname) command did not return a valid hostname, change $(hostname) to the specific hostname of the workstation. If you run the Data Collector engine on the same workstation as your web browser, you can usually use localhost as the hostname. For example, you might change the --hostname argument as follows:
  --hostname "localhost" \
  
  Note: To change the hostname, you can also edit the StreamSets environment to customize the engine command. When you customize the command, your change is retained each time that you retrieve the engine command.
Run the engine command.

When the engine successfully starts, the command prompt displays the engine container ID.
View the container logs:
```
<docker|podman> logs <container_id>
```
For example, use the following command for Docker: docker logs <container_id>

The command displays the engine URL.
Verify that your browser can reach the engine HTTPS address.
1. Copy the engine URL from the command prompt.
2. Paste the URL in a browser address bar, and then add /public-rest/is-running to the end of the URL, as follows:
```
https://<host_name>:18630/public-rest/is-running
```
3. If the browser displays a security warning for the address, choose the browser option to proceed to the address.
  
  For example, Google Chrome displays the message Your connection is not private and the error NET::ERR_CERT_AUTHORITY_INVALID. Click Advanced, and then click Proceed to <hostname> (unsafe).
  
  After the browser reaches the engine HTTPS address, the browser displays the following message: Engine is running.
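
The same check can be sketched from the command line with curl, which is useful on a workstation without a browser. This is an illustration, not part of the documented procedure: the is_running_url helper is a hypothetical name, and the -k option skips certificate verification, which is the command-line counterpart of accepting the browser security warning. It does not prove that your browser can reach the engine from a different machine.

```shell
# Hypothetical helper: appends the is-running path to an engine URL
is_running_url() {
  local base=${1%/}   # drop one trailing slash, if present
  echo "${base}/public-rest/is-running"
}

# Example (replace <host_name> with the host from the engine URL in the logs):
#   curl -k "$(is_running_url "https://<host_name>:18630")"
```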

Running multiple engines for a single environment

Run multiple engines for a single environment to increase processing capacity or to support job failover and high availability.

When you run multiple engines for an environment:

Processing capacity increases with each additional engine.
Jobs can start on any online engine within resource thresholds.
Jobs are assigned arbitrarily when more than one engine is available.
When an engine shuts down unexpectedly, jobs can fail over to another available engine, starting from the last-saved offset.

To run multiple engines, set up a separate workstation with the engine prerequisites and then run the engine command on the additional workstation.

Note: The number of engines that you run determines your StreamSets compute usage.

Job failover guidelines

When an environment uses multiple engines, jobs can fail over to another engine if the active engine becomes unavailable. The job restarts on an available engine that is online and within the defined resource thresholds. The job continues from the last-saved offset.

Because the new engine starts from the last-saved offset recorded by the previous engine, data from a batch that was in progress when the job stopped might be processed again and duplicated. For more information, see Delivery guarantee.

A job can fail over up to three times. After the job reaches the maximum number of failover retries, the job run fails.

Before you run multiple engines to support job failover and high availability, review the following guidelines:

Verify that source stages maintain offsets

Confirm that the source stages in your flows maintain offsets. Most source stages save the offset while processing data, so subsequent job runs continue from the last-saved offset. You can run jobs that use these source stages on environments with multiple engines.

However, some source stages do not maintain offsets. When a job with one of these source stages restarts, it processes data from the initial offset, which causes duplicate data processing. For jobs with these source stages, use an environment with a single engine.

For a list of source stages that maintain offsets, see Sources that maintain offsets.

Verify that all engines can access source and target systems

Confirm that all engines in an environment can access the source and target systems that your flows use.

For example, when the source is an external system such as a relational database or Elasticsearch, any engine in the network can continue processing from the last-saved offset recorded by another engine.

However, when the source is tied to a particular engine workstation, other engines cannot continue processing from the last-saved offset.

For example, a job uses a Directory source that reads from a local directory on the engine workstation. If the engine shuts down unexpectedly, no other engine can access the local files. For jobs with these sources, use an environment with a single engine.

Configure source system resiliency

Job failover provides high availability for job processing but not for incoming data. Job failover might take several minutes. To prevent data loss during failover, configure the source system to support job failover.

For example, a flow uses a source stage that listens for client requests, such as the HTTP Server or WebSocket Server source. When the job fails over, the client might continue sending requests during the downtime, which can result in lost data. To avoid data loss, configure the source system in the following ways:

Configure clients to resend requests when errors occur while sending data.
Set up load balancing on the source system to redirect client requests to the remaining running engines during failover.
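
As an illustration of the first option, a client script can wrap its send command in a retry loop. The following is a generic sketch, not a StreamSets feature: the retry function, the attempt count, and the delay are all assumptions, and the delay should be tuned to cover the several minutes that failover might take.

```shell
# Hypothetical helper: runs a command until it succeeds,
# up to max_attempts times, waiting delay seconds between attempts.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after ${attempt} attempts." >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: resend a request up to 10 times, 30 seconds apart.
# The URL and the payload file are placeholders:
#   retry 10 30 curl --fail --silent --show-error \
#     --data-binary @record.json "https://<host_name>:<port>/"
```

Retrying can deliver the same request more than once, so this approach suits flows that tolerate duplicate records.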

Customizing the engine command

About this task

You can customize the engine run command to add environment variables or Docker or Podman command options. To customize the command, edit the StreamSets environment details. The customization is then included each time that you retrieve the command.

Important: Use caution when customizing the engine run command. If you add environment variables or command options with the incorrect syntax or configuration, the engine run command might fail.

You can add the following information to customize the command:

Environment variables

Add environment variables that you want the engine container to use.

You can add any environment variable, with the following restrictions:

An environment variable name cannot include the equal sign (=).
You cannot add or override the environment variables that are included in the default command, such as SSET_PROJECT_ID and SSET_BASE_URL.
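
These restrictions can be expressed as a quick pre-check before you enter a key in the environment. The sketch below is illustrative only: valid_env_key is a hypothetical helper, the reserved list contains just the two names mentioned above and is not exhaustive, and rejecting an empty name is an added assumption.

```shell
# Hypothetical helper: succeeds when a variable name passes the restrictions above
valid_env_key() {
  case "$1" in
    '')    return 1 ;;   # empty name (assumption, not stated in the restrictions)
    *'='*) return 1 ;;   # name contains an equal sign
    SSET_PROJECT_ID|SSET_BASE_URL) return 1 ;;   # included in the default command
  esac
  return 0
}

valid_env_key "MY_SETTING" && echo "MY_SETTING can be added"
valid_env_key "SSET_BASE_URL" || echo "SSET_BASE_URL cannot be overridden"
```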

Docker or Podman command options

Add Docker or Podman run command options.

For example, you might want to add the following options:

mount - Defines the path to the keystore file created for engine HTTPS communication or the path to the credential store properties file. For more information, see Enabling HTTPS host verification or Configuring credential stores.
hostname - Defines the engine workstation name when using the direct engine REST API communication method and the echo $(hostname) command does not return a valid hostname. For more information, see Running the engine command.

You can add any Docker or Podman run command options, with the following restrictions:

You cannot use the cpus option to override the number of VPCs allocated to the engine container. You define this value when you create the StreamSets environment.
Do not use the e or env option to add an environment variable. Instead, add an environment variable as a key-value pair in the Environment variables section.
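
A similar pre-check can flag restricted command options. This sketch assumes the standard Docker and Podman spellings of the restricted options (--cpus, -e, and --env); allowed_run_option is a hypothetical helper that inspects only the option name at the start of the value, so compact forms such as -eNAME=value are not detected.

```shell
# Hypothetical helper: succeeds unless the value starts with a restricted option
allowed_run_option() {
  local name=${1%%[ =]*}   # keep the text before the first space or equal sign
  case "$name" in
    --cpus|-e|--env) return 1 ;;
    *) return 0 ;;
  esac
}

allowed_run_option '--hostname "localhost"' && echo "hostname option is allowed"
allowed_run_option '--cpus=4' || echo "cpus option is restricted"
```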

Procedure

If the engine is running, stop the engine.
1. Determine the container ID for the engine:
```
<docker|podman> ps
```
  For example, use the following command for Docker: docker ps
2. Copy the ID of the container that you want to update.
3. Stop the engine:
```
<docker|podman> stop <container_id>
```
On the Manage tab of your project, click the StreamSets tool.
For the environment, click Options > Edit environment.
Expand the Advanced configuration section.
Add an environment variable.
1. In the Environment variables section, click Add value.
2. Enter the environment variable name under Key and the environment variable value under Value.
Add a Docker or Podman command option.
1. In the Docker command options section, click Add value.
2. Enter the command option.
  For example, to define the engine workstation hostname, add the following value:
```
--hostname "localhost"
```
  Important: If you add multiple command options, add each option as a separate value.
Save your changes.
For the environment, click Options > Get run command, and then copy the command.

Notice that the copied command includes your customization.
Run the customized engine command.