Running a Data Collector engine

After you create a StreamSets environment, run one or more Data Collector engines for the environment.

Each engine runs on a separate workstation in your corporate network. Set up each workstation to meet the engine prerequisites. Then use the command line to run an engine as a container on a container management system such as Docker or Podman.

Run multiple engines for an environment to increase processing capacity or to support high availability. For more information, see Running multiple engines for a single environment.

Complete the following tasks to run a Data Collector engine:
  1. Complete the prerequisites
  2. Run the engine command

Prerequisites

Before you run an engine, complete the required prerequisites.

Complete account prerequisites once. You can use the same API keys for all environments and engines in your account.

Complete engine workstation prerequisites on each workstation where you plan to run an engine.

Account prerequisites

Before you run an engine, complete the following prerequisites for your IBM watsonx account. You complete these steps only once and can reuse the API keys for all environments and engines.

Create a user API key (task credentials)

Running StreamSets jobs requires task credentials for secure authorization. Task credentials take the form of a user-generated API key, securely stored in Vault, that authenticates long-running data integration tasks. For more information, see Creating task credentials for jobs.

Create a cloud account API key

Running a Data Collector engine requires a cloud account API key for secure authorization. You provide this key when you run the engine command.

IBM Cloud

To create an API key for watsonx.data integration on IBM Cloud:

  1. In your watsonx account on IBM Cloud, from the navigation menu, select Administration > Access (IAM).
  2. In the IBM Cloud console, select API keys.
  3. Click Create.
  4. Save or download the API key value.

AWS

To create an API key for watsonx.data integration on AWS:

  1. In your AWS account, from the navigation menu, select Administration > Access (IAM) > Personal API keys.
  2. Click Create.
  3. Save or download the API key value.

Engine workstation prerequisites

Complete the following prerequisites on every workstation where you plan to run an engine.

Verify minimum system requirements

Verify that the engine workstation meets the following minimum requirements:

Component          Minimum requirement
Operating system   Any Linux distribution
Cores              2
RAM                4 GB
Disk space         6 GB

Note: Do not use NFS or NAS to store Data Collector files.
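As a quick sanity check, a short script like the following can compare a Linux workstation against the minimums above. The thresholds come from the table; the commands assume GNU coreutils and procfs, and checking the root filesystem is an assumption about where the engine files live:

```shell
#!/bin/sh
# Compare this workstation against the minimum requirements above.
# Thresholds (2 cores, 4 GB RAM, 6 GB disk) are from the table; checking
# the root filesystem ("/") is an assumption about the install location.
cores=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
disk_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
echo "cores=$cores mem_gb=$mem_gb disk_gb=$disk_gb"
[ "$cores" -ge 2 ]   || echo "WARN: fewer than 2 cores"
[ "$mem_gb" -ge 4 ]  || echo "WARN: less than 4 GB RAM"
[ "$disk_gb" -ge 6 ] || echo "WARN: less than 6 GB free disk space"
```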

Install a container management system

A Data Collector engine runs as a container on a container management system, such as Docker or Podman.

Install Docker or Podman on the engine workstation.

Configure firewall access

If the engine workstation is behind a firewall, configure outbound access to required systems.

For more information, see Firewall access for StreamSets.
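To spot a blocked outbound path early, you can probe a required endpoint from the workstation before starting the engine. The host below is only a placeholder; substitute the actual endpoints listed in Firewall access for StreamSets:

```shell
# Probe outbound HTTPS from the engine workstation. cloud.ibm.com is a
# placeholder; use the hosts listed in Firewall access for StreamSets.
host=cloud.ibm.com
if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$host/443" 2>/dev/null; then
  result="reachable"
else
  result="blocked"
fi
echo "outbound 443 to $host: $result"
```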

Running the engine command

Use the command line to run a Data Collector engine as a container.

About this task

The procedure that you use to run the engine depends on your engine version and communication method. For more information about the communication methods, see Engine communication.

Engine version         Communication method and procedure
7.1.0-0115 and later   Tunneling (default) - Running the engine with tunneling communication; or Direct - Running the engine with direct communication, when configured for your account
7.1.0 and earlier      Direct - Running the engine with direct communication

Running the engine with tunneling communication (default)

Use the following procedure to run an engine that uses tunneling communication.

About this task

An engine uses tunneling communication when the Data Collector engine version is 7.1.0-0115 and later and the default tunneling communication method is configured for your account.

Important: Complete the required prerequisites before you run the engine command.

Procedure

  1. In a UNIX shell such as Bash, export the API key that you created for the engine as a prerequisite:
    export SSET_API_KEY=<api_key>
  2. Retrieve the engine command from the StreamSets environment.
    1. On the Manage tab of your project, click the StreamSets tool.
    2. From the environment Options menu, click Get run command.
    3. Click the Copy to Clipboard icon.
  3. Paste the copied command into the command prompt.
  4. If you use Podman instead of Docker, change docker to podman in the command.
  5. Run the engine command.

    When the engine successfully starts, the command prompt displays the engine container ID.
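Step 4 above can also be scripted: rewrite the runtime in the copied command before running it. The command string below is a simplified stand-in for the one you copy from your environment, and the image name is hypothetical:

```shell
# Simplified stand-in for a copied engine run command; the real command,
# image name, and options come from Get run command in your environment.
cmd='docker run -d --hostname "$(hostname)" <engine_image>'
# Step 4: if you use Podman instead of Docker, change docker to podman.
converted=$(printf '%s\n' "$cmd" | sed 's/^docker/podman/')
echo "$converted"
```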

Running the engine with direct communication

Use the following procedure to run an engine that uses direct communication.

About this task

An engine uses direct communication in the following situations:
  • The Data Collector engine version is 7.1.0 and earlier.
  • The Data Collector engine version is 7.1.0-0115 and later and the direct engine communication method is configured for your account.
Important: Complete the required prerequisites before you run the engine command.

Procedure

  1. In a UNIX shell such as Bash, export the API key that you created for the engine as a prerequisite:
    export SSET_API_KEY=<api_key>
  2. Verify that the following command returns a valid hostname for the engine workstation:
    echo $(hostname)

    If the command does not return a valid result, then you must customize the copied engine command to specify the hostname.

  3. Retrieve the engine command from the StreamSets environment.
    1. On the Manage tab of your project, click the StreamSets tool.
    2. From the environment Options menu, click Get run command.
    3. Click the Copy to Clipboard icon.
  4. Paste the copied command into the command prompt, making the following changes to the command as needed:
    • If you use Podman instead of Docker, change docker to podman.
    • If the echo $(hostname) command did not return a valid hostname, change $(hostname) to the specific hostname of the workstation. If you run the Data Collector engine on the same workstation as your web browser, you can usually use localhost as the hostname. For example, you might change the --hostname argument as follows:

      --hostname "localhost" \

      Note: To change the hostname, you can also edit the StreamSets environment to customize the engine command. When you customize the command, your change is retained each time that you retrieve the engine command.
  5. Run the engine command.

    When the engine successfully starts, the command prompt displays the engine container ID.

  6. View the container logs:
    <docker|podman> logs <container_id>

    For example, use the following command for Docker: docker logs <container_id>

    The command displays the engine URL.

  7. Verify that your browser can reach the engine HTTPS address.
    1. Copy the engine URL from the command prompt.
    2. Paste the URL in a browser address bar, and then add /public-rest/is-running to the end of the URL, as follows:
      https://<host_name>:18630/public-rest/is-running
    3. If the browser displays a security warning for the address, accept the browser options to proceed to the address.

      For example, Google Chrome displays the message Your connection is not private and the error NET::ERR_CERT_AUTHORITY_INVALID. Click Advanced, and then click Proceed to <hostname> (unsafe).

      After the browser reaches the engine HTTPS address, the browser displays the following message: Engine is running.
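The hostname check in step 2 can be wrapped with a fallback, as sketched below. Falling back to localhost is an assumption that holds only when the browser and engine run on the same workstation:

```shell
# Step 2 sketch: verify that $(hostname) returns a usable value; otherwise
# substitute a specific hostname in the copied command. localhost is valid
# only when the browser runs on the engine workstation.
HOST="$(hostname)"
if [ -z "$HOST" ]; then
  HOST=localhost
fi
echo "engine hostname: $HOST"
# The health URL checked in step 7 would then be:
echo "https://$HOST:18630/public-rest/is-running"
```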

Running multiple engines for a single environment

Run multiple engines for a single environment to increase processing capacity or to support job failover and high availability.

When you run multiple engines for an environment:
  • Processing capacity increases with each additional engine.
  • Jobs can start on any online engine within resource thresholds.
  • Jobs are assigned arbitrarily when more than one engine is available.
  • When an engine shuts down unexpectedly, jobs can fail over to another available engine, starting from the last-saved offset.

To run multiple engines, set up a separate workstation with the engine prerequisites and then run the engine command on the additional workstation.

Note: The number of engines that you run determines your StreamSets compute usage. For more information, see Monitoring account resource usage.

Job failover guidelines

When an environment uses multiple engines, jobs can fail over to another engine if the active engine becomes unavailable. The job restarts on an available engine that is online and within the defined resource thresholds. The job continues from the last-saved offset.

The new engine starts processing from the last saved offset recorded by the previous engine. However, if the job stopped while processing a batch of data, some data might be processed again and can be duplicated. For more information, see Delivery guarantee.

Jobs can fail over when the environment uses Data Collector engine version 6.4 or later.

A job can fail over up to three times. After the job reaches the maximum number of failover retries, the job run fails.

Before you run multiple engines to support job failover and high availability, review the following guidelines:

Verify that source stages maintain offsets
Confirm that the source stages in your flows maintain offsets. Most source stages save the offset while processing data, so subsequent job runs continue from the last-saved offset. You can run these jobs on environments with multiple engines.
However, some source stages do not maintain offsets. When these jobs restart, they process from the initial offset, which causes duplicate data processing. For jobs with these source stages, use an environment with a single engine.
For a list of source stages that maintain offsets, see Sources that maintain offsets.
Verify that all engines can access source and target systems
Confirm that all engines in an environment can access the source and target systems that your flows use.
For example, when the source is an external system such as a relational database or Elasticsearch, any engine in the network can continue processing from the last-saved offset recorded by another engine.
However, when the source is tied to a particular engine workstation, other engines cannot continue processing from the last-saved offset.
For example, a job uses a Directory source that reads from a local directory on the engine workstation. If the engine shuts down unexpectedly, no other engine can access the local files. For jobs with these sources, use an environment with a single engine.
Configure source system resiliency
Job failover provides high availability for job processing but not for incoming data. Job failover might take several minutes. To prevent data loss during failover, configure the source system to support job failover.
For example, a flow uses a source stage that listens for client requests, such as the HTTP Server or WebSocket Server source. When the job fails over, the client might continue sending requests during the downtime, which can result in lost data. To avoid data loss, configure the source system in the following ways:
  • Configure clients to resend requests when errors occur while sending data.
  • Set up load balancing on the source system to redirect client requests to the remaining running engines during failover.
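The client-side retry guideline above can be sketched as a bounded resend loop. Here, send_request is a stand-in for the real client call, not a StreamSets API; it is simulated to fail twice, as if requests arrive while the job is failing over:

```shell
# Sketch of client-side resend-on-error. send_request simulates a client
# call that fails during failover downtime and succeeds once an engine is
# processing again (here, on the third attempt).
send_request() { [ "$1" -ge 3 ]; }
status="failed after 3 attempts"
for attempt in 1 2 3; do
  if send_request "$attempt"; then
    status="succeeded on attempt $attempt"
    break
  fi
  sleep 1   # back off before resending
done
echo "$status"
```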

Customizing the engine command

About this task

You can customize the engine run command to add environment variables or Docker or Podman command options. To customize the command, you edit the StreamSets environment details; the customization is then included each time that you retrieve the command.

Important: Use caution when customizing the engine run command. If you add environment variables or command options with the incorrect syntax or configuration, the engine run command might fail.

You can add the following information to customize the command:

Environment variables
Add environment variables that you want the engine container to use.
You can add any environment variable, with the following restrictions:
  • An environment variable name cannot include the equal sign (=).
  • You cannot add or override the environment variables that are included in the default command, such as SSET_PROJECT_ID and SSET_BASE_URL.
Docker or Podman command options
Add Docker or Podman run command options.
For example, you might want to add the following options:
  • --mount to define the path to the keystore file created for engine HTTPS communication, or to define the path to the credential store properties file. For more information, see Enabling HTTPS host verification or Configuring credential stores.
  • --hostname to define the engine workstation name when you use the direct engine REST API communication method and the echo $(hostname) command does not return a valid hostname. For more information, see Running the engine command.
You can add any Docker or Podman run command options, with the following restrictions:
  • You cannot use the --cpus option to override the number of VPCs allocated to the engine container. You define this value when you create the StreamSets environment.
  • Do not use the -e or --env option to add an environment variable. Instead, add environment variables as key-value pairs in the Environment variables section.
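For illustration, a run command customized with a mount option might end up looking like the fragment below. The paths and image name are hypothetical placeholders, and in practice you add the option through the environment's Advanced configurations rather than by editing the command by hand:

```shell
# Hypothetical result of adding a mount option via Advanced configurations.
# The source and target paths and the image name are placeholders.
docker run -d \
  --mount type=bind,source=/opt/sdc/keystore,target=/etc/sdc/keystore \
  <engine_image>
```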

Procedure

  1. If the engine is running, stop the engine.
    1. Determine the container ID for the engine:
      <docker|podman> ps

      For example, use the following command for Docker: docker ps

    2. Copy the ID of the container that you want to update.
    3. Stop the engine:
      <docker|podman> stop <container_id>
  2. On the Manage tab of your project, click the StreamSets tool.
  3. For the environment, click Options > Edit environment.
  4. In the Advanced configurations section, click Click to configure.
  5. Add an environment variable.
    1. In the Environment variables section, click Add value.
    2. Enter the environment variable name under Key and the environment variable value under Value.
  6. Add a Docker or Podman command option.
    1. In the Docker command options section, click Add value.
    2. Enter the command option.
      For example, to define the engine workstation hostname, add the following value:
      --hostname "localhost"
      Important: If you add multiple command options, add each option as a separate value.
  7. Save your changes.
  8. For the environment, click Options > Get run command, and then copy the command.

    Notice that the copied command includes your customization.

  9. Run the customized engine command.