Running a Data Collector engine

You run a Data Collector engine in the location where data resides, which can be on-premises or on a protected cloud computing platform.

Complete the following tasks to run a Data Collector engine:
  1. Complete the prerequisites
  2. Create a StreamSets environment
  3. Run the engine command

Prerequisites

Verify minimum system requirements

Verify that the engine workstation meets the following minimum requirements:

Component         Minimum requirement
Operating system  Any Linux distribution
Cores             2
RAM               4 GB
Disk space        6 GB
Note: Do not use NFS or NAS to store Data Collector files.
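
The following is a minimal sketch of how you might check a workstation against these minimums before you continue. It assumes a Linux host with nproc, awk, and df available, and it assumes that the disk space requirement refers to free space on the file system that holds the current directory; the source does not specify either point. The helper names meets_minimum and report are hypothetical.

```shell
#!/usr/bin/env bash
# Minimum values copied from the requirements table.
MIN_CORES=2
MIN_RAM_GB=4
MIN_DISK_GB=6

# Hypothetical helper: succeeds when the actual value meets the minimum.
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# Hypothetical helper: prints one PASS or FAIL line per component.
report() {
  if meets_minimum "$2" "$3"; then
    echo "PASS $1: $2 (minimum $3)"
  else
    echo "FAIL $1: $2 (minimum $3)"
  fi
}

cores=$(nproc)
# MemTotal is reported in kB. Round to the nearest GB, because a host sold
# as "4 GB" typically reports slightly less than 4 GB of usable memory.
ram_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024 + 0.5) }' /proc/meminfo)
# Available space, rounded down to whole GB, on the current file system
# (assumption: this is where the engine files will be stored).
disk_gb=$(df -Pk . | awk 'NR == 2 { print int($4 / 1024 / 1024) }')

report "Cores" "$cores" "$MIN_CORES"
report "RAM (GB)" "$ram_gb" "$MIN_RAM_GB"
report "Disk space (GB)" "$disk_gb" "$MIN_DISK_GB"
```

The script only reports results and does not check whether the storage is NFS or NAS, so verify the storage type separately.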

Install a container management system

You run a Data Collector engine as a container on a container management system, such as Docker or Podman.

Verify that Docker or Podman is installed on the engine workstation.
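
One way to verify the installation is to look for either executable on the PATH, as in the following sketch. The helper name detect_runtime is hypothetical, and the order of preference (docker first, then podman) is an arbitrary choice for illustration.

```shell
# Hypothetical helper: prints the first container runtime found on PATH
# and fails when neither is installed.
detect_runtime() {
  for candidate in docker podman; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

if runtime=$(detect_runtime); then
  echo "Found container runtime: $runtime"
  # Both docker and podman accept --version.
  "$runtime" --version
else
  echo "Neither docker nor podman was found on PATH" >&2
fi
```

Finding the executable does not confirm that the Docker daemon is running or that your user account is allowed to start containers.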

Configure firewall access

If the engine workstation is behind a firewall, configure the firewall to allow outbound connections to several systems.

For more information, see Firewall access for StreamSets.

Create a user API key

Running StreamSets jobs requires a user API key for secure authorization.

To verify whether your account has an active user API key, click your avatar and select Profile and settings to open your account profile. Select User API key to view the Active keys.

If you do not have an active API key, create a key by clicking Create a key.

Create an IBM Cloud API key

Running a Data Collector engine requires an IBM Cloud API key for secure authorization. You enter the key value when you run the engine command.

To verify whether your account has an active IBM Cloud API key, choose Administration > Access (IAM) from the navigation menu. In the IBM Cloud console, choose API keys from the navigation menu. The list of your active keys is displayed.

If you do not have an active IBM Cloud API key, click Create. Save or download the API key value.

For more information, see Managing user API keys in the IBM Cloud documentation.

Creating a StreamSets environment

Create a StreamSets environment for your project. An environment defines the Data Collector engine version, engine configuration, and the stage libraries to install on the engine. The installed stage libraries determine the stages, such as sources and targets, that you can use in flows.

About this task

After you save the environment, you cannot change the engine version.

Procedure

  1. On the Manage tab of your project, click the StreamSets tool.
  2. Click New environment.
  3. Configure the following properties:
    Property                       Description
    Name                           Name of the environment.
    Description                    Optional description that informs your team of the environment use case.
    Data Collector engine version  Data Collector engine version to run.
  4. In the Configure details section, configure the following properties. You can use the defaults to get started.
    Property Description
    Stage libraries

    Stage libraries to install on the engine. The installed stage libraries determine the stages, such as sources and targets, that you can use in flows.

    You can use the default stage libraries to get started. To install more, click Select stage libraries.

    External resources

    Archive file that contains the external resources that are used by the engine. The archive file must be in TGZ or compressed file format, use the required directory structure, and be imported as a data asset in your project.

    Max CPU load

    Maximum percentage of CPU in the container that an engine can use. When the CPU load of an engine equals or exceeds this threshold, new jobs do not start on the engine.

    This threshold is monitored with engine versions 6.4 and later.

    Default is 80.

    Max memory used

    Maximum percentage of the configured Java heap size that an engine can use. When the memory use of an engine equals or exceeds this threshold, new jobs do not start on the engine.

    This threshold is monitored with engine versions 6.4 and later.

    Default is 100.

    Max jobs running

    Maximum number of jobs that can run on an engine at the same time. When the number of jobs running on an engine equals this threshold, new jobs do not start on the engine.

    This threshold is monitored with engine versions 6.4 and later.

    Default is 10.

    VPCs allocated to engine

    Number of VPCs allocated to the engine container.

    Default is 4.

    Advanced configuration

    Advanced configurations for the engine. Use to define the following properties:
    Note: By default, the engine uses HTTPS and a self-signed SSL/TLS certificate that you can use to quickly enable HTTPS and start building flows. To use more secure communication, create a keystore file. For more information, see Enabling HTTPS.
  5. Click Save.
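
To illustrate how the three thresholds described in this procedure combine, the following sketch models the rule "new jobs do not start when any threshold is reached" with the documented defaults. It is only a reading of the property descriptions, not the way the engine is implemented, and the helper name can_start_job is hypothetical.

```shell
# Defaults copied from the property descriptions.
MAX_CPU_LOAD=80      # percent of CPU in the container
MAX_MEMORY_USED=100  # percent of the configured Java heap size
MAX_JOBS_RUNNING=10  # jobs running at the same time

# Hypothetical helper: succeeds only when every measured value is still
# below its threshold. Arguments: CPU percent, memory percent, job count.
can_start_job() {
  [ "$1" -lt "$MAX_CPU_LOAD" ] &&
    [ "$2" -lt "$MAX_MEMORY_USED" ] &&
    [ "$3" -lt "$MAX_JOBS_RUNNING" ]
}

# 45% CPU, 60% heap, 3 jobs: all values are below their thresholds.
if can_start_job 45 60 3; then echo "start"; else echo "hold"; fi  # prints "start"

# 80% CPU equals the Max CPU load threshold, so the job is held.
if can_start_job 80 60 3; then echo "start"; else echo "hold"; fi  # prints "hold"
```

Because reaching any single threshold is enough to hold new jobs, lowering one value makes the engine more conservative regardless of the other two.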

Running the engine command

Use the command line to run a Data Collector engine as a container.

About this task

Complete the required prerequisites before you run the engine command.

Procedure

  1. In a UNIX shell such as Bash, export the IBM Cloud API key that you created for the engine as a prerequisite:
    export SSET_API_KEY=<api_key>
  2. Verify that the following command returns a valid hostname for the engine workstation:
    echo $(hostname)

    If the command does not return a valid result, then you must customize the copied engine command to specify the hostname.

  3. Retrieve the engine command from the StreamSets environment.
    1. On the Manage tab of your project, click the StreamSets tool.
    2. From the environment Options icon, click Get run command.
    3. Click the Copy to Clipboard icon.
  4. Paste the copied command into the command prompt, making the following changes as needed:
    • If you use Podman, change docker to podman.
    • If the echo $(hostname) command did not return a valid hostname, change $(hostname) to the specific hostname of the workstation. If you run the Data Collector engine on the same workstation as your web browser, you can usually use localhost as the hostname. For example, you might change the --hostname argument as follows:

      --hostname "localhost" \

      Note: To change the hostname, you can also edit the StreamSets environment to customize the engine command. When you customize the command, your change is retained each time that you retrieve the engine command.
  5. Run the engine command.

    When the engine successfully starts, the command prompt displays the engine container ID.

  6. View the container logs:
    <docker|podman> logs <container_id>

    For example, use the following command for Docker: docker logs <container_id>

    The command displays the engine URL.

  7. If you used the default self-signed SSL/TLS certificate for the engine, verify that your browser can reach the engine HTTPS address.
    1. Copy the engine URL from the command prompt.
    2. Paste the URL in a browser address bar, and then add /public-rest/is-running to the end of the URL, as follows:
      https://<host_name>:18630/public-rest/is-running
    3. If the browser displays a security warning for the address, accept the browser options to proceed to the address.

      For example, Google Chrome displays the message Your connection is not private and the error NET::ERR_CERT_AUTHORITY_INVALID. Click Advanced, and then click Proceed to <hostname> (unsafe).

      After the browser reaches the engine HTTPS address, the browser displays the following message: Engine is running.
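
The hostname check in step 2 and the HTTPS check in step 7 can also be scripted, as in the following sketch. It makes several assumptions that the source does not state: that a "valid" hostname is a non-empty string of letters, digits, dots, and hyphens; that localhost is an acceptable fallback; and that curl is installed. The helper names is_valid_hostname and is_running_url are hypothetical. The port and path come from the URL shown in step 7.

```shell
# Hypothetical helper: succeeds when the value is non-empty and contains
# only letters, digits, dots, and hyphens (an assumed definition of valid).
is_valid_hostname() {
  case "$1" in
    ""|*[!A-Za-z0-9.-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: builds the is-running URL for a host.
is_running_url() {
  echo "https://$1:18630/public-rest/is-running"
}

host=$(hostname)
if ! is_valid_hostname "$host"; then
  # Assumed fallback for an engine that runs on the same workstation.
  host=localhost
fi

url=$(is_running_url "$host")
echo "Checking $url"
# -k skips certificate verification, which the default self-signed
# certificate would otherwise fail. --max-time bounds the wait to 5 seconds.
curl -k --silent --max-time 5 "$url" || echo "Engine is not reachable at $url"
```

Skipping certificate verification is reasonable only for this quick check against the default self-signed certificate.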

Customizing the engine command

About this task

You can customize the engine run command to add environment variables or Docker or Podman command options. You edit the StreamSets environment details to customize the command. The StreamSets environment includes the customization when you retrieve the command.

Important: Use caution when customizing the engine run command. If you add environment variables or command options with the incorrect syntax or configuration, the engine run command might fail.

You can add the following information to customize the command:

Environment variables
  Add environment variables that you want the engine container to use.
  You can add any environment variable, with the following restrictions:
    • An environment variable name cannot include the equal sign (=).
    • You cannot add or override the environment variables that are included in the default command, such as SSET_PROJECT_ID and SSET_BASE_URL.

Docker or Podman command options
  Add Docker or Podman run command options.
  For example, you might want to add the following options:
    • --mount to define the path to the keystore file created for engine HTTPS communication or to define the path to the credential store properties file. For more information, see Enabling HTTPS or Configuring credential stores.
    • --hostname to define the engine workstation name when the echo $(hostname) command does not return a valid hostname. For more information, see Running the engine command.
  You can add any Docker or Podman run command options, with the following restrictions:
    • You cannot use the --cpus option to override the number of VPCs allocated to the engine container. You define this value when you create the StreamSets environment.
    • Do not use the -e or --env option to add an environment variable. Instead, add an environment variable as a key-value pair in the Environment variables section.
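
Before you enter values in the environment, you might pre-check them against these restrictions with a sketch like the following. The helper names valid_env_name and valid_command_option are hypothetical, and RESERVED_VARS lists only the two variables that are named above as examples; the default command might include others.

```shell
# Only the two reserved names given as examples; not a complete list.
RESERVED_VARS="SSET_PROJECT_ID SSET_BASE_URL"

# Hypothetical helper: rejects empty names, names that contain an equal
# sign, and names of variables from the default command.
valid_env_name() {
  case "$1" in
    ""|*=*) return 1 ;;
  esac
  for reserved in $RESERVED_VARS; do
    if [ "$1" = "$reserved" ]; then
      return 1
    fi
  done
  return 0
}

# Hypothetical helper: rejects the --cpus, -e, and --env options, whether
# the value is attached with "=" or separated by a space.
valid_command_option() {
  case "$1" in
    --cpus|--cpus=*|"--cpus "*) return 1 ;;
    -e|"-e "*|--env|--env=*|"--env "*) return 1 ;;
    *) return 0 ;;
  esac
}

for name in MY_PROXY "BAD=NAME" SSET_BASE_URL; do
  if valid_env_name "$name"; then echo "ok: $name"; else echo "rejected: $name"; fi
done

for option in '--hostname "localhost"' "--cpus 2" "-e FOO=bar"; do
  if valid_command_option "$option"; then echo "ok: $option"; else echo "rejected: $option"; fi
done
```

The checks cover only the restrictions listed above. They do not confirm that an option has valid Docker or Podman syntax, so the caution about incorrect syntax still applies.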

Procedure

  1. If the engine is running, stop the engine.
    1. Determine the container ID for the engine:
      <docker|podman> ps

      For example, use the following command for Docker: docker ps

    2. Copy the ID of the container that you want to update.
    3. Stop the engine:
      <docker|podman> stop <container_id>
  2. On the Manage tab of your project, click the StreamSets tool.
  3. For the environment, click Options > Edit environment.
  4. In the Advanced configurations section, click Click to configure.
  5. Add an environment variable.
    1. In the Environment variables section, click Add value.
    2. Enter the environment variable name under Key and the environment variable value under Value.
  6. Add a Docker or Podman command option.
    1. In the Docker command options section, click Add value.
    2. Enter the command option.
      For example, to define the engine workstation hostname, add the following value:
      --hostname "localhost"
      Important: If you add multiple command options, add each option as a separate value.
  7. Save your changes.
  8. For the environment, click Options > Get run command, and then copy the command.

    Notice that the copied command includes your customization.

  9. Run the customized engine command.