Running a Data Collector engine
You run a Data Collector engine in the location where data resides, which can be on-premises or on a protected cloud computing platform.
Prerequisites
Verify minimum system requirements
Verify that the engine workstation meets the following minimum requirements:
| Component | Minimum requirement |
|---|---|
| Operating system | Any Linux distribution |
| Cores | 2 |
| RAM | 4 GB |
| Disk space | 6 GB. Note: Do not use NFS or NAS to store Data Collector files. |
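The table above can be turned into a quick preflight check. The following is a minimal sketch, not an official StreamSets tool. It assumes that "GB" means 10^9 bytes, that the Data Collector files will live on the file system containing the `path` argument (a hypothetical default of `/`), and that free disk space is the relevant measure. It does not detect whether the path is on NFS or NAS storage.

```python
import os
import platform
import shutil

# Minimum values from the requirements table.
# Assumption: 1 GB is treated as 10**9 bytes.
MIN_CORES = 2
MIN_RAM_BYTES = 4 * 10**9
MIN_DISK_BYTES = 6 * 10**9


def check_requirements(system, cores, ram_bytes, disk_bytes):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if system != "Linux":
        problems.append(f"operating system is {system}, expected Linux")
    if cores < MIN_CORES:
        problems.append(f"{cores} cores found, need at least {MIN_CORES}")
    if ram_bytes < MIN_RAM_BYTES:
        problems.append("less than 4 GB of RAM")
    if disk_bytes < MIN_DISK_BYTES:
        problems.append("less than 6 GB of free disk space")
    return problems


def probe_local_machine(path="/"):
    """Collect OS name, core count, physical RAM, and free disk space.

    Uses os.sysconf, so this function is intended for Linux hosts.
    The default path is a placeholder for the engine's storage location.
    """
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return (
        platform.system(),
        os.cpu_count() or 0,
        ram_bytes,
        shutil.disk_usage(path).free,
    )


if __name__ == "__main__":
    found = check_requirements(*probe_local_machine())
    for problem in found:
        print("FAIL:", problem)
    if not found:
        print("All minimum requirements are met.")
```

Keeping `check_requirements` separate from the probing code makes it possible to evaluate values gathered from another machine.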
Install a container management system
You run a Data Collector engine as a container on a container management system, such as Docker or Podman.
Verify that Docker or Podman is installed on the engine workstation.
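One way to script this verification is to look for the `docker` or `podman` executable on the `PATH`. The sketch below assumes the executables use those default names. Finding an executable does not prove that the Docker daemon is running or that your user may start containers, so treat it as a first check only.

```python
import shutil


def find_container_runtime(candidates=("docker", "podman"), which=shutil.which):
    """Return (name, path) for the first candidate found on PATH, or None.

    The `which` parameter exists only so the lookup can be replaced in tests.
    """
    for name in candidates:
        path = which(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    found = find_container_runtime()
    if found:
        print(f"Found {found[0]} at {found[1]}")
    else:
        print("Neither docker nor podman was found on PATH")
```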
Configure firewall access
If the engine workstation is behind a firewall, configure the firewall to allow outbound connections to several systems.
For more information, see Firewall access for StreamSets.
Create a user API key
Running StreamSets jobs requires a user API key for secure authorization.
To verify whether your account has an active user API key, click your avatar and select Profile and settings to open your account profile. Select User API key to view the Active keys.
If you do not have an active API key, click Create a key.
Create an IBM Cloud API key
Running a Data Collector engine requires an IBM Cloud API key for secure authorization. You enter the key value when you run the engine command.
To verify whether your account has an active IBM Cloud API key, open the IBM Cloud console and choose API keys from the navigation menu. Your list of active keys displays.
If you do not have an active IBM Cloud API key, click Create. Save or download the API key value.
For more information, see Managing user API keys in the IBM Cloud documentation.
Creating a StreamSets environment
Create a StreamSets environment for your project. An environment defines the Data Collector engine version, engine configuration, and the stage libraries to install on the engine. The installed stage libraries determine the stages, such as sources and targets, that you can use in flows.
About this task
After you save the environment, you cannot change the engine version.
Procedure
Running the engine command
Use the command line to run a Data Collector engine as a container.
About this task
Complete the required prerequisites before you run the engine command.
Procedure
Customizing the engine command
About this task
You can customize the engine run command by adding environment variables or Docker or Podman command options. To customize the command, edit the StreamSets environment details. The environment then includes the customization whenever you retrieve the command.
You can add the following information to customize the command:
- Environment variables: Add environment variables that you want the engine container to use.
- Docker or Podman command options: Add Docker or Podman run command options.
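To illustrate where these two kinds of customization end up, the sketch below assembles a generic container run command. It is not the command that a StreamSets environment generates: the image name, the variable `EXAMPLE_VAR`, and the helper `build_run_command` are all hypothetical. The only real syntax it relies on is `run`, the `-e KEY=VALUE` flag, and the `--restart` option, which both Docker and Podman accept.

```python
import shlex


def build_run_command(runtime, image, env=None, extra_options=None):
    """Assemble `<runtime> run [options] [-e KEY=VALUE ...] <image>` as a list."""
    command = [runtime, "run"]
    # Extra run options are passed through unchanged, before the image name.
    command.extend(extra_options or [])
    # Each environment variable becomes one `-e KEY=VALUE` pair.
    for key, value in (env or {}).items():
        command.extend(["-e", f"{key}={value}"])
    command.append(image)
    return command


if __name__ == "__main__":
    command = build_run_command(
        "docker",
        "example/engine-image:tag",  # placeholder, not the real engine image
        env={"EXAMPLE_VAR": "value"},  # hypothetical variable
        extra_options=["--restart", "on-failure"],
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(command))
```

In practice, you enter the variables and options in the environment details and retrieve the finished command from the environment, rather than building it yourself.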