Creating a StreamSets environment

Create a StreamSets environment to configure Data Collector engines and the compute resources for the flows and jobs in your project.

About this task

When you create an environment, you specify the following information:
  • Data Collector engine version
  • Stage libraries to install on the engine
  • Number of VPCs allocated to the engine, which determines your StreamSets compute resources
  • Engine advanced configuration

After you save the environment, you cannot change the engine version or the number of VPCs allocated to the engine.
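
The settings listed above, and the rule that two of them are locked after saving, can be pictured as a small record. The following Python sketch is illustrative only: the class, field names, and the `6.4.0` version string are hypothetical placeholders, not part of any StreamSets API. The default of 4 VPCs is taken from the procedure below.

```python
from dataclasses import dataclass, field

# Hypothetical model of an environment; this is not a StreamSets API.
# These two settings cannot be changed after the environment is saved.
LOCKED_AFTER_SAVE = {"engine_version", "vpcs"}


@dataclass
class EnvironmentSketch:
    name: str
    engine_version: str
    vpcs: int = 4  # default number of VPCs allocated to the engine
    description: str = ""
    stage_libraries: list = field(default_factory=list)
    advanced_config: dict = field(default_factory=dict)
    saved: bool = False

    def save(self) -> None:
        """Mark the environment as saved, which locks some settings."""
        self.saved = True

    def update(self, **changes) -> None:
        """Apply changes, rejecting edits to locked settings after save."""
        if self.saved:
            locked = sorted(LOCKED_AFTER_SAVE.intersection(changes))
            if locked:
                raise ValueError(f"Cannot change after save: {locked}")
        for key, value in changes.items():
            if not hasattr(self, key):
                raise AttributeError(f"Unknown setting: {key}")
            setattr(self, key, value)


env = EnvironmentSketch(name="demo", engine_version="6.4.0")
env.update(vpcs=8)            # allowed: the environment is not saved yet
env.save()
env.update(description="Streaming ingest")  # still allowed after save
```

In this sketch, calling `env.update(vpcs=2)` after `env.save()` raises a `ValueError`, which mirrors the restriction described above.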

Procedure

  1. On the Manage tab of your project, click the StreamSets tool.
  2. Click New environment.
  3. Configure the following properties:
    Property | Description
    Name | Name of the environment.
    Description | Optional description that informs your team of the environment use case.
    Data Collector engine version | Data Collector engine version to run.
  4. In the Configure details section, configure the following properties:
    Property | Description
    Stage libraries | Stage libraries to install on the engine. The installed stage libraries determine the stages, such as sources and targets, that you can use in flows. You can use the default stage libraries to get started. To install more, click Select stage libraries.
    External resources | Archive file that contains the external resources that are used by the engine. The archive file must be in TGZ or compressed file format, use the required directory structure, and be imported as a data asset in your project.
    VPCs allocated to engine | Number of VPCs allocated to the engine container, which determines your StreamSets compute resources. Default is 4, which meets the processing needs for most streaming use cases.

  5. Optionally, expand the Resource thresholds section and modify the following thresholds. You can use the defaults to get started.
    Threshold | Description
    Max CPU load | Maximum percentage of the container CPU that an engine can use. When the CPU load of an engine reaches or exceeds this threshold, new jobs do not start on the engine. This threshold is monitored for engine versions 6.4 and later. Default is 80.
    Max memory used | Maximum percentage of the configured Java heap size that an engine can use. When the memory use of an engine reaches or exceeds this threshold, new jobs do not start on the engine. This threshold is monitored for engine versions 6.4 and later. Default is 100.
    Max jobs running | Maximum number of jobs that can run on an engine at the same time. When the number of running jobs reaches this threshold, new jobs do not start on the engine. This threshold is monitored for engine versions 6.4 and later. Default is 10.

  6. Optionally, expand the Advanced configuration section and define the following configurations:
    Advanced configuration | Description
    Data Collector engine properties | Data Collector configuration properties for advanced use cases, including using a custom keystore file for more secure HTTPS communication, enabling the use of credential stores, configuring the engine to send emails from an Email executor included in a flow, or developing custom stages.
    Log4j2 properties | Engine log configuration properties for changing the log level or making other customizations for advanced use cases.
    JVM options | Java virtual machine (JVM) options for the engine.
    Environment variables | Environment variables added to the engine run command.
    Docker command options | Docker or Podman command options added to the engine run command.
    Custom CA certificate | Custom CA certificate for connecting to systems that use self-signed certificates.
  7. Click Save.
  8. Complete the listed prerequisites. Then copy and run the engine command to start an engine for this environment.

    For detailed steps, see Running a Data Collector engine.
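
The resource thresholds in step 5 each block new jobs once an engine reaches the configured limit. The following Python sketch shows one way to picture that check. It is a simplified illustration under stated assumptions: the class and function names are hypothetical, the metric values are supplied by hand, and the code does not reflect how the product measures or enforces the thresholds internally. The defaults (80, 100, and 10) are the ones listed in step 5.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Defaults as documented for the Resource thresholds section.
    max_cpu_load: float = 80.0       # percent of container CPU
    max_memory_used: float = 100.0   # percent of configured Java heap size
    max_jobs_running: int = 10       # concurrent jobs per engine


@dataclass
class EngineStatus:
    # Hypothetical snapshot of an engine's current resource use.
    cpu_load: float
    memory_used: float
    jobs_running: int


def blocking_thresholds(status, thresholds=None):
    """Return the names of thresholds that stop new jobs from starting."""
    limits = thresholds or Thresholds()
    blocked = []
    if status.cpu_load >= limits.max_cpu_load:
        blocked.append("Max CPU load")
    if status.memory_used >= limits.max_memory_used:
        blocked.append("Max memory used")
    if status.jobs_running >= limits.max_jobs_running:
        blocked.append("Max jobs running")
    return blocked


busy = EngineStatus(cpu_load=85.0, memory_used=60.0, jobs_running=3)
print(blocking_thresholds(busy))  # ['Max CPU load']
```

An empty list means that, in this model, none of the thresholds prevents a new job from starting on the engine.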