Creating a StreamSets environment

Create a StreamSets environment to configure Data Collector engines and the compute resources for the flows and jobs in your project.

About this task

When you create an environment, you specify the following information:

Data Collector engine version
Stage libraries to install on the engine
Number of virtual processor cores (VPCs) allocated to the engine, which determines your StreamSets compute resources
Engine advanced configuration

After you save the environment, you cannot change the engine version or the number of VPCs allocated to the engine.

Procedure

On the Manage tab of your project, click the StreamSets tool.
Click New environment.

Configure the following properties:

Property	Description
Name	Name of the environment.
Description	Optional description that informs your team of the environment use case.
Data Collector engine version	Data Collector engine version to run.

In the Configure details section, configure the following properties:

Property	Description
Stage libraries	Stage libraries to install on the engine. The installed stage libraries determine the stages, such as sources and targets, that you can use in flows. You can use the default stage libraries to get started. To install more, click Select stage libraries.
External resources	Archive file that contains the external resources that are used by the engine. The archive file must be in TGZ or compressed file format, use the required directory structure, and be imported as a data asset in your project.
VPCs allocated to engine	Number of VPCs allocated to the engine container. Default is 4, which meets the processing needs for most streaming use cases. The number of VPCs determines your StreamSets compute resources.
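The External resources property expects a TGZ archive. As a minimal sketch of how such an archive could be built locally before you import it as a data asset, the following Python function packages a directory into a gzip-compressed tar file. The function name `package_external_resources` and the directory and file names are hypothetical, and the sketch does not enforce the required directory structure, which is not described in this topic.

```python
import tarfile
from pathlib import Path


def package_external_resources(source_dir, archive_path):
    """Package the contents of source_dir into a gzip-compressed tar (TGZ) file.

    Paths inside the archive are stored relative to source_dir. The caller is
    responsible for arranging source_dir in the directory structure that the
    engine requires; this helper does not validate it.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")

    archive = Path(archive_path)
    # Collect entries before creating the archive so that the archive itself
    # is never picked up if it is written inside source_dir.
    members = sorted(source.rglob("*"))

    with tarfile.open(archive, "w:gz") as tar:
        for path in members:
            # recursive=False: every file and directory is added exactly once.
            tar.add(path, arcname=path.relative_to(source).as_posix(), recursive=False)
    return archive
```

For example, `package_external_resources("resources", "resources.tgz")` writes `resources.tgz` in the current directory, where both names are placeholders for your own paths.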

Optionally, expand the Resource thresholds section and modify the following thresholds. You can use the defaults to get started.

Threshold	Description
Max CPU load	Maximum percentage of CPU in the container that an engine can use. When CPU usage on an engine reaches or exceeds this threshold, new jobs do not start on the engine. This threshold is monitored on engine versions 6.4 and later. Default is 80.
Max memory used	Maximum percentage of the configured Java heap size that an engine can use. When memory usage on an engine reaches or exceeds this threshold, new jobs do not start on the engine. This threshold is monitored on engine versions 6.4 and later. Default is 100.
Max jobs running	Maximum number of jobs that can run on an engine at the same time. When the number of running jobs on an engine reaches this threshold, new jobs do not start on the engine. This threshold is monitored on engine versions 6.4 and later. Default is 10.
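To illustrate how the three thresholds combine, the following Python sketch models the decision of whether a new job can start on an engine. It is only an illustration of the rules in the table, not the product's implementation. The class and function names are hypothetical, and the defaults are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceThresholds:
    """Hypothetical container for the thresholds, using the documented defaults."""
    max_cpu_load: float = 80.0      # percentage of container CPU
    max_memory_used: float = 100.0  # percentage of configured Java heap size
    max_jobs_running: int = 10      # jobs running at the same time


def can_start_new_job(cpu_load_pct, memory_used_pct, jobs_running,
                      thresholds=ResourceThresholds()):
    """Return True only if the engine is below every threshold."""
    if cpu_load_pct >= thresholds.max_cpu_load:
        return False
    if memory_used_pct >= thresholds.max_memory_used:
        return False
    # The table says new jobs stop when the engine equals this threshold;
    # >= also covers any value above it.
    if jobs_running >= thresholds.max_jobs_running:
        return False
    return True
```

With the defaults, an engine at 50 percent CPU, 60 percent heap, and 3 running jobs accepts a new job, while the same engine at 80 percent CPU does not.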

Optionally, expand the Advanced configuration section and define the following configurations:

Advanced configuration	Description
Data Collector engine properties	Data Collector configuration properties for advanced use cases, including using a custom keystore file for more secure HTTPS communication, enabling the use of credential stores, configuring the engine to send emails from an Email executor included in a flow, or developing custom stages.
Log4j2 properties	Engine log configuration properties for changing the log level or making other customizations for advanced use cases.
JVM options	Java virtual machine (JVM) options for the engine.
Environment variables	Environment variables added to the engine run command.
Docker command options	Docker or Podman command options added to the engine run command.
Custom CA certificate	Custom CA certificate for connecting to systems that use self-signed certificates.

Click Save.
Complete the listed prerequisites. Then, copy and run the engine command to start an engine for this environment.

For detailed steps, see Running a Data Collector engine.