Create a StreamSets environment to configure Data Collector engines and the compute resources for the flows and jobs in your project.
About this task
When you create an environment, you specify the following information:
- Data Collector engine version
- Stage libraries to install on the engine
- Number of VPCs allocated to the engine, which determines your StreamSets compute resources
- Engine advanced configuration
After you save the environment, you cannot change the engine version or the number of VPCs
allocated to the engine.
Procedure
- On the Manage tab of your project, click the StreamSets tool.
- Click New environment.
- Configure the following properties:
| Property | Description |
| --- | --- |
| Name | Name of the environment. |
| Description | Optional description that informs your team of the environment use case. |
| Data Collector engine version | Data Collector engine version to run. |
- In the Configure details section, configure the following properties:
| Property | Description |
| --- | --- |
| Stage libraries | Stage libraries to install on the engine. The installed stage libraries determine the stages, such as sources and targets, that you can use in flows. You can use the default stage libraries to get started. To install more, click Select stage libraries. |
| External resources | Archive file that contains the external resources that are used by the engine. The archive file must be in TGZ or compressed file format, use the required directory structure, and be imported as a data asset in your project. |
| VPCs allocated to engine | Number of VPCs allocated to the engine container. Default is 4, which meets the processing needs for most streaming use cases. The number of VPCs determines your StreamSets compute resources. |
- Optionally, expand the Resource thresholds section and modify the following thresholds. You can use the defaults to get started.
| Threshold | Description |
| --- | --- |
| Max CPU load | Maximum percentage of CPU in the container that an engine can use. When an engine equals or exceeds this threshold, new jobs do not start on the engine. This threshold is monitored with engine versions 6.4 and later. Default is 80. |
| Max memory used | Maximum percentage of the configured Java heap size that an engine can use. When an engine equals or exceeds this threshold, new jobs do not start on the engine. This threshold is monitored with engine versions 6.4 and later. Default is 100. |
| Max jobs running | Maximum number of jobs that can run on an engine at the same time. When an engine equals this threshold, new jobs do not start on the engine. This threshold is monitored with engine versions 6.4 and later. Default is 10. |
- Optionally, expand the Advanced configuration section and define the following configurations:
| Advanced configuration | Description |
| --- | --- |
| Data Collector engine properties | Data Collector configuration properties for advanced use cases, including using a custom keystore file for more secure HTTPS communication, enabling the use of credential stores, configuring the engine to send emails from an Email executor included in a flow, or developing custom stages. |
| Log4j2 properties | Engine log configuration properties for modifying the log level or customizations for advanced use cases. |
| JVM options | Java virtual machine (JVM) options for the engine. |
| Environment variables | Environment variables added to the engine run command. |
| Docker command options | Docker or Podman command options added to the engine run command. |
| Custom CA certificate | Custom CA certificate for connecting to systems that use self-signed certificates. |
- Click Save.
- Complete the listed prerequisites. Then, copy and run the engine command to run an engine
for this environment.
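
The resource thresholds in the procedure above can be read as a simple admission check: a new job starts on an engine only when none of the three thresholds has been reached. The following Python sketch illustrates that reading with the documented defaults (80, 100, and 10). It is not StreamSets code: the names `ResourceThresholds`, `EngineStatus`, `blocking_thresholds`, and `can_start_new_job` are hypothetical, and the real engine measures and enforces these values internally, only for engine versions 6.4 and later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceThresholds:
    """Environment thresholds, using the defaults listed in the procedure."""
    max_cpu_load: float = 80.0      # percent of container CPU
    max_memory_used: float = 100.0  # percent of configured Java heap size
    max_jobs_running: int = 10      # jobs running on one engine at a time


@dataclass(frozen=True)
class EngineStatus:
    """Hypothetical snapshot of an engine's load (not a StreamSets type)."""
    cpu_load: float     # percent
    memory_used: float  # percent of Java heap
    jobs_running: int


def blocking_thresholds(status, thresholds=ResourceThresholds()):
    """Return the names of the thresholds that currently block new jobs."""
    blocked = []
    # CPU and memory block new jobs when the engine equals or exceeds them.
    if status.cpu_load >= thresholds.max_cpu_load:
        blocked.append("Max CPU load")
    if status.memory_used >= thresholds.max_memory_used:
        blocked.append("Max memory used")
    # The job count blocks new jobs once it equals the limit; >= is used
    # here only as a defensive comparison in this sketch.
    if status.jobs_running >= thresholds.max_jobs_running:
        blocked.append("Max jobs running")
    return blocked


def can_start_new_job(status, thresholds=ResourceThresholds()):
    """True when no threshold has been reached."""
    return not blocking_thresholds(status, thresholds)
```

For example, under this sketch `can_start_new_job(EngineStatus(50.0, 60.0, 3))` is true with the default thresholds, while an engine at 80 percent CPU load is blocked by Max CPU load until its load drops below the threshold.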