Installation Requirements
Install StreamSets Control Hub on a machine that meets the following minimum requirements:
Component | Minimum Requirement |
---|---|
Operating system | Use one of the following operating systems and versions:
|
Java | Use one of the following Java versions:
Note: Java 8u161 or earlier also requires that you download the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files 8.
|
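The Java note above can be checked programmatically. The following is a minimal sketch, not part of the Control Hub installer: the function names `needs_jce_policy_files` and `installed_java_version` are hypothetical, and the threshold of update 161 is taken directly from the note. It assumes Java 8 version strings in the `1.8.0_151` or `8u151` form.

```python
import re
import subprocess
from typing import Optional

# Matches Java 8 version strings such as "1.8.0_151", "1.8.0_151-b12", or "8u151".
_JAVA8 = re.compile(r"^(?:1\.8\.0(?:_(?P<a>\d+))?|8u(?P<b>\d+))(?:-.*)?$")


def needs_jce_policy_files(version: str) -> bool:
    """Return True when the version is Java 8 at update 161 or earlier.

    Any version string that is not recognized as Java 8 returns False.
    """
    match = _JAVA8.match(version.strip())
    if match is None:
        return False
    # "1.8.0" without an update suffix is treated as update 0.
    update = int(match.group("a") or match.group("b") or "0")
    return update <= 161


def installed_java_version() -> Optional[str]:
    """Read the version string reported by the `java` found on the PATH."""
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    found = re.search(r'version "([^"]+)"', result.stderr)
    return found.group(1) if found else None
```

For example, `needs_jce_policy_files("1.8.0_151")` returns `True`, while a Java 11 version string returns `False`.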
The remaining requirements depend on whether you are installing a single Control Hub instance for a development environment or multiple Control Hub instances for a highly available production environment:
Component | Single Installation | Multiple Installations for High Availability |
---|---|---|
CPU | 8 | 4 |
RAM | 15 GB | 7.5 GB |
Disk space | 50 GB | 30 GB |
For example, the following Amazon EC2 instance types meet these requirements:
- Single installation - c4.2xlarge
- Multiple installations for high availability - c4.xlarge
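The minimums in the table above can be compared against a candidate machine before installation. The sketch below is illustrative only: the `Resources` class and the `shortfalls` and `detect` helpers are hypothetical names, `detect` relies on `os.sysconf` names that are available on Linux, and it assumes the table's GB figures can be read as GiB. A machine sold with 15 GB of RAM usually reports slightly less to the operating system, so treat a small RAM shortfall with care.

```python
import os
import shutil
from dataclasses import dataclass
from typing import List

GIB = 1024 ** 3


@dataclass(frozen=True)
class Resources:
    cpu_cores: int
    ram_gb: float
    disk_gb: float


# Values copied from the table above.
SINGLE = Resources(cpu_cores=8, ram_gb=15, disk_gb=50)
HIGH_AVAILABILITY = Resources(cpu_cores=4, ram_gb=7.5, disk_gb=30)


def shortfalls(actual: Resources, required: Resources) -> List[str]:
    """Return one message for each resource that is below the minimum."""
    problems = []
    for field in ("cpu_cores", "ram_gb", "disk_gb"):
        have, need = getattr(actual, field), getattr(required, field)
        if have < need:
            problems.append(f"{field}: have {have:g}, need {need:g}")
    return problems


def detect(path: str = "/") -> Resources:
    """Measure the local machine (Linux only); disk is the total size at path."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_bytes = shutil.disk_usage(path).total
    return Resources(os.cpu_count() or 0, ram_bytes / GIB, disk_bytes / GIB)
```

Calling `shortfalls(detect(), SINGLE)` on the target machine returns an empty list when all three minimums are met.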
General Access Requirements
After installation, Control Hub requires access to the following components. These components can be local or remote to the Control Hub installations:
Component | Minimum Requirement |
---|---|
SMTP account | SMTP account to send emails. |
Load balancer | Load balancer to set up a highly available Control Hub system. We recommend using a Layer 7 load balancer such as HAProxy, NGINX, or F5. Required for a production environment, optional for a development environment. |
Browser | Use the latest version of one of the following browsers:
Ensure that the browser can access registered Data Collectors and Transformers. |
StreamSets Data Collector | StreamSets recommends using the latest version of Data Collector. The minimum supported Data Collector version depends on how you use Data Collector:
If needed, you can customize the supported Data Collector version range. |
StreamSets Transformer | StreamSets recommends using the latest version of Transformer to design and execute Transformer pipelines from Control Hub. Version 3.16.0 or later is required to use connections. |
Statistics aggregator | Use one of the following systems to aggregate pipeline statistics when jobs run on multiple Data Collectors:
Note: In a development environment, you can also use SDC RPC to aggregate pipeline statistics. Using SDC RPC to aggregate statistics is not highly available and might cause the loss of some data. It should be used for development purposes only.
|
Relational Database Requirements
Control Hub supports MariaDB, MySQL, or PostgreSQL for the relational database instance.
MariaDB Requirements
The relational database for a single Control Hub instance supports MariaDB 10.x. Control Hub is fully tested with MariaDB 10.11.
The relational database for a highly available Control Hub system supports MariaDB Galera Cluster 10.x.
MariaDB installations must meet the following minimum requirements:
Component | Single Installation | Multiple Installations for High Availability |
---|---|---|
CPU | 4 | 4 |
RAM | 30.5 GB | 30.5 GB |
Disk space | 50 GB | 100 GB |
For example, the following Amazon RDS instance classes meet these requirements:
- Single installation - db.r3.xlarge
- Multiple installations for high availability - db.r3.xlarge
MySQL Requirements
The relational database for a single Control Hub instance supports MySQL 5.6, 5.7, or 8.x. Control Hub is fully tested with MySQL 8.0.28.
The relational database for a highly available Control Hub system supports MySQL Enterprise High Availability 5.6, 5.7, or 8.x.
MySQL installations must meet the following minimum requirements:
Component | Single Installation | Multiple Installations for High Availability |
---|---|---|
CPU | 4 | 4 |
RAM | 30.5 GB | 30.5 GB |
Disk space | 50 GB | 100 GB |
For example, the following Amazon RDS instance classes meet these requirements:
- Single installation - db.r3.xlarge
- Multiple installations for high availability - db.r3.xlarge
PostgreSQL Requirements
The relational database for a single Control Hub instance supports PostgreSQL 9.4, 9.6, 11.x, or 14.x. Control Hub is fully tested with PostgreSQL 11.10 and 14.6.
The relational database for a highly available Control Hub system supports PostgreSQL 9.4, 9.6, 11.x, or 14.x with high availability enabled.
PostgreSQL installations must meet the following minimum requirements:
Component | Single Installation | Multiple Installations for High Availability |
---|---|---|
CPU | 4 | 4 |
RAM | 30.5 GB | 30.5 GB |
Disk space | 50 GB | 100 GB |
For example, the following Amazon RDS instance classes meet these requirements:
- Single installation - db.r3.xlarge
- Multiple installations for high availability - db.r3.xlarge
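The supported relational database versions listed in the three sections above can be encoded as a small lookup. This is a sketch under stated assumptions: the `SUPPORTED` mapping and the `is_supported` function are hypothetical names, only the major and minor version numbers are compared, and the check does not cover the high availability variants (MariaDB Galera Cluster, MySQL Enterprise High Availability).

```python
import re
from typing import Dict, Optional, Tuple

# (major, minor) pairs from the sections above; a minor of None means any ".x" release.
SUPPORTED: Dict[str, Tuple[Tuple[int, Optional[int]], ...]] = {
    "mariadb": ((10, None),),
    "mysql": ((5, 6), (5, 7), (8, None)),
    "postgresql": ((9, 4), (9, 6), (11, None), (14, None)),
}


def is_supported(database: str, version: str) -> bool:
    """Return True when the database version falls in a supported range."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        return False
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) is not None else None
    for want_major, want_minor in SUPPORTED.get(database.lower(), ()):
        if major == want_major and (want_minor is None or minor == want_minor):
            return True
    return False
```

For example, `is_supported("MySQL", "8.0.28")` returns `True` and `is_supported("PostgreSQL", "9.5")` returns `False`.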
Time Series Database Requirements
The time series database for a single Control Hub instance supports InfluxDB 1.3.x, 1.7.x, or 1.9.x.
The time series database for a highly available Control Hub system supports InfluxDB Enterprise 1.3.x, 1.7.x, or 1.9.x with a minimum of 2 data nodes and 3 meta nodes in the cluster. A single data node and a single meta node can be deployed to the same server.
InfluxDB installations must meet the following minimum requirements:
Component | Single Installation | Multiple Installations for High Availability |
---|---|---|
CPU | 4 | 8 |
RAM | 30.5 GB | 61 GB |
Disk space | 250 GB | 500 GB |
For example, the following Amazon EC2 instance types meet these requirements:
- Single installation - r4.xlarge
- Multiple installations for high availability - r3.2xlarge
Default Ports
The following table lists the default ports exposed to Control Hub clients and how they are used. Note that the default port numbers can be changed during installation.
In a development environment, configure network routes and firewalls so that web UI clients and registered Data Collectors and Provisioning Agents can reach the Control Hub IP addresses.
In a highly available production environment, configure network routes and firewalls so that the Control Hub instances, web UI clients, and registered Data Collectors and Provisioning Agents can reach the load balancer.
System | Default Port | Protocol | Usage |
---|---|---|---|
Control Hub | | TCP | Access to the Control Hub web-based UI and API for a single Control Hub instance in a development environment. Used by developers and administrators to access the UI. Used by registered Data Collectors and Provisioning Agents to access the API. |
Control Hub Admin tool | | TCP | Access to the Control Hub Admin tool web-based UI for a single Control Hub instance in a development environment. Used by administrators to access the UI. |
Load balancer | Depends on the chosen load balancer | TCP | When using multiple Control Hub instances in a highly available production environment, both Control Hub and the Control Hub Admin tool are accessed through a load balancer. |
The following table lists the default ports of the external systems that Control Hub depends on and how they are used. These systems might be configured to use different ports, so confirm the actual numbers with your systems administrator.
External System | Default Port | Protocol | Usage |
---|---|---|---|
MariaDB | 3306 | TCP | Relational database that stores Control Hub application data. |
MySQL | 3306 | TCP | Relational database that stores Control Hub application data. |
PostgreSQL | 5432 | TCP | Relational database that stores Control Hub application data. |
InfluxDB | 8086 | TCP | Time series database that stores metrics. |
LDAP or LDAPS | 389 (LDAP), 636 (LDAPS) | TCP | Used when Control Hub is configured for LDAP or LDAPS authentication. |
SMTP | 465 | TCP | Used by Control Hub to send email notifications. |
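Before installing, it can help to confirm that the Control Hub machine can open TCP connections to these external systems. The sketch below is an illustration, not a StreamSets tool: the names `DEFAULT_PORTS`, `can_connect`, and `unreachable` are hypothetical, the port numbers are the defaults from the table above, and the host names in the usage note are placeholders.

```python
import socket
from typing import Callable, Dict, List

# Default ports copied from the table above; your systems may use other ports.
DEFAULT_PORTS: Dict[str, int] = {
    "MariaDB": 3306,
    "MySQL": 3306,
    "PostgreSQL": 5432,
    "InfluxDB": 8086,
    "LDAP": 389,
    "LDAPS": 636,
    "SMTP": 465,
}


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True when a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(
    hosts: Dict[str, str],
    check: Callable[[str, int], bool] = can_connect,
) -> List[str]:
    """Map system name to host, then list the systems that fail the check."""
    failed = []
    for name, host in hosts.items():
        port = DEFAULT_PORTS[name]
        if not check(host, port):
            failed.append(f"{name} ({host}:{port})")
    return failed
```

A call such as `unreachable({"PostgreSQL": "db.example.internal", "SMTP": "mail.example.internal"})` returns the systems that could not be reached. An open TCP port shows only that the network route and firewall allow the connection, not that authentication or TLS is configured correctly.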
Browser Access to Data Collector and Transformer
- Authoring engines: Authoring Data Collectors and Transformers accept inbound connections from the web browser when you design pipelines using Pipeline Designer.
- Execution engines: Execution Data Collectors and Transformers accept inbound connections from the web browser when you complete the following tasks:
  - Capture and view snapshots in an active Data Collector job.
  - Monitor real-time statistics on the Realtime Summary tab for an active Data Collector or Transformer job.
  - Monitor error records encountered by a pipeline stage in an active Data Collector job.
  - View the execution engine log when monitoring an active Data Collector or Transformer job.
  - View configuration properties, active Java threads, metric charts, logs, and directories when monitoring a Data Collector or Transformer from the Execute view.
Configure network routes and firewalls so that the web browser used to access Control Hub can reach the URLs of registered Data Collectors and Transformers.
If registered Data Collectors and Transformers are installed on a cloud computing platform such as Amazon Elastic Compute Cloud (EC2), configure them to use a publicly accessible URL as described in Publicly Accessible URL for Data Collector or Publicly Accessible URL for Transformer.