Data Collector Communication
StreamSets Control Hub works with Data Collector to design pipelines and to execute standalone and cluster pipelines.
After you install StreamSets Control Hub, you install Data Collectors on-premises or on a protected cloud computing platform, and then register them to work with Control Hub.
- Authoring Data Collector
- Use an authoring Data Collector to design pipelines and to create connections. You design pipelines in Control Hub Pipeline Designer after selecting an available authoring Data Collector. The selected authoring Data Collector determines the stages, stage libraries, and functionality that display in Pipeline Designer.
- Execution Data Collector
- Use an execution Data Collector to execute standalone and cluster pipelines run from Control Hub jobs.
A single Data Collector can serve both purposes. However, StreamSets recommends dedicating each Data Collector as either an authoring or execution Data Collector.
Registered Data Collectors use encrypted REST APIs to communicate with Control Hub. Data Collectors initiate outbound connections to Control Hub on the port number configured in the Control Hub system.
The web browser that accesses Control Hub Pipeline Designer uses encrypted REST APIs to communicate with Control Hub. The web browser initiates outbound connections to Control Hub on the port number configured in the Control Hub system.
The authoring Data Collector selected for Pipeline Designer or for connection creation accepts inbound connections from the web browser on the port number configured for the Data Collector. Similarly, the execution Data Collector accepts inbound connections from the web browser when you monitor real-time summary statistics, error information, and snapshots for active jobs.
Each outbound and inbound connection must use the same protocol, HTTP or HTTPS, as the Control Hub system.
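For example, a quick way to verify that a Data Collector host can open an outbound connection to Control Hub on the expected protocol and port is a plain HTTPS request to the Control Hub base URL. The sketch below assumes a hypothetical base URL; substitute the URL and port configured for your Control Hub system:

```python
# Minimal connectivity check from a Data Collector host to Control Hub.
# The base URL is a placeholder; use your own deployment's URL and port.
import requests

CONTROL_HUB_URL = "https://cloud.streamsets.com"  # hypothetical base URL

try:
    # The outbound connection must use the same protocol (HTTP or HTTPS)
    # and port that the Control Hub system is configured with.
    response = requests.get(CONTROL_HUB_URL, timeout=10)
    print(f"Reached Control Hub: HTTP {response.status_code}")
except requests.exceptions.SSLError:
    print("TLS handshake failed; check certificates and the HTTPS configuration")
except requests.exceptions.ConnectionError:
    print("Connection failed; check the host, port, and firewall rules")
```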
The following image shows how authoring and execution Data Collectors communicate with Control Hub:
Data Collector Requests
Registered Data Collectors send requests and information to several of the Control Hub applications.
Control Hub applications do not send requests directly to Data Collectors. Instead, Control Hub applications send requests using encrypted REST APIs to a messaging queue managed by the Messaging application. Data Collectors periodically poll the queue to retrieve application requests.
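Conceptually, this is a pull model: the Data Collector makes outbound polling calls rather than accepting inbound requests from Control Hub applications. The following sketch illustrates the pattern only; the endpoint path, authentication header, poll interval, and response shape are hypothetical, not the actual Control Hub REST API:

```python
# Conceptual sketch of the pull model: the Data Collector polls the
# Messaging application's queue instead of accepting inbound requests.
# The endpoint path, header, and payload shape here are hypothetical.
import time
import requests

CONTROL_HUB_URL = "https://cloud.streamsets.com"       # placeholder
AUTH_HEADERS = {"X-SS-App-Auth-Token": "<app-token>"}  # placeholder credentials

def handle_application_request(message):
    """Stub: dispatch a queued request, such as starting or stopping a pipeline."""
    print("received request:", message)

def poll_for_requests():
    while True:
        # Outbound call only; Control Hub never opens a connection
        # to the Data Collector.
        response = requests.get(
            f"{CONTROL_HUB_URL}/messaging/rest/v1/messages",  # hypothetical path
            headers=AUTH_HEADERS,
            timeout=30,
        )
        for message in response.json():
            handle_application_request(message)
        time.sleep(5)  # poll interval is illustrative
```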
- Pipeline Store
- When you use an authoring Data Collector to publish a pipeline to Control Hub or to download a published pipeline from Control Hub, the Data Collector sends the request to the Pipeline Store application.
- Connection
- When you start a job for a pipeline that uses a connection, the execution Data Collector requests the connection properties from the Connection application.
- Job Runner
- Every minute, Data Collectors send a heartbeat, the last-saved offsets, and the status of all remotely running pipelines to the Job Runner application so that Control Hub can manage job execution. Note: Data Collector versions 3.13.0 and earlier send this information to the Messaging application.
- Security
- When you enable Control Hub within a Data Collector or when a user logs into a registered Data Collector, the Data Collector makes an authentication request to the Security application.
- Time Series
- Every minute, an execution Data Collector sends aggregated metrics for remotely running pipelines to the Time Series application.
- Messaging
- Data Collectors send the following information to the Messaging application:
- At startup, a Data Collector sends the following information: Data Collector version, HTTP URL of the Data Collector, and labels configured in the Control Hub configuration file, $SDC_CONF/dpm.properties (see the example after this list).
- When you update permissions on local pipelines, the Data Collector sends the updated pipeline permissions.
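The Control Hub connection settings, including the labels reported at startup, live in the Data Collector's $SDC_CONF/dpm.properties file. The fragment below is representative only; the values are placeholders, and your file may contain additional properties:

```properties
# $SDC_CONF/dpm.properties (representative fragment; values are placeholders)

# Enable Control Hub within this Data Collector
dpm.enabled=true

# Control Hub base URL; the protocol and port must match the Control Hub system
dpm.base.url=https://cloud.streamsets.com

# Labels reported to Control Hub at startup, used to group Data Collectors
dpm.remote.control.job.labels=all,western-region
```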
Cluster Pipeline Communication
When you run a cluster pipeline from a Control Hub job, the Data Collector installed on the gateway node communicates with Control Hub.
When a pipeline runs in cluster batch or cluster streaming mode, the Data Collector installed on the gateway node submits jobs to YARN. The jobs run on worker nodes in the cluster as either Spark Streaming or MapReduce jobs. Data Collectors do not need to be installed on worker nodes in the cluster; the pipeline and all of its dependencies are included in the launched job.
The Data Collector worker processes running under YARN do not communicate with Control Hub. Instead, each Data Collector worker sends metrics for the running pipeline to the system configured to aggregate statistics for the pipeline: Kafka, Amazon Kinesis Streams, or MapR Streams. The gateway Data Collector reads and aggregates all pipeline statistics from the configured system and then sends the aggregated statistics to the Control Hub Time Series application.
Similarly, the Data Collector worker processes send a heartbeat back to the gateway Data Collector. The gateway Data Collector sends the URL of each Data Collector worker back to the Control Hub messaging queue.
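To make the statistics flow concrete, the following sketch shows the general read-and-aggregate pattern using Kafka as the configured system. It is a conceptual illustration, not Data Collector's implementation; the topic name, message format, and use of the kafka-python client are assumptions:

```python
# Conceptual sketch: aggregate per-worker pipeline statistics from Kafka,
# as the gateway Data Collector conceptually does before forwarding the
# totals to the Control Hub Time Series application.
# The topic name and message format are assumptions for illustration.
import json
from collections import Counter
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "pipeline-stats",                      # hypothetical statistics topic
    bootstrap_servers="kafka-broker:9092", # placeholder broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

totals = Counter()
for message in consumer:
    stats = message.value  # e.g. {"worker": "w1", "records": 1200, "errors": 3}
    totals["records"] += stats.get("records", 0)
    totals["errors"] += stats.get("errors", 0)
    # In the real flow, the gateway periodically sends the aggregated
    # totals to Control Hub; here we simply print them.
    print(dict(totals))
```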