Tutorial: Data Collectors, Pipelines, and Jobs
This tutorial covers the basic tasks required to get up and running with StreamSets Control Hub. In this tutorial, we register a Data Collector with Control Hub, design and publish a pipeline, and create and start a job for the pipeline.
Although our tutorial provides a simple use case, keep in mind that Control Hub is a powerful tool that enables you to orchestrate and monitor large numbers of pipelines running on groups of Data Collectors.
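To get started with Control Hub, we'll complete the following tasks:
- Register a Data Collector
- Assign Labels to the Data Collector
- Design a Pipeline
- Publish the Pipeline
- Add a Job for the Pipeline
- Start and Monitor the Job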
Before You Begin
This tutorial assumes that you have a running StreamSets Data Collector and a user account to log in to Control Hub.
Data Collectors work directly with Control Hub: they execute the standalone and cluster pipelines that are run from Control Hub jobs.
Before you begin this tutorial, you'll need a few things:
- A running StreamSets Data Collector
- A user account to log in to Control Hub
Register a Data Collector
You register Data Collectors from Control Hub. StreamSets recommends registering the latest version of Data Collector to ensure that you can use the newest features.
For a Data Collector installation with Cloudera Manager, you must use Cloudera Manager to register the Data Collector.
Assign Labels to the Data Collector
Use labels to group Data Collectors registered with Control Hub. You assign labels to each Data Collector, using the same label for Data Collectors that you want to function as a group.
When you create a job, you assign labels to the job so that Control Hub knows on which group of Data Collectors the job should start.
For example, suppose your organization uses development and test environments to design and test pipelines before replicating the final pipelines in the production environment. You assign a Test label to the execution Data Collectors used to run test pipelines and a Production label to the execution Data Collectors used to run production pipelines. When you create jobs, you select the appropriate label to ensure that the jobs are started in the correct environment.
You can assign multiple labels to Data Collectors to group Data Collectors by a combination of projects, geographic regions, environments, departments, or any other classification you choose.
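The grouping idea can be sketched in a few lines of Python. This is an illustrative model only, not Control Hub's implementation: the `DataCollector` and `Job` classes, the host and job names, and the rule that a job targets every Data Collector carrying all of the job's labels are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCollector:
    name: str          # hypothetical identifier for a registered Data Collector
    labels: frozenset  # labels assigned to this Data Collector


@dataclass(frozen=True)
class Job:
    name: str
    labels: frozenset  # labels selected when the job was created


def eligible_collectors(job, collectors):
    """Return the Data Collectors whose labels include every job label.

    Assumption for this sketch: a job targets the group of Data Collectors
    that carry all of the labels assigned to the job.
    """
    return [dc for dc in collectors if job.labels <= dc.labels]


collectors = [
    DataCollector("sdc-test-1", frozenset({"Test"})),
    DataCollector("sdc-prod-1", frozenset({"Production", "West"})),
    DataCollector("sdc-prod-2", frozenset({"Production", "East"})),
]

test_job = Job("nightly-check", frozenset({"Test"}))
west_prod_job = Job("orders-west", frozenset({"Production", "West"}))

print([dc.name for dc in eligible_collectors(test_job, collectors)])
print([dc.name for dc in eligible_collectors(west_prod_job, collectors)])
```

Combining an environment label with a region label, as in `west_prod_job`, narrows the group without requiring a dedicated label for every combination.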
You can assign labels in either of two places:
- Control Hub configuration file: Define labels for the dpm.remote.control.job.labels property in the Control Hub configuration file, $SDC_CONF/dpm.properties, located in the Data Collector installation.
- Control Hub UI: View the details of a registered Data Collector in the Control Hub UI, and then add the label.
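For the configuration file option, the following Python sketch shows what a label entry might look like and how it could be read back. The sample file contents, the label values, the `read_labels` helper, and the assumption that multiple labels are written as a comma-separated list are all illustrative; check the comments in your own dpm.properties file for the exact format.

```python
# Hypothetical contents of $SDC_CONF/dpm.properties; the label values are made up.
SAMPLE_PROPERTIES = """
# Labels for this Data Collector
dpm.remote.control.job.labels=WesternRegion,Test
"""


def read_labels(properties_text, key="dpm.remote.control.job.labels"):
    """Return the labels defined for `key`, assuming a comma-separated value."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return [label.strip() for label in value.split(",") if label.strip()]
    return []  # property not present


print(read_labels(SAMPLE_PROPERTIES))
```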
Let's assume that our newly registered Data Collector is located on the West Coast and will be used to run pipelines for departments located in the west. So we'll use the Control Hub UI to assign a new label to designate the western region.
Design a Pipeline
You design pipelines in Control Hub using the Control Hub Pipeline Designer. A pipeline describes the flow of data from an origin system to destination systems and defines how to transform the data along the way.
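As a mental model, a pipeline can be pictured as records flowing from one origin, through a chain of transformations, to one or more destinations. The Python sketch below illustrates that flow only; it does not show how Data Collector executes pipelines, and `run_pipeline`, `uppercase_city`, and the sample records are hypothetical names and data invented for the example.

```python
def run_pipeline(origin, processors, destinations):
    """Pass each origin record through every processor, then to every destination."""
    for record in origin:
        for process in processors:
            record = process(record)
        for write in destinations:
            write(record)


# Hypothetical stand-ins for origin, processor, and destination stages.
origin = [{"id": 1, "city": "portland"}, {"id": 2, "city": "seattle"}]


def uppercase_city(record):
    # Return a new record so the origin data is left unchanged.
    return {**record, "city": record["city"].upper()}


primary_store = []
audit_log = []

run_pipeline(origin, [uppercase_city], [primary_store.append, audit_log.append])
print(primary_store)
```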
StreamSets Control Hub provides support for multiple origin and destination systems, such as relational databases, log files, and cloud storage platforms. For a complete list of supported systems, see Origins and Destinations.
For now, we'll design and configure a single test pipeline that uses development stages. If you already have a test pipeline in the Control Hub pipeline repository that you'd like to use, feel free to use it and activate at least one metric rule for it. Otherwise, create a simple pipeline using the development stages and metric rules as described below.
Publish the Pipeline
Next, we'll publish the pipeline to indicate that our design is complete and the pipeline is ready to be added to a job and run. When you publish a pipeline, you enter a commit message. Control Hub maintains the commit history of each pipeline.
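The relationship between publishing and commit history can be modeled roughly as follows. This is a conceptual sketch, not Control Hub's data model: the `PipelineHistory` class, the integer version numbers, and the commit messages are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineHistory:
    name: str
    commits: list = field(default_factory=list)  # (version, message) pairs, oldest first

    def publish(self, message):
        """Record a new published version with its commit message."""
        version = len(self.commits) + 1  # assumed numbering scheme: 1, 2, 3, ...
        self.commits.append((version, message))
        return version


history = PipelineHistory("test-pipeline")
history.publish("Initial design with development stages")
history.publish("Activated a metric rule")

for version, message in history.commits:
    print(version, message)
```

Each publish adds an entry rather than overwriting the previous one, which is what lets a history of earlier versions be kept.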
Now that we've registered a Data Collector, designed a pipeline, and published two versions of the test pipeline to Control Hub, let's add the pipeline to a job so that we can run it.
Add a Job for the Pipeline
A job defines the pipeline to run and the Data Collectors that run the pipeline.
When you add a job, you specify the published pipeline to run and you select Data Collector labels for the job. The labels indicate which group of Data Collectors should run the pipeline.
Start and Monitor the Job
Let's start our job and monitor its progress in Control Hub.
That's the end of our Control Hub tutorial on Data Collectors, pipelines, and jobs. Remember that our tutorial was simple by design to introduce the concepts of Control Hub. However, the true power of Control Hub is its ability to orchestrate and monitor many pipelines running across groups of Data Collectors.