Getting started with IBM StreamSets

Use IBM StreamSets to build, run, and monitor streaming data pipelines. A streaming data pipeline runs continuously to read, process, and write data as soon as the data becomes available. With streaming pipelines, you can act on time-sensitive data, rather than waiting to process data on an intermittent or scheduled basis.

With IBM StreamSets, data engineers can:
  • Access data from multiple types of external data sources that are located in the cloud or on premises.
  • Detect and correct unexpected data drift.
  • Collaboratively build pipelines as a team.
  • Design reusable fragments to add the same processing logic to multiple pipelines.

Checking whether the service is installed

An administrator must install IBM StreamSets.

To check whether the service is installed:

  1. From the navigation menu, select Services > Services catalog.
  2. Search for IBM StreamSets.

If the service is installed and ready to use, the tile in the catalog shows Ready to use.

If the service is installed but no service instances have been created, the tile in the catalog shows Ready to provision.

Important: Even if the service shows Ready to use, you must be added to a service instance before you can use the service.

Accessing the service

IBM StreamSets is a pop-out service. You can access the service from the Services > Instances page.

Checking whether an engine is deployed

Data Collector is an engine that processes data. An IBM StreamSets organization administrator must deploy a Data Collector engine and grant you access to the engine before you can begin building a streaming pipeline.

To check whether you have access to a deployed Data Collector engine:

  1. Open IBM StreamSets from the Services > Instances page.
  2. From the IBM StreamSets navigation menu, select Set Up > Engines.

    If you have access to a deployed Data Collector engine, the engine URL is listed.
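
If you want to confirm from a script that the listed engine URL is reachable from your network, a quick check along the following lines can help. This is a minimal sketch in Python, not part of the product steps: the URL is a placeholder, and authentication and TLS requirements depend on how your administrator deployed the engine.

  # Quick reachability check for a Data Collector engine URL.
  # ENGINE_URL is a placeholder; use the URL listed on the Set Up > Engines page.
  # The engine might require authentication or a trusted TLS certificate,
  # so treat any HTTP response at all as "reachable".
  import requests

  ENGINE_URL = "https://datacollector.example.com:18630"  # hypothetical URL

  try:
      response = requests.get(ENGINE_URL, timeout=10)
      print(f"Engine responded with HTTP {response.status_code}")
  except requests.exceptions.RequestException as exc:
      print(f"Engine is not reachable from this host: {exc}")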

Building and running a streaming pipeline

To build a pipeline, you add origins, processors, and destinations to the graphical pipeline canvas.

Figure: Stages added to the pipeline canvas

  1. Create a pipeline.
    1. Open IBM StreamSets, and then click Go to Pipeline Canvas > Data Collector.
    2. Select the Data Collector engine that your organization administrator has deployed, and then click Next.
  2. Choose the source.
    1. In the pipeline canvas, click Add Origin and select the external system that you want to read from.
    2. Configure the origin properties.

      For more information, see Origins in the IBM StreamSets documentation.

  3. Specify how to transform the data.
    1. Click Add Stage > Processors and select a processor. For example, you might want to mask sensitive data, remove unnecessary fields, or perform calculations on data.
    2. Configure the processor properties.

      For more information, see Processors in the IBM StreamSets documentation.

    3. Add more processors to transform the data in other ways.
  4. Choose the target.
    1. Click Add Stage > Destinations and select the external system that you want to write to.
    2. Configure the destination properties.

      For more information, see Destinations in the IBM StreamSets documentation.

  5. Run the pipeline.

    Click Draft Run > Start Pipeline. As the pipeline runs, you can view statistics and error information about the data as it flows from the origin to the destination system.
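
The steps above use the pipeline canvas. The same origin, processor, and destination flow can also be built programmatically. The following is a minimal sketch that assumes the StreamSets Platform SDK for Python is installed (pip install streamsets) and that your organization administrator has generated API credentials; the credential values, engine ID, and stage choices (Dev Raw Data Source, Field Masker, Trash) are placeholders for illustration and are not part of the steps above.

  # Build and publish an origin -> processor -> destination pipeline with the
  # StreamSets Platform SDK for Python. All IDs and tokens are placeholders.
  from streamsets.sdk import ControlHub

  CRED_ID = '<API credential ID>'           # placeholder
  TOKEN = '<API credential token>'          # placeholder
  ENGINE_ID = '<Data Collector engine ID>'  # placeholder

  sch = ControlHub(credential_id=CRED_ID, token=TOKEN)

  # Build the pipeline against the deployed Data Collector engine.
  builder = sch.get_pipeline_builder(engine_type='data_collector',
                                     engine_id=ENGINE_ID)

  # Origin: a development stage that emits inline test records.
  origin = builder.add_stage('Dev Raw Data Source')
  origin.set_attributes(data_format='JSON',
                        raw_data='{"name": "example", "ssn": "123-45-6789"}')

  # Processor: mask sensitive data. Configure the fields to mask before a real run.
  masker = builder.add_stage('Field Masker')

  # Destination: discard records for this sketch; replace with your target system.
  destination = builder.add_stage('Trash')

  # Connect the stages, build the pipeline, and publish it to your organization.
  origin >> masker >> destination
  pipeline = builder.build('Getting started pipeline')
  sch.publish_pipeline(pipeline)

The development origin and the Trash destination keep the sketch self-contained; in practice, you would swap in the stages for the external systems that you actually read from and write to.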

Learn more

To learn more about IBM StreamSets, see the following topics in the IBM StreamSets documentation: