Streaming real-time data

You can build streaming data flows that act on time-sensitive data, rather than waiting to process data on an intermittent or scheduled basis.

Data engineers use StreamSets to build and run streaming data flows that access and connect data across many types of data sources. A streaming flow runs continuously, reading, processing, and writing data as it becomes available. Streaming data flows support lightweight in-flight transformations.
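
As a rough illustration of that continuous read-process-write loop, the following Python sketch polls a source for new records, applies a light transformation, and writes each result as soon as it arrives. It is a conceptual example only, not StreamSets code; the queue-based source and the write callable are hypothetical stand-ins for a real origin and destination.

    import json
    import queue

    def transform(record):
        """Apply a light in-flight transformation: rename a field and add a flag."""
        record["event_ts"] = record.pop("timestamp", None)
        record["processed"] = True
        return record

    def run_streaming_flow(source, write, poll_timeout=1.0):
        """Read, process, and write records continuously as they become available."""
        while True:
            try:
                record = source.get(timeout=poll_timeout)  # wait briefly for new data
            except queue.Empty:
                continue  # nothing arrived yet; keep polling
            write(json.dumps(transform(record)))

    # Example: records put on the queue are processed as soon as they arrive.
    # events = queue.Queue()
    # events.put({"timestamp": "2025-01-01T00:00:00Z", "reading": 42})
    # run_streaming_flow(events, print)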

With StreamSets, data engineers can:
  • Access data from multiple types of external systems, including cloud data lakes, cloud data warehouses, and on-premises storage systems such as relational databases.
  • Build streaming data flows by using an intuitive graphical design interface.
  • Detect and correct unexpected data drift, such as fields that appear in or disappear from incoming records, as shown in the sketch after this list.
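
Data drift refers to unexpected changes in the structure of incoming data, such as fields that appear, disappear, or change type. The sketch below shows one simple way to flag drift by comparing each record against an expected field set; it illustrates the concept only and is not how StreamSets detects or corrects drift.

    EXPECTED_FIELDS = {"order_id", "amount", "currency"}

    def detect_drift(record):
        """Report fields that appear or disappear relative to the expected schema."""
        fields = set(record)
        return {
            "new_fields": sorted(fields - EXPECTED_FIELDS),
            "missing_fields": sorted(EXPECTED_FIELDS - fields),
        }

    # A record that gains a "channel" field and loses "currency" is flagged,
    # so the flow can route it for correction or raise an alert:
    # detect_drift({"order_id": 1, "amount": 9.99, "channel": "web"})
    # -> {"new_fields": ["channel"], "missing_fields": ["currency"]}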

Before you can build and run StreamSets flows in a project, you must create a StreamSets environment for the project. You then run a Data Collector engine in the location where the data resides, which can be on-premises or on a protected cloud computing platform. You build and manage your flows in IBM watsonx.data integration, then run the flows as jobs on the engine.

The engine uses the flow configuration to process the data. As the job runs, the engine sends status updates and metrics back to watsonx.data integration so that you can monitor the job's progress in real time. Because jobs run in your corporate network, you maintain full ownership and control of your data.
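
The sketch below illustrates that separation of duties: the data itself is processed and written inside your network, while only aggregate status and metrics are reported back for monitoring. The send_status hook is hypothetical and is not part of a real StreamSets or watsonx.data integration API.

    import json
    import time

    def process_locally(record):
        """Transform and write data entirely inside your own network."""
        record["normalized"] = True
        return record

    def build_status_payload(records_processed, error_count):
        """Only aggregate counts and timestamps leave the network, never the data."""
        return json.dumps({
            "records_processed": records_processed,
            "error_count": error_count,
            "reported_at": time.time(),
        })

    # Hypothetical reporting hook: an engine would send this payload to
    # watsonx.data integration so you can monitor the job in real time,
    # while the processed records go only to your own target systems.
    # send_status(build_status_payload(records_processed=1200, error_count=0))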

Requirements

StreamSets has the following requirements:

Cloud platforms
IBM Cloud
Required service
IBM watsonx.data integration
Data formats
StreamSets supports the following data formats:
  • Tables in relational data sources
  • Avro
  • Binary
  • Datagram
  • Delimited
  • Excel
  • JSON
  • Log
  • Parquet
  • Protobuf
  • Text
  • Whole File
  • XML
For more information, see Data formats overview.
Data size
StreamSets works with data of any size.
Required permissions
Your role determines which tasks you can complete:
  • To administer a StreamSets Data Collector engine for a project, you must have the Editor or Admin role in the project.
  • To create a StreamSets flow and run a job for the flow, you must have the Editor or Admin role in the project.
  • To view the job run details for a StreamSets flow, you need the Viewer, Editor, or Admin role in the project.

Streaming real-time data tasks

Complete the following high-level tasks to stream real-time data:

  1. Administer StreamSets Data Collector engines.

    An administrator creates a StreamSets environment for your project to configure a Data Collector engine. The administrator then runs the engine in your corporate network, which can be on-premises or on a protected cloud computing platform.

  2. Create StreamSets flows.

    A data engineer creates a StreamSets flow to define how data flows from source to target systems and how the data is processed along the way.

  3. Run jobs.

    A data engineer runs a job for a finished flow. The job runs on the environment that is selected for the flow. The sketch after these steps shows how the tasks fit together.
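
The following sketch ties the three tasks together in pseudo-code form. The client object and its methods are hypothetical placeholders for actions that you perform in the watsonx.data integration interface; they are not a real StreamSets or IBM SDK.

    def stream_real_time_data(client, project_id):
        # 1. Administer engines: create a StreamSets environment for the project
        #    and start a Data Collector engine inside your own network.
        environment = client.create_streamsets_environment(project_id, name="orders-env")
        engine = client.run_engine(environment, location="on-prem-vm-01")

        # 2. Create a flow that defines the source, the processing steps,
        #    and the target systems.
        flow = client.create_flow(project_id, name="orders-to-warehouse",
                                  environment=environment)

        # 3. Run a job for the finished flow. The job executes on the engine
        #    that belongs to the environment selected for the flow.
        job = client.run_job(flow)
        return engine, job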

Learn more