Streaming real-time data
Build streaming data flows to act on time-sensitive data, rather than waiting to process data on an intermittent or scheduled basis.
Data engineers use the StreamSets tool to build and run streaming data flows that access and connect data across various types of data sources. A streaming flow runs continuously to read, process, and write data when the data becomes available. Streaming data flows support light in-flight transformations.
- Access data from multiple types of external systems, including cloud data lakes, cloud data warehouses, and storage systems installed on-premises such as relational databases.
- Build streaming data flows by using an intuitive graphical design interface.
- Detect and correct unexpected data drift.
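The read, process, and write cycle described above can be illustrated with a minimal Python sketch. It is not StreamSets code and does not use any StreamSets API. The function names (`read_records`, `transform`, `write_records`, `run_flow`), the JSON-lines input, and the `_drifted_fields` marker are all assumptions made for illustration. The sketch handles each record as soon as it is available, applies one light in-flight transformation, and flags fields that were not expected. It only detects that kind of drift. How a real flow corrects drift is not shown here.

```python
import json
from typing import Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Hypothetical origin: yield each record as soon as its line is available."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def transform(records: Iterable[dict], known_fields: set) -> Iterator[dict]:
    """Light in-flight transformation plus a simple drift check (illustrative)."""
    for record in records:
        new_fields = set(record) - known_fields
        if new_fields:
            # Drift: the record carries fields the flow has not seen before.
            # Remember them and tag the record instead of failing.
            known_fields.update(new_fields)
            record = {**record, "_drifted_fields": sorted(new_fields)}
        if "name" in record:
            # Example transformation: normalize the (assumed) "name" field.
            record = {**record, "name": str(record["name"]).strip().upper()}
        yield record


def write_records(records: Iterable[dict], sink: list) -> int:
    """Hypothetical destination: append each record to an in-memory sink."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


def run_flow(lines: Iterable[str], sink: list, known_fields: set) -> int:
    """Chain the stages lazily so that records move through one at a time."""
    return write_records(transform(read_records(lines), known_fields), sink)
```

Because each stage is a generator, `run_flow` can be given any iterable of lines, including one that never ends, such as a socket or a tailed file. A scheduled batch job, by contrast, would wait to collect the data before processing it.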
Before you can build and run StreamSets flows, you must create a StreamSets environment to configure Data Collector engines for your project. You then run the engines in the location where the data resides, which can be on-premises or on a protected cloud computing platform. You build and manage your flows in IBM watsonx.data integration, then run the flows as jobs on the engines.
An engine uses the flow configuration to process the data. As the job runs, the engine sends status updates and metrics back to watsonx.data integration so that you can monitor the job progress in real time. Because jobs run in your corporate network, you retain full ownership and control of your data.
Requirements
StreamSets has the following requirements:
- Cloud platforms
- IBM Cloud, AWS
- Required service
- IBM watsonx.data integration
- Data formats
- StreamSets supports the following data formats:
- Tables in relational data sources
- Avro
- Binary
- Datagram
- Delimited
- Excel
- JSON
- Log
- Parquet
- Protobuf
- Text
- Whole File
- XML
- Data size
- StreamSets works with data of any size.
- Required permissions
- Your role determines which tasks you can complete:
- To administer a StreamSets environment for a project, you must have the Editor or Admin role in the project.
- To create a StreamSets flow and run a job for the flow, you must have the Editor or Admin role in the project.
- To view the details of job runs for a StreamSets flow, you must have the Viewer, Editor, or Admin role in the project.
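The role requirements above can be summarized as a small lookup table. The following Python sketch is illustrative only. The task keys and the `can_perform` helper are hypothetical names, not part of any watsonx.data integration API. Only the role names and the task-to-role mapping come from the list above.

```python
# Roles allowed for each task, as listed above.
# The task keys are hypothetical labels chosen for this sketch.
ROLES_BY_TASK = {
    "administer_environment": {"Editor", "Admin"},
    "create_flow": {"Editor", "Admin"},
    "run_job": {"Editor", "Admin"},
    "view_job_run_details": {"Viewer", "Editor", "Admin"},
}


def can_perform(role: str, task: str) -> bool:
    """Return True if the project role is allowed to perform the task."""
    try:
        allowed = ROLES_BY_TASK[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}") from None
    return role in allowed
```

For example, `can_perform("Viewer", "view_job_run_details")` is true, while `can_perform("Viewer", "run_job")` is false, because a Viewer can inspect job runs but cannot start them.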
Streaming real-time data tasks
Complete the following high-level tasks to stream real-time data:
- Administer StreamSets environments.
Create a StreamSets environment to configure Data Collector engines for your project. Then run the engines in your corporate network.
- Create StreamSets flows.
Create a StreamSets flow to define how data flows from source to target systems and how the data is processed along the way.
- Run jobs.
Run a job for a finished flow. The job runs on the environment that is selected for the flow.