Try IBM StreamSets
This tutorial covers the steps needed to try IBM StreamSets. Although the tutorial provides a simple use case, keep in mind that IBM StreamSets enables you to build, run, and monitor large numbers of complex pipelines.
- IBM StreamSets as a Service
  - Use the following URL to sign up for a free trial: https://cloud.login.streamsets.com/signup
- IBM StreamSets as client-managed software
  - An administrator must install the IBM StreamSets service on IBM Software Hub and give you access to the service. To determine whether the service is installed, open the Services catalog and check whether the service shows as Installed, Ready to use.
Build a Pipeline
Build a pipeline to define how data flows from origin to destination systems and how the data is processed along the way.
This tutorial builds a pipeline that reads a sample CSV file from an HTTP resource URL, processes the data to convert the data type of several fields, and then writes the data to a JSON file on your local machine.
The sample CSV file includes some invalid data, so you'll also see how errors are handled when you preview the pipeline.
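As a rough sketch of what this pipeline does, the same transformation can be expressed as a shell one-off. The sample data and field names below are hypothetical stand-ins (the tutorial's real data comes from the HTTP resource URL), and the error handling only approximates how the pipeline routes invalid records to an error stream:

```shell
# Hypothetical sample data standing in for the tutorial's CSV file:
# a header row, one valid record, and one record with an invalid price.
cat > /tmp/sample.csv <<'EOF'
id,name,price
1,widget,9.99
2,gadget,not-a-number
EOF

# Convert the "id" and "price" fields to numbers and write valid records
# as JSON; records whose price fails numeric conversion go to stderr,
# roughly as the pipeline routes them to an error stream.
awk -F, 'NR > 1 {
  if ($3 + 0 == $3)
    printf "{\"id\": %d, \"name\": \"%s\", \"price\": %.2f}\n", $1, $2, $3
  else
    print "error record: " $0 > "/dev/stderr"
}' /tmp/sample.csv > /tmp/sample.json

cat /tmp/sample.json
```

In the pipeline itself, this type conversion is configured in a processor rather than scripted, and the JSON destination writes the output file for you.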
Deploy an Engine
IBM StreamSets allows you to unlock your data without ceding control. Deploy a Data Collector engine to your local machine to maintain all ownership and control of your data.
Run a Docker Image
Compatible with most operating systems.
Download and Install from a Script
Not compatible with Windows. Windows users must run a Docker image.
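Whichever option you choose, Control Hub generates the exact installation command for you when you create the deployment, including a unique deployment ID and token. The sketch below only illustrates the general shape of a Docker-based deployment; the environment variable values and image tag are placeholders, not values to copy:

```shell
# Illustrative only -- copy the generated command from the Control Hub UI.
# The URL, deployment ID, token, and version below are placeholders.
docker run -d \
  -e STREAMSETS_DEPLOYMENT_SCH_URL=<control-hub-url> \
  -e STREAMSETS_DEPLOYMENT_ID=<deployment-id> \
  -e STREAMSETS_DEPLOYMENT_TOKEN=<deployment-token> \
  streamsets/datacollector:<version>
```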
Troubleshooting
Use the following troubleshooting tips for help with deploying an engine:
- When I run the command to deploy an engine, I get the following error:
  Could not resolve host: na01.hub.streamsets.com
- When I run the Docker command to deploy an engine on Linux, I get the following error:
  permission denied while trying to connect to the Docker daemon socket
- When I run the Docker command to deploy an engine on Windows, I get the following error:
  Error: error during connect: this error may indicate that the docker daemon is not running
- When I run the Docker command to deploy an engine on Windows, I get the following error:
  Error: error during connect: in the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect
- When I try to start Docker Desktop on Windows, I receive a message that Docker Desktop requires a newer WSL kernel version.
  On some Windows machines, you must update the WSL kernel to the latest version before you can start Docker Desktop.
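The commands below sketch common checks and remedies for the errors above. They assume standard Docker and WSL tooling on your machine, and the Windows commands must be run from an elevated (Run as administrator) terminal:

```shell
# "Could not resolve host": verify that the Control Hub host resolves
# from your machine -- a proxy, firewall, or VPN often blocks this.
nslookup na01.hub.streamsets.com

# "permission denied ... Docker daemon socket" (Linux): add your user to
# the docker group, then start a new login session (or run `newgrp docker`).
sudo usermod -aG docker "$USER"

# "docker daemon is not running" / "must be run with elevated privileges"
# (Windows): start Docker Desktop, use an Administrator terminal, and
# verify that the client can reach the daemon:
docker info

# "requires a newer WSL kernel version" (Windows): update the WSL kernel
# from an elevated PowerShell prompt, then restart Docker Desktop.
wsl --update
wsl --status
```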
Preview the Pipeline
Now that you've deployed an engine, you can preview the pipeline.
Run a Job
Next, you'll check in the pipeline to indicate that your design is complete and the pipeline is ready to be added to a job and run. When you check in a pipeline, you enter a commit message. IBM StreamSets maintains the commit history of each pipeline.
A job is the execution of a dataflow. Jobs enable you to manage and orchestrate large-scale dataflows that run across multiple engines.
Since this pipeline processes one file, there's no need to enable the job to start on multiple engines or to increase the number of pipeline instances that run for the job. As a result, you can simply use the default values when creating the job. As you continue to use IBM StreamSets, you can explore how to run pipelines at scale.
Monitor the Job
Next, you'll monitor the progress of the job. When you start a job, Control Hub sends the pipeline to the Data Collector engine deployed to your local machine. The engine runs the pipeline, sending status updates and metrics back to Control Hub.
Next Steps
- Invite others to join
  - Invite other users to join your organization and collaboratively manage pipelines as a team.
- Modify your first pipeline
  - Modify your first pipeline to add a different Data Collector destination that writes to another external system. You can also add more processors to explore the other types of processing available in Data Collector pipelines.
- Complete a more detailed tutorial
  - Complete a more detailed Data Collector pipeline design tutorial to learn about additional processors, how a pipeline can process data in two branches, and how to use data rules to raise an alert during pipeline processing.
- Explore sample pipelines
  - Explore the sample pipelines included with Control Hub.
- Explore engines
  - Compare the IBM StreamSets engines to learn about their differences and similarities.
  - Set up and deploy an engine in your cloud service provider account, such as Amazon Web Services (AWS) or Google Cloud Platform (GCP).
  - Learn how engines communicate with Control Hub to securely process your data.
- Explore team-based features
  - Learn how teams of data engineers can use Control Hub to collaboratively build pipelines. Control Hub provides full lifecycle management of pipelines, letting you track version history and giving you full control of the evolving development process.
  - To create a multitenant environment within your organization, create groups of users. Grant roles to these groups and share objects with them to give each group access to the appropriate objects.
  - Use connections to limit the number of users who need to know the security credentials for external systems. Connections also provide reusability: you create a connection once, and other users can then reuse it in multiple pipelines.
  - Use job templates to hide the complexity of job details from business analysts.
- Explore advanced features
  - Use topologies to map multiple related jobs into a single view. A topology provides interactive, end-to-end views of data as it traverses multiple pipelines.
  - Create a subscription that listens for Control Hub events and completes an action when those events occur. For example, a subscription can send a message to a Slack channel or email an administrator each time a job status changes.
  - Create a sequence to run a collection of jobs in order, based on conditions.
  - Schedule a job to start or stop on a weekly or monthly basis.