Although InfoSphere Streams can be used in a single-server environment, its ability to analyze massive volumes of data with world-class performance is achieved through a scale-out architecture running on a Linux® cluster. A working cluster is essentially a prerequisite for a scale-out Streams environment, and Streams does not dictate how you meet that prerequisite. How you create and maintain your cluster is a challenge that requires the appropriate tools and can significantly affect your cost of doing business. Even in an environment made up of a few nodes, keeping them in sync with one another can be challenging. But if your environment has tens, hundreds, or even thousands of nodes, a repeatable, automated methodology for setup and management of your Streams environment is essential.
This article describes how you can use IBM Platform Computing products to create and manage the Linux cluster(s) that host your Streams environment(s). Platform computing provides a broad set of capabilities that can be used to bridge the gap between available server hardware and a working Linux cluster to host Streams. And it goes beyond the setup of the cluster, also providing the capability to orchestrate the installation and configuration of Streams and other software you require in your environment. And after your environment is deployed and working, IBM Platform Computing provides the ability to manage the environment, including adding or removing capacity from the cluster, updating software and monitoring health and performance.
In many cases, your business needs may require multiple Streams environments. This article also introduces several common multi-environment scenarios and describes how IBM Platform Computing can be used to support them. This includes a discussion of how to use it to achieve a cloud environment in your enterprise to support multiple on-demand Streams environments.
What you'll need
In order to create and manage your Streams environment using IBM Platform Computing, you'll need the following:
- InfoSphere Streams (3.1 is the latest)
- IBM Platform Cluster Manager Advanced Edition (PCM-AE) (4.1 is the latest)
- A set of servers that you want to run Streams on
- A management server that PCM-AE will use to control the cluster (if you require a highly available cluster environment, you'll need one or more additional servers configured as backup management servers)
- A web browser with the Adobe Flash plug-in installed on any machine that you want to use as a management client
- Appropriate operating system and software packages required by Streams
PCM-AE will first need to be installed and configured as described in its product documentation. Once PCM-AE is in place, you can begin to use it to create and manage your Streams environments.
Setting up a Streams cluster
If you set up a Streams environment manually, it consists of a series of steps to first establish a working cluster, then another series of steps to set up Streams. These tasks prepare the environment in an incremental fashion until the setup is complete. With PCM-AE, the setup of a cluster of hosts is performed as a single transaction, consisting of a layered set of operations you've defined using the tooling.
The tooling will be described shortly, but first we'll briefly introduce the logical layers that need to be in place for a basic Streams environment:
- Operating system — The first step is to get the OS installed on a set of machines.
- Shared file system — Streams requires a shared file system accessible from all of its hosts.
- Cluster users and groups — Typically, users and groups should be defined consistently across all nodes in a Streams cluster (or in any cluster).
- Streams software prerequisites — Software packages Streams requires need to be installed before the installation of the Streams software.
- Streams software — The Streams product software needs to be installed after all of its prerequisites are met. Before automating the setup of Streams using PCM-AE for a large cluster of machines, it is recommended that you perform the tasks manually on one or two servers to understand and validate the steps involved and to generate some artifacts that will be used in by the automated setup. This includes running the Streams dependency checker to understand the prerequisite software packages that need to be installed before you can install Streams, and the generation of response file that can be used to perform a silent install in the automated setup.
Defining a Streams cluster with the Cluster Designer
With PCM-AE, you can define the layered setup described in the prior section by using the Cluster Designer, which is part of its management portal that runs in your browser. To locate the Cluster Designer in the management portal, click on the Clusters tab on the left side of the browser window. Next, select Definitions from the menu on the left, as shown in Figure 1. You will see the list of cluster definitions displayed on the right. Click New to create a new cluster definition. This will launch the Cluster Designer in a new window.
Figure 1. Navigation to the cluster definition area in the portal
Figure 2 shows the new Cluster Designer window in its initial state. Notice the stacks of shapes on the canvas. The terminology of the Cluster Designer refers to each stack a tier, while the parts of each tier are called layers.
Figure 2. Initial view of a new cluster definition
The initial state of the Cluster Designer assumes two tiers, corresponding to a cluster with a master/compute node topology, where there are two types of machines in the cluster: a master node (or nodes) to manage workload deployments and a set of compute nodes that the master would assign the workloads to. But the topology of a Streams cluster is different. From a Streams cluster point of view, all the nodes use the same Streams software, so initially we can think of a Streams cluster having only one kind of node. One of our first changes to the cluster definition will be to delete a tier from this definition.
Perform the following steps to begin the cluster design:
- Give the cluster a name and a description by filling in values on the properties area on the bottom section of the page.
- Delete one of the tiers by clicking on the rounded shape called ComputeNodes, which should select the stack representing the entire tier, and clicking the delete icon (trash can) in the upper right.
- Rename the other tier called Master by clicking on the rounded shape and updating the properties for the tier. We can give this tier the name Streams.
- Click the save icon (disc) in the upper right to save the definition.
This creates a cluster definition we can use to define how a typical Streams node will be set up.
Figure 3. The beginnings of a cluster definition with a single tier
The following sections describe how to define the layers for the Streams tier in the Cluster Designer.
Each tier has a special layer called the machine layer that defines the physical or virtual machine and the OS that will be installed. It defines a variety of properties dealing with hardware, OS, network, storage, and resource placement policies. Ultimately, when PCM-AE provisions the cluster for you, it will use this information to select which resources to provision and to set up the nodes.
We can perform the following actions to set up the machine layer in this cluster definition.
- Rename the machine layer (currently called Master) to RHEL by updating its properties.
- Go to the OS tab for the machine layer and select one of the available OS images for your Streams nodes, as shown in Figure 4. Select one of the templates ending in install-PMTools, as they have been prepared for use in PCM-AE clusters. (The OS images are created during PCM-AE's post-install setup while configuring its embedded xCAT provisioning engine.)
- Other properties can be set using the other tabs, including specifying a minimum and maximum number of machines for the cluster, or adding selection policies to control how servers will be chosen from the pool of available resources when you deploy your cluster.
Figure 4. Editing the machine layer properties (OS tab selected)
As you may have noticed in Figure 4, there is a box below the machine layer of the Streams tier called Pre-install script and another above the machine later called Post-install script. These placeholders represent script layers that can be added to a tier in the Cluster Designer to execute before or after the OS install defined in the machine layer.
To set up our Streams cluster nodes, we will use script layers to drive the additional installation and configuration tasks required for our environment beyond the base OS install:
- To create a post-install script layer to set up the shared file system, drag a New Script Layer element from the lower-left part of the Cluster Designer and drop it on the empty Post-install script box.
- In this environment, we'll be using NFS. We'll give our new script layer the name Mount NFS by updating its properties.
After completing the two steps above, our cluster definition is shown in Figure 5, but there is still more work to do to define the Mount NFS script layer.
Figure 5. Cluster Designer after a post-install script layer is added
This layer is referred to as a script layer because the layer needs to be associated with a script that PCM-AE will execute as each node is provisioned. In this case, a script to mount NFS on a node is needed.
As shown in Figure 5, this view allows you to click Import to use an existing script from a file, or Edit to enter the script directly.
If your script requires any parameters, the User Variables tab in the script layer properties can be used to specify them. For example, a script that mounts NFS might take the server name and mount point as parameters. The User Variables tab allows you to define basic parameters that get passed to the script or more sophisticated variables such as multi-value variables, other scripts, and password variables.
Script layers typically run on the machine before/after it is created, but you can also configure specific scripts to run only for certain actions, like when the cluster grows or shrinks, or when a machine is deleted. In addition, you can configure a script to run on the management server instead of on the provisioned machine. Use the Execution Properties tab, also visible in Figure 5, if you need to override the default behavior for a specific script.
After adding the layer to mount NFS, we add scripting layers to the tier:
- Users and Groups — Create the set of users and groups on each node needed in your Streams environment. This includes the user id that will own the Streams installation. In this step, the configuration of ssh keys (required by Streams) can also be performed for each user.
- Streams Prerequisites — Install the software packages required by Streams (e.g., gcc-c++, Perl packages, Java, etc.). The prerequisites for Streams are detailed in the product documentation, and any missing packages can be discovered by running the automated dependency checker provided by Streams.
- Install Streams — Often, Streams is installed once into the shared file system, and Streams executables, libraries, etc. are used by multiple nodes from that shared location. Alternatively, you may decide to install Streams locally on every node. A script that installs on every node needs to copy the Streams install package to the node and unpack it, then run a silent install of Streams on the node using the response file generated from a manual install.
In Streams 3.1, you have another option if you want to install Streams locally on every node. You can now build your own customized Streams RPM (following the instructions in the Streams product documentation) and use the RPM to install on individual nodes. Figure 6 shows the Streams cluster definition after adding these scripting layers.
Figure 6. Cluster Designer after layers have been added
The layers described so far are sufficient for setting up a Streams cluster. Additional layers could be added if you'd like to go beyond just setting up just the Streams platform, such as:
- Installation of other software — If your Streams applications rely upon other software packages or products, layers can be added to install those as well. Also, if your enterprise has other workloads that need to run on the same cluster, software to support those efforts can also be installed.
- Installation of Streams Studio — You may be setting up your cluster to run existing Streams applications, but if you also intend for it to support development of new applications, you could add a layer to install Streams Studio. Streams Studio is an Eclipse-based graphical environment for developing Streams applications.
Figure 7 shows a cluster definition for a Streams environment where all nodes have the ability to run Streams applications, and some nodes also have the Streams Studio development environment installed on them. This scheme uses two tiers, nearly identical, except for the layer that adds Streams Studio. Notice that the scripting layers in both tiers use common elements. PCM-AE allows you to save and reuse Prebuilt Elements in your personal workspace, and it also gives you the ability to share these with a wider scope of users, so they can reuse them in their cluster definitions.
Figure 7. Cluster Designer with a two-tier definition of a Streams cluster, providing development tooling on a subset of the nodes
As stated, you can do more in your cluster definition than just set up Streams. The needs of your enterprise will define exactly how the nodes in your Streams cluster will need to be set up, and PCM-AE provides the function and flexibility required to accomplish this.
Deploying the Streams Cluster
Once a cluster is defined, one or more instances of the cluster can be deployed. (See the "Multi-Cluster Scenarios" section for a discussion of why you might use multiple Streams clusters and how you would deploy them.)
Before deploying an instance of the Streams cluster, the cluster definition first needs to be published. To publish a cluster definition, you simply select an unpublished item in the Cluster Definitions list then choose the Publish action.
After the cluster definition is published, it is ready to be deployed. When a cluster is deployed, PCM-AE's provisioning engine will find the right types of machines from the pool of available capacity. It will then allocate these machines for the Streams cluster and will provision the machines according to your cluster definition. The OS will be installed as defined in the machine layer of the cluster definition, and the scripting layers of your cluster definition will be run to orchestrate the other setup steps.
To deploy your cluster definition, navigate to the Cluster tab and select the Cockpit menu, then click New. Next, select your cluster definition from the list of published definitions, and click Instantiate, as shown in Figure 8.
Figure 8. Streams Cluster definition selected in the Clusters Cockpit view, just before initiating the deployment of the cluster
After clicking Instantiate, the Create Cluster view is displayed, allowing you to customize the cluster deployment details and policies. There are many aspects you can customize for the cluster you are instantiating, including the name, the run time (i.e., how long the cluster will be instantiated), and criteria for the machines that will be chosen for the cluster. You can use the User Variables tab to assign values to variables used in the scripts you defined in your scripting layers (e.g., the location of your NFS server.)
Figure 9 shows customizations to create Streams cluster with no expiration time. Eight servers will be configured according to the Streams tier of our cluster definition, and two will be configured according to the StreamsAndDev tier. The resulting cluster will contain 10 servers, all with the ability to host Streams applications. Two of the servers will also support the development of Streams applications.
Figure 9. Cluster deployment details and policies
Monitoring and maintaining the Streams cluster
If you use PCM-AE to create your Streams cluster, it is only the beginning of the benefits you can realize by using PCM-AE. Once the Streams cluster is deployed, IBM Platform Computing offers many valuable features that make it easy to manage and maintain your cluster.
Streams provides a valuable set of monitoring capabilities throughout the product to ensure that your Streams instances, jobs, etc. are healthy. Performance metrics are also provided that allow you to monitor tuple rates, detect congestion, and otherwise understand how your Streams applications are performing. The monitoring capabilities within Streams are Streams-centric and need to be complemented with additional resource-centric monitoring to meet the needs of an enterprise. PCM-AE provides a set of resource-centric monitoring capabilities that are complementary to the monitoring that Streams provides, and the management portal provides a consolidated view for the resources that belong to a cluster.
Among the PCM-AE monitoring capabilities:
- Status of each machine in the cluster
- Performance charting aggregated across the entire cluster, or for an individual machine
- Configurable monitoring policies for your cluster using a set of rule-based alarms and automated actions
- The ability to add new metrics, alarms, and actions to those provided with PCM-AE
Since Streams applications typically run continuously for days, months, or even years, monitoring capability is essential. With the addition of PCM-AE monitoring, a number of situations relevant to Streams applications can be monitored for questions such as:
- Is a node in the cluster down?
- Has a node in the cluster reached a critical threshold?
- Is the resource consumption significantly uneven across various nodes in the cluster, suggesting that a different mapping of Streams application elements to the nodes in the cluster might be in order?
- Has the resource consumption in the cluster been as expected over the past few minutes, the past hour, the past day or the last week?
- Is there excess resource in my cluster that could be used to run additional Streams applications, and perform additional analysis?
PCM-AE provides all the capability you would need to maintain and manage your Streams cluster. This includes the ability to create, destroy, flex (grow or shrink), modify, and debug the cluster.
PCM-AE also provides the capability to manage the life cycle of the servers in your cluster, allowing you to quickly add servers. The management portal allows you to launch remote consoles to any server. Other server-level management capabilities are also provided, including power management, placing a server into maintenance mode, and handling firmware updates.
Other controls allow the PCMAE administrator to delegate control of portions of the resources to other business groups.
With the ability to manage your Streams cluster comes the ability to handle more sophisticated scenarios that many enterprises require involving more than one Streams cluster. This section explores several common, multi-cluster Streams scenarios, and explains how IBM Platform Computing can be used to manage them.
Scenario 1: Development and production
Often, an enterprise that uses Streams needs to support at least two environments:
- A development environment where applications are created and tested
- A mission-critical environment where applications execute in production
Because applications created and tested in the development environment are on a path to production, it is important that the two environments are essentially mirror images of one another. With PC-MAE, you can create the Streams development and production cluster instances from the same cluster definition to ensure that the two environments are identical (with the exception of intended differences such as the size of the clusters.) Or, if there are architected differences between the two environments (such as the presence of development tooling), two cluster definitions that share a set of prebuilt elements could be used. PCM-AE allows you to create the clusters consistently, to ensure that any differences between them are intended and not accidental.
In the development/production scenario, a common task is to promote development artifacts into production. To accomplish this, applications from the development environment must be made available to the production environment. One way to achieve this is to use the GPFS file system support provided in PCM-AE, allowing you to manage pre-created GPFS file systems and have them automatically mounted by both clusters. This shared area can be used for the staging and promotion of artifacts from development to production.
Scenario 2: Multiple organizations using Streams
Imagine that there are two organizations that use Streams to analyze data, derive knowledge and make decisions. PCM-AE provides the capabilities to create two separate Streams clusters, each with their own resources. Even though the environments are distinct from one another, they can be managed from a single console and prebuilt elements can be reused in both environments where appropriate.
When serving multiple organizations, it's important to achieve the right level of separation between the clusters. On one end of the spectrum are clusters that need to be completely isolated from one another for security reasons. On the other end are clusters being used in a cooperative fashion, where interaction between the organizations using the clusters is essential.
With PCM-AE, you can decide how much separation you want — from a shared environment where all servers are visible to every other server to isolated clusters where the servers from one cluster are not visible from another cluster. The shared environment is provided by default. Isolated clusters require a combination of IP management and dynamic VLAN configuration.
Scenario 3: Migration to a new version of Streams
Cluster management using PC-MAE can also help you transition to a newer release of Streams. Imagine a large cluster running Streams in production. When a new release of Streams becomes available, it might be prudent to test the production applications against the new release in a piecemeal fashion. A small cluster environment can be created for running the new Streams version, and applications can be deployed in the environment in an incremental fashion. As applications are moved to the new release, the management portal can be used to flex down (shrink) the original environment's resource pool, and these resources can be redeployed to the new environment. Ultimately, the new environment replaces the old one, and the production cluster is now running the latest version of Streams. Once again, IBM Platform Computing provides the single point of control for the transition between clusters.
A similar approach can be used for implementing operating system upgrade scenarios, or for upgrading other software in the environment.
Scenario 4: Running Streams in the cloud
Using PCM-AE, you can create a cloud-like environment for your Streams users in your enterprise. You can create multiple Streams clusters supporting different goals of the enterprise. This might range from environments for experimental data analysis being used to explore different hypotheses to mission-critical production environments that run a validated set of applications using proven analysis techniques. Different organizations can be given their own cluster, and computing resources can be moved from one environment to another as necessary.
Achieving a cloud-like environment in your enterprise does not require a lot of additional work. Once you've created one cluster environment using PCM-AE, you're most of the way there because PCM-AE consolidates management of all your cluster environments into one centralized interface to simplify infrastructure management.
InfoSphere Streams is helping organizations meet the challenge of processing big data in real time. IBM Platform Computing has been managing large clusters successfully for many years. Using the two products together simplifies the adoption of big data through a cluster management methodology and tooling to efficiently manage your Streams infrastructure.
- Get an overview of InfoSphere Streams in "An introduction to InfoSphere Streams: A platform for analyzing big data in motion."
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
- Find more details about InfoSphere Streams in the IBM Redbooks publication titled "InfoSphere Streams: Harnessing data in motion."
- Get an overview of IBM Platform Computing Cluster Manager and explore Platform Computing solutions in the IBM Redbooks titled "IBM Platform Computing Solutions."
- Find more details about Platform Computing in the IBM Redbooks titled "Accelerating Business Results for Compute and Data-intensive Applications."
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- Download InfoSphere Streams, available as a native software installation or as a VMware image.
- Use InfoSphere Streams on IBM SmartCloud Enterprise.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Ask questions and get answers in the InfoSphere Streams forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.