Getting started with real-time stream computing

How InfoSphere Streams helps turn data into insight

Use InfoSphere® Streams to turn volumes of data into information that helps predict trends, gain competitive advantage, gauge customer sentiment, monitor energy consumption, and more. InfoSphere Streams acts on data in motion for real-time analytics. Get familiar with the product and find out where to go for tips and tricks that speed implementation.

Jacques Roy (jacquesr@us.ibm.com), WW Technical Sales, Big Data, InfoSphere Streams, IBM

Jacques Roy has worked in many technology areas, including operating systems, databases, and application development. He is the author of books, IBM Redbooks, and developerWorks articles. He has also been a presenter at many conferences, including IBM's Information on Demand (IOD).



10 September 2013


Setting the stage

IBM introduced the term "smarter planet" several years ago. With it, IBM described three main attributes:

  • Instrumented
  • Intelligent
  • Interconnected

InfoSphere Streams Quick Start Edition

InfoSphere Streams Quick Start Edition is a complimentary, downloadable, non-production version of InfoSphere Streams, a high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. With no data or time limits, InfoSphere Streams Quick Start Edition enables you to experiment with stream computing in your own unique environment. Build a powerful analytics platform that can handle incredibly high data throughput, up to millions of events or messages per second. Download Streams Quick Start Edition now.

This vision has been becoming a reality for quite a while now. Think of the proliferation of smartphones, which do a lot more than keep us connected: they include a set of sensors ranging from GPS to temperature and humidity. Add to that all the sensors and meters in use around the world, and we see a reality that continues to evolve.

That constitutes a deluge of data that begs to be converted into information. We could take advantage of the power of crowdsourcing to gain insight into changing situations, for example. We can detect trends that could generate opportunities or avert costly disasters. We see this in real-time analysis of social media sentiment, in monitoring the current status of energy consumption, and in analysis of key performance indicators in telecommunications and other industries. Going further into examples is beyond the scope of this article.

There are a few major issues with taking advantage of this new data opportunity. One is the sheer volume of data; another is the ephemeral nature of much of it. How can we quickly sift through what comes in and focus on the parts we are interested in at the moment?

Of course, you can expect that my answer leads to InfoSphere Streams. First, though, we need to look at the difference between data at rest and data in motion.


Data at rest

It is safe to assume that everyone is very familiar with data at rest: files and databases. No matter where the data comes from, it is stored on a disk before it is used. We have seen this model evolve over the years. Files were difficult to update, and concurrent updates were simply impossible. Then came hierarchical and network databases, in which the organization of the data was optimized for one specific use case, leaving worst-case access paths for virtually all other use cases. This led to the adoption of relational databases.

Eventually, the relational model took hold and evolved to include extensibility in the form of the object-relational model. The most successful characteristic of this model is the ability to include additional data types and functionality to answer business needs such as processing spatial data and doing text search. It can go as far as providing specialized storage and processing, as demonstrated by the Informix® TimeSeries capabilities.

Over the past few years, we saw an additional push for the analysis of unstructured data, which led to the rise of Hadoop (and the enterprise-ready InfoSphere BigInsights™ and satellite products). These advances make it easier to analyze large volumes of data, but they still require data to be stored on a disk before it is processed.

Data at rest is a model that is here to stay, and it will continue to expand as it helps address big data challenges. The rise of real-time analytics, however, demands the addition of another model: data in motion.


Data in motion

Data in motion is also a known concept. Think of a movie — this is data in motion. Not because people move on the screen but because it is a flow of images that passes through. Each image is there for an instant and disappears.

In any software application, the data must first be put in motion before it can be acted upon. It flows from one function to another, one thread to another, one process to another, and one computer to another.

Imagine the performance implications for an application if data had to be put at rest before it could be acted upon. The constant conversion between data in motion and data at rest would make any significant processing impossible. Efficient processes try to limit the need to put data at rest as much as possible, since disk drives are among the slowest components of a computer system.

Another advantage of data in motion is that we avoid the need for specialized storage mechanisms that optimize the retrieval of data from a sea of similar data. If data is in motion, you already have it. If you have to retrieve it from an at-rest repository, you need to figure out how to retrieve it, whether through sequential or indexed access. Either way, it involves a lot more than just reading the data as it passes through.


Real-time processing

Before we go any further, we need to clarify the term real-time processing. There are situations in which access to processing and the amount of time required to process data must be guaranteed by the environment. To ensure a specified response time, the executing program cannot be paused for any reason. In these situations, the processing must run in a specialized operating environment, under an operating system that supports this type of scheduling.

In a looser context, real-time processing can mean processing data anywhere from within a fraction of a second to minutes or even hours. What is really at stake here is the latency between the availability of the data and the creation of information. Your data may be coming in bursts every 15 minutes, but the latency is the time between a burst and the availability of the information.

Many business cases, such as social data analysis, depend on defined latency levels for the processing to be useful. The success of a project can hinge on how much we can reduce this latency.

Keep in mind that this discussion does not take anything away from the data at rest model. It is still needed in many cases for less-stringent "real-time" requirements and, of course, for in-depth analysis of historical data. That analysis can be key to adapting the real-time processing to the business's changing reality.


What is InfoSphere Streams?

One of my colleagues came up with this very short definition: "A platform for real-time analytics on big data."

The "big data" part refers to the volume, variety, and velocity of the data. We are talking about any type of data (variety), potentially terabytes (volume) coming at you very quickly (velocity).

When we say "analytics," we refer to the capability to process and analyze the data through custom programming, including the use of tools such as SPSS® or the R Project environment.

The "real-time" part is the fact that latency is eliminated as much as possible by processing data in memory.

Finally, "platform ... for big data" refers to the InfoSphere Streams ability to transparently distribute processing over a cluster of machines for scalability, and monitor and manage the environment.

What does the platform provide and how do we use it? Good questions. Before we get to that, let me introduce a one-stop shop for InfoSphere Streams information.


Introducing the InfoSphere Streams Playbook

There is a lot of information available about InfoSphere Streams. It comes in the form of documentation, IBM Redbooks®, developerWorks articles, training videos, use cases, and more. How do you find and navigate this information?

A developerWorks wiki called the InfoSphere Streams Playbook is available for this purpose (see Resources). If for any reason you can't remember the URL, a Google search for "Streams Playbook" should list it on the first results page.

This wiki includes the following sections:

  • Reference material
  • Video tutorial
  • Video use cases
  • Other use cases (currently empty)
  • Ecosystem
  • Developer corner

The rest of this article refers to sections of this wiki for more information.


What is InfoSphere Streams for?

InfoSphere Streams was developed for security applications in which a large amount of data had to be processed quickly to help prevent problems. Because data from disparate systems used various formats, the requirements included:

  • Support for any type of data
  • The need to reduce latency with in-memory processing
  • The ability to scale by using cluster support

InfoSphere Streams is also used in the telecommunications industry to enable telecom companies to react quickly to customer issues. When a customer experiences dropped calls, the company can take proactive action to compensate the customer automatically and maintain customer satisfaction, rather than waiting for the customer to complain or look for another carrier. In some highly competitive markets, customer satisfaction is equivalent to customer retention. Eliminating customer defection is critical.

InfoSphere Streams is also being used in transportation, healthcare, and other industries. See the "Video use cases" section of the playbook for more information.


InfoSphere Streams environment

Although InfoSphere Streams can run on a single machine, it was designed to run on a cluster to provide virtually unlimited scalability. A machine can serve as a management host, an application host, or a mixed-mode host, which runs both management services and application services and code.

Figure 1. A mix of management hosts and application hosts in a cluster

Notice that the management node runs a set of services that keep track of the health of the cluster and of running jobs. A job can be thought of as a program. The difference between a job and a program is that a job is made up of operators that can be scheduled to run on any available application host: a job is composed of multiple processes, whereas a program runs in one process. Learn more about the InfoSphere Streams runtime environment in the Information Center and in the "Video tutorial" section of the InfoSphere Streams Playbook (see Resources).

InfoSphere Streams includes several tools to help you manage the environment and develop applications for InfoSphere Streams:

  • FirstSteps— Perform post-installation tasks, such as configuring SSH, generating public and private keys, and configuring the recovery database.
  • Instances manager— Create, configure, update, and remove instances and clusters. The instances manager can also launch the InfoSphere Streams console.
  • InfoSphere Streams console— Web-based GUI used to monitor and manage instances and applications.
  • InfoSphere Streams Studio— Eclipse-based tooling used to develop applications. It includes a graphical editor and wizards to simplify standard development tasks. It also includes visualization of running InfoSphere Streams jobs and their operators.
  • streamtool— Command-line tool used to automate management and monitoring tasks.
  • InfoSphere Streams compiler (sc)— Used through the tooling to automate compiling tasks. It can also be used directly or through makefiles (see Listing 1 below).

The graphical tools (the first four bullets in the list) greatly increase your productivity once you learn them. The learning curve is short and worth the effort.
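If you prefer the command line, the typical flow is to compile with sc and then manage the job with streamtool. Listing 1 is a minimal sketch, assuming a default instance is already running; the composite name, directories, and job ID are illustrative, and the exact options can vary by release.

Listing 1. Compiling and submitting a job from the command line

    # compile the main composite sample::Main into the output directory
    sc -M sample::Main --output-directory output

    # submit the compiled application (assumes a default instance is running)
    streamtool submitjob output/sample.Main.adl

    # list running jobs, then cancel job 0 when done
    streamtool lsjobs
    streamtool canceljob 0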


InfoSphere Streams programming

InfoSphere Streams programming starts with putting together operators to modularize processing and enable concurrent execution. A graphical editor makes it easier to connect operators together to create a processing graph for a job.

InfoSphere Streams also uses a procedural language called Streams Processing Language (SPL). If you are familiar with C, the Java™ programming language, or Python, you will become fluent in SPL in no time.
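To give you a feel for the language, Listing 2 is a minimal sketch of an SPL application: a Beacon operator generates a few tuples, and a Custom operator prints them. The operators come from the standard toolkit; the schema and values are invented for illustration.

Listing 2. A minimal SPL application

    composite HelloStreams {
        graph
            // Beacon (standard toolkit) generates tuples on a timer
            stream<rstring greeting> Hello = Beacon() {
                param
                    iterations : 3u;  // emit three tuples, then stop
                output
                    Hello : greeting = "Hello, Streams!";
            }

            // Custom runs arbitrary SPL logic on each incoming tuple
            () as Printer = Custom(Hello) {
                logic
                    onTuple Hello : printStringLn(greeting);
            }
    }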

One aspect of InfoSphere Streams that may be different from most languages is its extensive set of available data types, as shown below.

Figure 2. Data types available in InfoSphere Streams

With these types, you can match virtually anything. You can even create your own data types based on the primitive types provided.
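As a sketch of what that looks like, Listing 3 defines a tuple type that combines primitive and collection types in a composite's type clause. The attribute names are invented for illustration.

Listing 3. Defining a custom tuple type

    composite TypesDemo {
        type
            // a user-defined type built from primitive and collection types
            Reading = tuple<rstring sensorId,            // text
                            timestamp when,              // point in time
                            float64 value,               // 64-bit float
                            list<int32> samples,         // ordered collection
                            map<rstring, rstring> tags>; // key/value pairs
        graph
            // the type can then serve as a stream schema
            stream<Reading> Readings = Beacon() {
                param
                    iterations : 1u;
            }
            () as Show = Custom(Readings) {
                logic
                    onTuple Readings : printStringLn(sensorId);
            }
    }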

In addition to the data types, InfoSphere Streams comes with a significant set of functions in categories that include collection manipulation (list, set, and map), file, math, string, time, and utility functions. In total, InfoSphere Streams includes more than 250 functions, not counting the overloads that share a name but operate on different data types, or the functions specific to a given operator. The specialized toolkits included with InfoSphere Streams add even more.
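Listing 4 sketches a few of those functions at work inside a Functor. The stream and attribute names are invented; upper, length, sqrt, abs, ctime, and getTimestamp are standard-toolkit functions.

Listing 4. Applying standard-toolkit functions

    composite FunctionDemo {
        graph
            // generate one sample tuple
            stream<rstring name, float64 value> In = Beacon() {
                param
                    iterations : 1u;
                output
                    In : name = "streams", value = -16.0;
            }

            // apply string, math, and time functions to each tuple
            stream<rstring upperName, int32 nameLen, float64 root,
                   rstring stamp> Enriched = Functor(In) {
                output
                    Enriched : upperName = upper(name),           // string
                               nameLen   = length(name),          // string
                               root      = sqrt(abs(value)),      // math
                               stamp     = ctime(getTimestamp()); // time
            }

            () as Show = Custom(Enriched) {
                logic
                    onTuple Enriched : printStringLn(upperName + " " + stamp);
            }
    }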

InfoSphere Streams programming starts with operators. Operators handle getting data from the outside world through a variety of means, perform data transformations, and deliver the final results. This lets you put together the framework of a solution very quickly and focus on solving your business problems.
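Putting those pieces together, Listing 5 sketches a minimal end-to-end graph: read records from a file, filter them, and write the survivors back out. FileSource, Filter, and FileSink are standard-toolkit operators; the file names and schema are invented.

Listing 5. A source-transform-sink job

    composite Pipeline {
        graph
            // source: read CSV records from a file
            stream<rstring symbol, float64 price> Quotes = FileSource() {
                param
                    file   : "quotes.csv";  // hypothetical input file
                    format : csv;
            }

            // transform: keep only the tuples of interest
            stream<Quotes> Expensive = Filter(Quotes) {
                param
                    filter : price > 100.0;
            }

            // sink: write the results to another file
            () as Out = FileSink(Expensive) {
                param
                    file   : "expensive.csv";
                    format : csv;
            }
    }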


Operators and toolkits

Operators constitute a higher level of abstraction than procedural programming. With operators, you have a logical separation of processing that lends itself to processing distribution. The operators come from different packages called toolkits. One in particular — the standard toolkit — is always included in any InfoSphere Streams project. It provides all the functions that come with the product. Although the functions may be seen as part of the language, they are separated out because they are defined in the standard toolkit.

A toolkit is a package of capabilities generally associated with a specific problem domain. These can include operators, functions, and data types. The following toolkits are available in InfoSphere Streams 3.1:

  • Standard toolkit— Includes operators such as adapters, relational operations, utilities, and XML
  • Big data— Interface with Hadoop Distributed File System (HDFS) and Data Explorer
  • Complex event processing— Used to define how to process complex events
  • Database— Access to relational databases such as DB2®, Informix, Netezza, Oracle, SQL Server, Teradata, etc.
  • Financial services— Includes a set of operators and functions for financial market processing
  • Geospatial— Provides a set of functions to manipulate geospatial data
  • InfoSphere DataStage® integration— Interface with DataStage
  • Internet— Access to HTTP(S), FTP(S), and RSS
  • Messaging— Communicates with WebSphere® MQ and ActiveMQ
  • Mining— Provides a way to do scoring on data
  • R Project— Interface with the R Project statistical environment
  • Text— Provides text analytics
  • TimeSeries— Facilitates manipulation of time series, that is, data processing based on time

The standard toolkit comes with a set of 36 operators and more than 256 functions. The other toolkits add more than 40 operators and a few dozen functions. With all this pre-built functionality, you can be very productive in creating custom jobs to answer your business needs.
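To call on one of these toolkits, you add it to your project and pull its operators into scope with a use directive. Listing 6 is a hedged sketch using the Internet toolkit's InetSource operator; the namespace follows the 3.x convention, but verify the exact path and parameters in the Information Center, and the URL is hypothetical.

Listing 6. Using a toolkit operator

    // pull an Internet-toolkit operator into scope
    use com.ibm.streams.inet::InetSource;

    composite FeedReader {
        graph
            // fetch the listed URIs and emit their content line by line
            stream<rstring line> Feed = InetSource() {
                param
                    URIList : ["http://example.org/data.txt"];  // hypothetical
            }

            () as Show = Custom(Feed) {
                logic
                    onTuple Feed : printStringLn(line);
            }
    }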

This short description gives you a glimpse at the capabilities provided by InfoSphere Streams. Let's take this one step further and look at how InfoSphere Streams fits into a complete data processing environment.


InfoSphere Streams integration

InfoSphere Streams does not operate in a vacuum. It is a key component of the IBM big data platform. As part of an enterprise integration, InfoSphere Streams can read from and write to different sources and targets, such as files, network connections (TCP, UDP, HTTP, etc.), and message queues (WebSphere MQ and ActiveMQ). InfoSphere Streams also has operators to interface with the financial markets using the Financial Information eXchange (FIX) protocol.
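Listing 7 gives a small illustration of the network adapters: the standard toolkit's TCPSource reads newline-delimited text from a socket. The port number is arbitrary.

Listing 7. Reading from a TCP connection

    composite NetworkReader {
        graph
            // act as a TCP server and read one tuple per line on port 12345
            stream<rstring line> Lines = TCPSource() {
                param
                    role   : server;
                    port   : 12345u;
                    format : line;
            }

            () as Echo = Custom(Lines) {
                logic
                    onTuple Lines : printStringLn(line);
            }
    }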

In addition to these interfaces, InfoSphere Streams works with several IBM products, including InfoSphere BigInsights, Data Explorer, SPSS, and Cognos®. It can also take advantage of the R Project statistical analysis tooling. As InfoSphere Streams evolves, more interfaces are added.

For more information, consult the "Ecosystem" section of the InfoSphere Streams Playbook.


Navigating the Information Center

The Information Center includes information for systems administrators, InfoSphere Streams administrators, and application programmers. This section focuses on the information needs of the application developer.

Figure 3 shows the main sections and subsections of the Information Center. As a developer, you may want to go through all the sections to get familiar with everything related to InfoSphere Streams. To get a sense of the programming environment, look over the Developing section and learn about InfoSphere Streams Studio, then try the tutorial.

Figure 3. Information Center sections

After your initial browsing, you will likely spend most of your time in the Reference section. You need to learn the SPL language, the available functions, the available operators, and how to use them.

Under Reference > Language reference, there are two sections: Annotation Query Language (AQL) and Streams Processing Language (SPL). If you are not planning to use text analytics and the TextExtract operator under the Text toolkit, you can ignore the AQL section for now. When you decide to do text analytics, you will want to use the InfoSphere BigInsights 2.1 Information Center, which includes more information on the topic than the InfoSphere Streams Information Center.

The SPL language section contains all the information you need to get familiar with the SPL language, the use of data types and functions, and how toolkit operators are invoked. Because I've had this problem myself, I'd like to point out one specific section that may go unnoticed or that you may have trouble finding again: Streams Processing Language > Expression language > Expression operator. If you want to know how to write a logical or bitwise expression, this section lists the operators you can use when writing functions or code segments. One thing I did not find obvious in this list is how to concatenate strings. (You can use the + operator for that.)
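For illustration, Listing 8 is a small native SPL function exercising a few of those expression operators, including string concatenation with +. The function and its arguments are invented.

Listing 8. Expression operators in a native SPL function

    // logical, bitwise, conditional, and string operators in one function
    rstring describe(rstring first, rstring last, int32 a, int32 b) {
        int32 masked = a & 15;                  // bitwise AND
        int32 shifted = b << 2;                 // left shift
        boolean bothPositive = a > 0 && b > 0;  // logical AND
        // string concatenation uses the + operator
        return first + " " + last + ": " + (rstring)masked + "/" +
               (rstring)shifted + (bothPositive ? " (positive)" : "");
    }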

Once you are familiar with the language, your next question might be, "Which functions are available to manipulate my data?" Strictly speaking, SPL itself does not include any functions, but any application implicitly includes the standard toolkit. That's where to look for manipulation functions: Toolkit reference > SPL Standard Toolkit > Built-in SPL Functions > FunctionList. This section includes functions for file manipulation, math operations, string manipulation, time manipulation, and XML processing. Utility functions for logging, tracing, assertions, etc. are also included. The easiest way to find information for a function is to search within the page (using Ctrl+F) for the function name or relevant keywords.


Code example

At this point, you should have all the information you need to get started with your InfoSphere Streams programming. Using InfoSphere Streams Studio and its graphical editor, you can easily drag and drop the operators you want to use and connect them together. For more complex processing, consider adding your own code snippets. Some operators will require additional learning time.

One of the quickest ways to learn is by using code examples. You can find many references to examples and samples in the InfoSphere Streams Playbook under "Developer corner." In addition, check out these helpful resources (see Resources):

  • developerWorks articles — InfoSphere Streams articles on the developerWorks site.
  • InfoSphere Streams blog — Entries on all areas of InfoSphere Streams programming, from simple to complex.
  • Streams exchange — Code contributions from the community. Examples include video processing using OpenCV and a toolkit to interface with HBase. Under "Applications," see "Examples for beginners," which I highly recommend as you start using Streams and operators.

Conclusion

InfoSphere Streams is a powerful "platform for real-time analytics on big data." It is a key element of the IBM big data platform. You can quickly become productive by using the tools provided by InfoSphere Streams and take advantage of all the information made available for its use. With all this, you can turn data into actionable insight.

Real-time data processing — perhaps the next big thing in data processing — is being used in many interesting and demanding applications to extract insight and competitive advantage from big data.

Resources

Learn

Get products and technologies

Discuss
