Choosing the right platform for high-performance, cost-effective stream processing applications
In recent years, technologies that enable organizations to capture, ingest, process and analyze large volumes of data at high velocity have become increasingly important.
The current trend can probably be traced back to the rise of social media in the mid-to-late 2000s. Companies like Facebook, Google and Twitter designed and launched technology platforms that now allow millions of users to share data simultaneously and in near-real time. Their success proved that “web-scale” data engineering was an achievable goal, and has encouraged enterprises in more traditional sectors to explore the potential of high-speed big data processing technologies.
Of course, the social media giants were not the first companies to experiment with what we now refer to as stream processing technologies: IBM has been working with government research organizations, universities, and defense and intelligence agencies since at least the early 2000s to build advanced stream processing engines—an effort that resulted in the first release of IBM Streams in 2009.
Nevertheless, today’s increasing mainstream adoption of stream processing owes much to the social media sector demonstrating the art of the possible. There is now a feeling that stream processing is a technology whose time has come, particularly as companies begin to look into the explosion of potential use-cases presented by cognitive computing and the Internet of Things (IoT).
As a result, for many companies today, the question is no longer whether to build applications that can process large streams of data, but how to architect those applications. Many enterprise technology vendors have followed IBM’s lead in developing their own proprietary stream processing engines; and more recently, the open source community has also begun focusing on the problem. In particular, the extensive open source ecosystem around Apache Hadoop has seen a proliferation of projects that purport to solve the problems of streaming data—including Apache Storm, Apache Apex, Apache Samza and Apache Flink, as well as Apache Spark Streaming.
Over the past few years, analysts such as Forrester and Bloor Research have produced extensive comparative studies of the major vendors’ stream processing engines. In Q1 2016, Forrester placed IBM Streams as an overall leader in the category, awarding IBM the top position for both the strength of its current offering, and its strategy and roadmap for the future. Meanwhile, Bloor used a different methodology to draw similar conclusions, naming IBM as a champion for both enterprise stream processing and cloud streaming analytics.
However, what happens when we compare IBM Streams to the newer wave of open source alternatives, which were beyond the scope of the Bloor and Forrester studies? In some cases, research undertaken by IBM clients can shed light on the performance characteristics and likely total cost of ownership.
For example, in 2014, Walmart decided to compare IBM Streams with Apache Apex and Apache Storm in a comprehensive performance testing exercise using an updated version of the Linear Road Benchmark. Running on identical hardware in a Microsoft Azure cloud environment, IBM Streams achieved an “L-rating” of 200—nearly double the performance of Apache Apex (102), and 20 times the performance of Apache Storm (10). The results can be interpreted not only in terms of raw throughput, but also in terms of cost-efficiency: Walmart found that it would need 100 Microsoft Azure servers running Apache Storm to achieve the same results as just five servers running IBM Streams, resulting in a much lower total cost of ownership overall.
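The arithmetic behind these figures can be reproduced with a minimal sketch. The L-ratings and the five-server baseline come from the text above; the function names are illustrative, and the assumption that throughput scales linearly with server count is a simplification made for this example, not a finding of the benchmark.

```python
import math

# L-ratings quoted above for the Walmart Linear Road Benchmark comparison.
l_ratings = {"IBM Streams": 200, "Apache Apex": 102, "Apache Storm": 10}


def relative_performance(ratings, baseline):
    """Return each engine's L-rating as a multiple of the baseline engine's rating."""
    base = ratings[baseline]
    return {name: rating / base for name, rating in ratings.items()}


def servers_needed(ratings, reference, reference_servers):
    """Estimate servers each engine needs to match the reference deployment.

    Assumes throughput scales linearly with server count (an assumption
    made for illustration only).
    """
    ref = ratings[reference]
    return {
        name: math.ceil(reference_servers * ref / rating)
        for name, rating in ratings.items()
    }


speedups = relative_performance(l_ratings, "Apache Storm")
servers = servers_needed(l_ratings, "IBM Streams", reference_servers=5)
print(speedups)
print(servers)
```

Under the linear-scaling assumption, the sketch reproduces the ratio quoted in the text: 100 servers running Apache Storm for every five running IBM Streams. The corresponding Apache Apex estimate it prints is derived here, not reported by the study.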
It would be interesting to run a similar comparison of IBM Streams against the more recent challengers in the open-source stream processing field: Apache Flink, Samza, and Spark Streaming. To date, however, robust head-to-head benchmark results are hard to find. For example, in a 2016 Master’s thesis, Yangjun Wang of Aalto University reviews the existing literature on streaming benchmarks, but the only mention of IBM Streams is a 2014 study that once again pitted the IBM solution against Apache Storm.
Wang’s own benchmark compares Storm with Spark Streaming and Flink, concluding that Flink is nearly 30 times faster than Storm in terms of maximum throughput, while Spark Streaming is around 370 times faster. However, both solutions show significant trade-offs in terms of latency compared to Storm: median latency is 5.8 times higher for Flink, and 46.9 times higher for Spark Streaming. These results are interesting, but the lack of a direct comparison with IBM Streams limits our ability to draw any firm conclusions about relative performance.
Nevertheless, if we put direct performance comparisons to one side, and look instead at some of the qualitative differences between IBM Streams and the open-source alternatives, we can start to get a more rounded picture of how the solutions might stack up in a real-world deployment.
From its initial launch in 2009, IBM Streams has been designed for enterprise deployment, and includes a comprehensive set of accelerators and templates to help businesses get up and running quickly. For example, IBM can provide an accelerator to help financial markets firms recognize patterns in high-volume trading, and another to help telecom companies process and analyze millions of call detail records per second. Streams is also designed to integrate easily with an enterprise’s existing data and analytics environments, such as data warehousing and big data platforms.
By contrast, the open-source solutions were mostly not originally intended for general-purpose enterprise use. Some were initially developed to solve a specific business problem for an individual company: for example, Apache Storm was originally championed by Twitter to solve the problem of ingesting thousands of new Tweets per second, and was only subsequently open-sourced. Others arose from academic research, such as the Stratosphere project, which ultimately led to the creation of Apache Flink.
As a consequence of this specialist lineage, these solutions can demand higher skill levels from their users. They are “written by developers, for developers”, and often expect users to hand-code a solution that pieces together many different open-source components, rather than offering a single, coherent, well-tested platform that works out of the box.
IBM Streams takes the opposite approach, providing a “batteries-included” toolkit that increases development efficiency and keeps the barrier to entry as low as possible. For example, instead of requiring users to write low-level code to compose stream processing pipelines, IBM Streams offers Streams Studio, an integrated development environment (IDE) for the high-level Streams Processing Language (SPL) that minimizes boilerplate code and even allows drag-and-drop composition of applications, operators and functions.
Moreover, SPL is not the only option for developers: IBM Streams also supports enterprise languages like C/C++ and Java, as well as Scala and Python, which are today’s languages of choice for the data science community. In particular, Python support allows data scientists to leverage a huge range of popular scientific computing and machine learning libraries, build sophisticated applications, and quickly deploy them into production on IBM Streams.
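To make the idea of composing a pipeline from operators concrete, here is a minimal sketch in plain Python. It is not the IBM Streams Python API and not SPL: the operator names (`source`, `drop_invalid`, `moving_average`) and the sample readings are hypothetical, chosen only to show how a source, a filter and a windowed aggregate chain together.

```python
from collections import deque
from typing import Iterable, Iterator


def source(readings: Iterable[float]) -> Iterator[float]:
    """Source operator: emit readings one tuple at a time."""
    for reading in readings:
        yield reading


def drop_invalid(stream: Iterator[float], low: float, high: float) -> Iterator[float]:
    """Filter operator: pass through only readings inside [low, high]."""
    for value in stream:
        if low <= value <= high:
            yield value


def moving_average(stream: Iterator[float], window: int) -> Iterator[float]:
    """Aggregate operator: average over a sliding window of the last `window` values."""
    buffer = deque(maxlen=window)
    for value in stream:
        buffer.append(value)
        yield sum(buffer) / len(buffer)


# Hypothetical sensor readings; -999.0 stands in for a faulty measurement.
readings = [20.0, 22.0, -999.0, 24.0, 26.0]

# Compose the pipeline: source -> filter -> windowed aggregate.
pipeline = moving_average(drop_invalid(source(readings), 0.0, 100.0), window=2)
result = list(pipeline)
print(result)
```

Each operator consumes and produces a stream lazily, so tuples flow through the whole chain one at a time. A production engine adds what this sketch leaves out, such as parallelism, fault tolerance and distribution across servers.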
In an increasingly fast-moving business environment, where speed-to-market is becoming more important than application performance, developer productivity is critical, so the availability of good tooling and language support really matters. The faster your developers can get started, and the gentler the learning curve, the better the chances of shipping an application ahead of your competitors. IBM Streams aligns completely with this philosophy, making it quick and easy for development teams to build applications, even if they are not stream processing experts.
As an example, developing and deploying the IBM Streams implementation of the Walmart Linear Road Benchmark required only one developer, and was completed in just 14.5 days, including three days of unit testing and tuning. The high-quality development tools, combined with robust support for parallelized deployment and automatic scaling, made this rapid time-to-market possible.
In conclusion, stream processing is becoming an increasingly important topic for enterprise IT teams, especially with the rise of event-driven architectures and IoT applications. In response to this growing demand, a large number of alternative solutions have been developed, both by traditional enterprise technology vendors and by the open source community.
Evaluating all the options can be a daunting prospect, and no truly comprehensive set of benchmarks currently exists to assess all the available solutions on performance alone. That said, leading industry analysts such as Forrester and Bloor Research rate IBM Streams as a front-runner among the proprietary solutions, and existing head-to-head comparisons such as the Walmart Linear Road Benchmark show positive results versus both Apache Storm and Apache Apex.
Taking a broader view, the all-round package provided by IBM Streams is designed to help enterprises get the most out of the solution quickly, and achieve a robust, enterprise-ready deployment with minimal time and effort.
To learn more about IBM Streams or to take a deeper dive into tutorials and documentation, visit the IBM Streams webpage.