Advanced analytics was promoted to the boardroom. According to research by Andrew McAfee and Erik Brynjolfsson, of MIT, companies that inject big data and analytics into their operations show productivity rates and profitability that are 5% to 6% higher than their peers (see Resources for a link).
The implementation of a successful advanced analytics solution remains an art with many products and competing requirements.
The purpose of this series is to help you understand the need for an advanced analytics platform within the enterprise and how to design such a platform. A number of use cases and requirements introduce the Advanced Analytics Platform (AAP), an architecture for performing advanced analytics on big data. The series also introduces a number of design patterns. Each pattern provides a complete solution for a business problem, yet the patterns can be combined to form an end-to-end solution for an entire business enterprise.
AAP gradually evolved through a series of steps, each seeking solutions for a tangible mission critical business use case. Each of these experiments fire-tested the architecture on non-functional requirements and required a unique combination of products. The first two articles provide the motivation behind AAP and establish the overall definitions, including drivers, use cases, and architecture components. The next four articles provide details of architecture patterns, with each article covering a pattern and providing sample use cases, key architecture components, architectural and technical considerations, and sample implementation details. The last two articles cover common issues around governance and infrastructure design.
Big data feeds advanced analytics
The primary driver for an advanced analytics platform is the widespread availability of big data. Due to a combination of automation, consumer involvement, and market-based exchanges, big data is becoming readily available and is being deployed in a series of vital use cases that are radically disrupting markets (see Arvind Sathi's comments on big data analytics and disruptive technologies in Resources).
Examples of big data include:
- Social media text: A large variety of data that includes structured data, text, and media content is found at various social media sites. This data contains information that any enterprise can mine for valuable insights.
- Cell phone details: Currently, over 5 billion cell phones provide useful information such as user location, how the device is being used, and issues that the device might have.
- Channel click information from set-top boxes: The interaction of users with set-top boxes provides valuable information on their interests and can improve the user experience with media content.
- Transactions: A number of devices, such as credit cards, mobile wallets, and others, facilitate cashless buying and selling of goods and services and record these transactions for historical analysis.
- Web browsing and search: Most websites store user browsing and search habits in web logs but few of them analyze the data to better understand the site users and how to improve the site content.
- Documentation: Documentation such as statements, insurance forms, medical records, and customer correspondence can be parsed to extract valuable information for further analysis.
- Internet of things: The internet of things is generating large volumes and variety of data from various sources that include ebooks, vehicles, video games, cable boxes, and household appliances. Companies can gain valuable insights when they capture, correlate, and analyze this data.
- Communications network events: The network is greatly affected as systems become interconnected. Interconnection results in the need to monitor large volumes of data and respond quickly to changes.
- Call detail records (CDRs): Analysis of CDRs helps you understand various habits of your customer and their social network so you can serve the customer better.
- Radio Frequency Identification (RFID) tags: RFID tags are ubiquitous, yet the valuable data that they generate is often ignored and not analyzed.
- Traffic patterns: On-road sensors, video cameras, and floating-car data are recent techniques to study traffic patterns. To prevent traffic congestion, this data must be analyzed rapidly.
- Weather data: Weather data is now correlated to various other large sources of data such as sales, marketing, and product data so companies can market their products more effectively and cut costs.
Why is big data different from any other data that you dealt with in the past? There are "four Vs" that characterize this data: Volume, Velocity, Variety, and Veracity.
Volume: Most organizations were already struggling with the increasing size of their databases as the big data tsunami reached the data stores. According to Fortune magazine, humans created 5 exabytes of digital data in recorded time until 2003. In 2011, the same amount of data was created in two days. By 2013, that time period is expected to shrink to just 10 minutes. (See the Fortune article about data in Resources.)
A decade ago, organizations typically counted their data storage for analytics infrastructure in terabytes. They now use applications that require storage in petabytes. This data is straining the analytics infrastructure in a number of industries. For a communications service provider (CSP) with 100 million customers, the daily location data can amount to about 50 terabytes, which, if stored for 100 days, occupies about 5 petabytes. For a 100-million-subscriber CSP, the CDRs might easily exceed 5 billion records a day. As of 2010, AT&T had 193 trillion CDRs in its database. (Find more about the world's 10 largest databases in Resources).
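The storage figures in this example follow from simple arithmetic. The short Python sketch below reproduces them under one stated assumption: decimal units, so that 1 petabyte equals 1,000 terabytes. The variable names are illustrative only.

```python
# Back-of-the-envelope sizing for the CSP example in the text.
# Assumption: decimal units, so 1 petabyte = 1,000 terabytes.
TB_PER_PB = 1_000
KB_PER_TB = 1_000_000_000

daily_location_tb = 50        # daily location data, from the text
retention_days = 100          # retention period, from the text
subscribers = 100_000_000     # 100 million customers, from the text
cdrs_per_day = 5_000_000_000  # 5 billion CDRs a day, from the text

# Total location data retained over the period.
stored_pb = daily_location_tb * retention_days / TB_PER_PB

# Per-subscriber daily averages implied by the same figures.
cdrs_per_subscriber = cdrs_per_day / subscribers
location_kb_per_subscriber = daily_location_tb * KB_PER_TB / subscribers

print(f"Location data retained: {stored_pb:g} PB")
print(f"CDRs per subscriber per day: {cdrs_per_subscriber:g}")
print(f"Location data per subscriber per day: {location_kb_per_subscriber:g} KB")
```

Seen per subscriber, the daily volumes are modest. The petabyte scale comes from multiplying them by the subscriber base and the retention period.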
Most organizations are discarding big data because they lack the capacity to store and analyze large volumes of data.
Velocity: Velocity has two aspects: data throughput and latency. Throughput represents the data that moves in the pipes. The amount of global mobile data is growing at a compound growth rate of 78 percent and is expected to reach 10.8 exabytes per month in 2016 as consumers share more pictures and videos (see the Statshot item in Resources).
To analyze this data, the corporate analytics infrastructure is seeking bigger pipes and massively parallel processing.
Latency is the other measure of velocity. Analytics used to be a "store and report" environment in which reports typically contained data as of yesterday, popularly represented as "D-1." Now, analytics is increasingly embedded in business processes that use data-in-motion with reduced latency. For example, Turn (www.turn.com) conducts its analytics in 10 milliseconds to place advertisements in online advertising platforms. (See the article by Kate Maddox in Resources.)
Variety: In the 1990s, as data warehouse technology was rapidly introduced, the initial push was to create meta-models to represent all the data in one standard format. The data was compiled from various sources and transformed using ETL (Extract, Transform, Load) or ELT (Extract the data and Load it in the warehouse, then Transform it inside the warehouse). The basic premise was narrow variety and structured content. Big data expanded horizons, enabled by new data integration and analytics technologies. A number of call center analytics solutions seek analysis of call center conversations and their correlation with emails, trouble tickets, and social media blogs. The source data includes unstructured text, sound, and video in addition to structured data. A number of other related applications are gathering data from emails, documents, or blogs. An example of enabling technology is the IBM® InfoSphere® Streams platform. InfoSphere Streams deals with various sources for real-time analytics and decision making, including medical instruments for neonatal analysis, seismic data, CDRs, network events, RFID tags, traffic patterns, weather data, mainframe logs, voice in many languages, and video.
Veracity: Unlike carefully governed internal data, most big data comes from sources outside your control and therefore suffers from significant correctness or accuracy problems. Veracity represents both the credibility of the data source and the suitability of the data for the target audience. Start with source credibility. Suppose that an organization collects product information from third parties and offers it to their contact center employees to support customer queries. The collected data must be screened for source accuracy and credibility. Otherwise, the contact centers might recommend competitive offers that marginalize offerings and reduce revenue opportunities. Some social media responses to campaigns might come from a few disgruntled past employees or persons that are employed by the competition to post negative comments. For example, internet users assume that "like" on a product signifies satisfied customers. What if a third party placed the "like" for a fee? (See Ben Grubb's article about purchased "likes" in Resources.)
You must also think about audience suitability and how much truth can be shared with a specific audience. The veracity of data that is created within an organization can be assumed to be at least well intentioned. However, some of the internal data might not be available for wider communication. For example, if customer service provided inputs to engineering on product shortcomings as seen at the customer touch points, this data should be shared selectively, on a need-to-know basis. Other data might be shared only with customers who have valid contracts or other prerequisites.
These Vs place significant strains on the current analytics solutions as those solutions were designed for carefully orchestrated, structured, low volume data at "D-1" latency.
The need for an advanced analytics platform
IT organizations in all major corporations face several important architecture decisions.
- First, an existing infrastructure, with a large body of professionals who care for and feed the current analytics platform, is severely constrained by the growing demand for the four Vs and faces the big data tsunami. Continued investment in the current infrastructure to meet future demand is next to impossible.
- Second, as market forces seek new ways to create analytics-driven organizations, they are forcing massive changes in how you deal with marketing, sales, operations, and revenue management. Intelligent consumers, greenfield competitors, and smart suppliers are forcing organizations to rapidly bring advanced analytics to all major business processes.
- Third, the new Massively Parallel Processing (MPP) platforms, the open source technologies, and cloud-based deployment are rapidly changing how new architecture components are developed, integrated, and deployed.
AAP grew under these architecture demands with the following propositions:
- It integrates with the current analytics architecture and caps it at the mature functions, which continue to require the current warehouses and structured reporting environments. These mature functions include financial reporting, operational management, human resources, and compliance reporting. Most organizations have mature data flows, analytics solutions, and support environments. These environments will gradually change, but a radical change takes time and investment, and might not result in the biggest payback.
- It overlays a big data architecture that shares critical reference data with the current environment, and provides the necessary extensions to deal with semi-structured and unstructured data. It also facilitates complex discoveries, predictive modeling, and engines to carry the decisions driven by the insight created through advanced analytics.
- It adds a necessary real-time streaming layer, which is adaptive using discovery and predictive modeling components, and offers decision-making in seconds or milliseconds as needed for business execution.
- It uses a series of APIs to open up the data and analytics to external parties — business partners, customers, and suppliers.
AAP Architecture and Components
AAP includes the following architecture components:
Stream processing: This MPP real-time analytics component processes streaming data. Take the example of intelligent campaigns using location data. Mobile devices generate billions of transactions a day and require campaign execution as the consumer hangs around a store. The stream processing is conducted with three subcomponents:
- Capabilities to sense, identify, and align incoming streaming data. These capabilities align incoming transaction data to known customers or events and join transactions to identify the context and user actions.
- Capabilities to categorize, count, and focus. These capabilities provide important real-time attributions to the source data, for example, "more than two dropped calls in an hour" or "a user who clicks an advertisement and goes to the advertiser's site to shop." These functions use a set of dynamic parameters, which are constantly updated based on deep analytics on historical data. For example, the historical data might show that the probability of customer churn increases after two dropped calls.
- Capabilities to score and decide. A set of scoring models might be promoted through predictive modeling that uses historical data. These models can be scored in real time using streaming data and used for decision-making. In addition, complex decision trees or other rule-based strategies can be executed with runtime engines that take their rules from a Business Rule Management System (BRMS).
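The three subcomponents can be sketched in a few lines of Python. This is a minimal, single-process illustration, not an MPP implementation. The device IDs, customer names, event fields, and the "retention_offer" action are hypothetical, and a simple threshold rule stands in for a scoring model. The threshold itself comes from the "more than two dropped calls in an hour" example.

```python
from collections import defaultdict, deque

# Hypothetical reference data: device IDs mapped to known customers.
KNOWN_CUSTOMERS = {"dev-1": "alice", "dev-2": "bob"}

WINDOW_SECONDS = 3600        # "in an hour"
DROPPED_CALL_THRESHOLD = 2   # dynamic parameter; deep analytics would update it

# Per-customer timestamps of dropped calls inside the sliding window.
dropped_calls = defaultdict(deque)


def process_event(event):
    """Return a decision for one streaming event, or None if no action is needed."""
    # 1. Sense, identify, align: tie the event to a known customer.
    customer = KNOWN_CUSTOMERS.get(event["device_id"])
    if customer is None or event["type"] != "dropped_call":
        return None

    # 2. Categorize, count, focus: keep only dropped calls from the last hour.
    window = dropped_calls[customer]
    window.append(event["ts"])
    while window and window[0] <= event["ts"] - WINDOW_SECONDS:
        window.popleft()

    # 3. Score and decide: a simple rule stands in for a scoring model.
    if len(window) > DROPPED_CALL_THRESHOLD:
        return {"customer": customer, "action": "retention_offer"}
    return None


events = [
    {"device_id": "dev-1", "type": "dropped_call", "ts": 0},
    {"device_id": "dev-1", "type": "dropped_call", "ts": 600},
    {"device_id": "dev-9", "type": "dropped_call", "ts": 700},   # unknown device
    {"device_id": "dev-1", "type": "dropped_call", "ts": 1200},
    {"device_id": "dev-2", "type": "dropped_call", "ts": 1300},
]
decisions = [d for d in (process_event(e) for e in events) if d]
print(decisions)
```

Only the third dropped call for the same known customer within the hour triggers a decision. Events from the unknown device are ignored at the alignment step.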
Predictive decision modeling: A statistical modeling engine creates a series of models by analyzing historical data. It then deploys these models, tracks model success and failure in predictive outcomes, and replaces them with better ones. A model creation component provides the capability of developing predictive models using historical data. These models can be executed in batch or through streaming data, in real time. The results of model execution are used by an outcome optimization component to compare predictive models and choose the most successful ones. At any time, the predictive modeler can create hundreds of predictive models, constantly testing them with real processes and optimizing across those models to provide the optimal results.
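As a rough illustration of outcome optimization, the following Python sketch tracks how often each deployed model predicts the observed outcome and then selects the most successful one. The two threshold rules, their names, and the sample observations are hypothetical stand-ins. A real modeling engine would fit statistical models on historical data instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelRecord:
    """Tracks how often a deployed model's predictions match real outcomes."""
    name: str
    predict: Callable[[dict], bool]
    hits: int = 0
    trials: int = 0

    def record(self, features, actual):
        # Compare the prediction with the observed outcome.
        self.trials += 1
        self.hits += int(self.predict(features) == actual)

    @property
    def accuracy(self):
        return self.hits / self.trials if self.trials else 0.0


# Hypothetical competing churn models (simple rules, not fitted models).
models = [
    ModelRecord("two_drops", lambda f: f["dropped_calls"] >= 2),
    ModelRecord("four_drops", lambda f: f["dropped_calls"] >= 4),
]

# Hypothetical observations: (features, did the customer churn?).
observed = [
    ({"dropped_calls": 0}, False),
    ({"dropped_calls": 2}, True),
    ({"dropped_calls": 3}, True),
    ({"dropped_calls": 5}, True),
    ({"dropped_calls": 1}, False),
]

for features, churned in observed:
    for model in models:
        model.record(features, churned)

# Outcome optimization: keep the model with the best track record.
champion = max(models, key=lambda m: m.accuracy)
print(champion.name, champion.accuracy)
```

Because every model is scored against the same observed outcomes, a weaker model can be replaced as soon as a better one shows a stronger track record.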
Analytics engine: An MPP data warehouse (in the analytics engine) can also run advanced queries so you can perform all the predictive modeling and visualization functions in the engine. The stored data is typically too large to ship to external tools for predictive modeling or visualization. The engine performs these functions based on commands that are given by predictive modeling and visualization tools. These commands are typically translated into native functions (for example, SQL commands), which are executed in a specialized MPP hardware environment to deal with high volume data. Analytics engines carry typical functions for ELT (organization of ingested data using transformations), execution of predictive models and reports, and for any other data crunching jobs (for example, geospatial analysis).
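To illustrate the in-engine approach, the following minimal sketch loads raw records, transforms them with SQL inside the engine (ELT), and returns only a small aggregate to the caller. It assumes an in-memory SQLite database as a stand-in for an MPP warehouse. The table names, column names, and sample rows are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MPP warehouse.
conn = sqlite3.connect(":memory:")

# Extract and load: raw call records land in a staging table untransformed.
conn.execute("CREATE TABLE raw_calls (customer_id TEXT, duration_s TEXT, status TEXT)")
raw_rows = [
    ("c1", "120", "ok"),
    ("c1", "5", "DROPPED"),
    ("c2", "300", "ok"),
    ("c1", "8", "dropped"),
    ("c2", "15", "Dropped"),
]
conn.executemany("INSERT INTO raw_calls VALUES (?, ?, ?)", raw_rows)

# Transform: done inside the engine with SQL, so the rows never leave it.
conn.execute(
    """
    CREATE TABLE calls AS
    SELECT customer_id,
           CAST(duration_s AS INTEGER) AS duration_s,
           LOWER(status) = 'dropped' AS dropped
    FROM raw_calls
    """
)

# Analytics: the tool sends a query and only the small aggregate comes back.
summary = conn.execute(
    """
    SELECT customer_id, COUNT(*) AS calls, SUM(dropped) AS dropped_calls
    FROM calls
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

print(summary)
```

The design point is that the data stays where it is stored. The client issues commands and receives a result set that is far smaller than the underlying tables.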
Discovery: These tools take data with high variety and look for qualitative or quantitative patterns. Discovery tools include general or specialized search across unstructured data. They carry specialized tools for machine learning for pattern recognition and can carry ontologies of different domains to make their task intelligent. With the explosion of big data, these tools advanced significantly in the qualitative analysis of unstructured data, such as social media blogs. The results of the analysis can include quantification of data, which is transferred to the analytics engine for further analysis, processing, or use. It might also result in qualitative reporting, such as semantic mapping and geospatial visualization.
Visualization: The analyst community uses the visualization component to render the results of analytics in various forms. The component includes structured reports, dashboards, geospatial or semantic display of information, and simulation. Visualization techniques offer significant interaction so that analysts can break the data into smaller pieces based on predetermined or ad hoc parameters.
Hadoop Distributed File System/General Parallel File System: Big data has seen the evolution of the Hadoop Distributed File System (HDFS), a pointerless data storage mechanism that is open-sourced and supported by several information technology providers, including IBM. The value proposition of HDFS is its ability to use MPP architecture to store the data redundantly in a number of commodity processors and use the parallel framework to execute complex queries, including discovery on unstructured data, in a highly efficient way. HDFS offers a serious alternative to the structured data warehouses for storage, retrieval, and analysis of big data. IBM is working on a high-performance file management platform called GPFS, which is an alternative to HDFS.
Data integration and governance: The architecture offers a number of specialized data stores for real-time, unstructured and structured data. An integrated set of tools is needed for data integration across this diverse architecture and to perform governance on key subject areas. The standard functions of master data management, information lifecycle management, data privacy management, and data quality must be augmented to deal with a hybrid architecture.
The diagram in Figure 1 provides a high-level architecture of how the components come together in an advanced analytics platform. Details on the components will be provided in future articles.
Figure 1. Advanced Analytics Platform
What makes these components different?
When you deal with big data, certain terms are defined differently than you might expect. This section provides some examples and the rest of the articles in this series will use the terms as defined here.
Reporting versus insight: Many people believe that reports are the key mechanism for gaining insight into the data. Reporting is typically the first task for an analytics system, but it is definitely not the last. You often build on reporting with various forms of visualization, including geospatial overlays and new semantic models. Doing so helps you gain insight that leads to new abstracted data. These insights can be broad, ranging from mobility patterns to micro segments. As you gain insight, you contribute previously unseen patterns through discovery. This pattern discovery, which leads to deep insights, is core to effectively using big data to transform the enterprise.
Sources of data and data integration: Merely having data does not mean that you can start applying analytic tools to it. You often must extract, transform, and load (ETL) the data, cleansing it along the way, before you can effectively apply the analytic tools. Beyond ETL, it is important to integrate multiple data sources so that the analytics tools can identify key patterns. This integration is especially important given the wide variety of data sources available today. Departments create new intra-department data every day, including sensor, networking, and transaction data that affects the department. The enterprise creates data such as billing, customer, and marketing data, which is essential for the enterprise to operate effectively. Third-party data, often sourced from social media or purchased from third-party providers, also becomes critical. These various sources of data, which are often difficult to correlate, must be integrated to gain insights that are not currently possible.
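The following Python sketch shows, in miniature, why integration matters: a pattern that no single source reveals appears only after department, enterprise, and third-party data are joined. All records, field names, and the handle-to-customer mapping are hypothetical. In practice, resolving a social media identity to a customer is itself a hard problem that this sketch simply assumes is solved.

```python
# Hypothetical extracts from the three kinds of sources named in the text.
network_events = [  # department data
    {"customer_id": "c1", "dropped_calls": 3},
    {"customer_id": "c2", "dropped_calls": 0},
]
billing = {  # enterprise data
    "c1": {"monthly_bill": 80.0},
    "c2": {"monthly_bill": 45.0},
}
social_mentions = [  # third-party data, keyed by a social handle
    {"handle": "@alice", "sentiment": "negative"},
]
handle_to_customer = {"@alice": "c1"}  # assumed identity-resolution table


def build_profiles(network_events, billing, social_mentions, handle_to_customer):
    """Join the three sources into one profile per customer."""
    profiles = {}
    for event in network_events:
        cid = event["customer_id"]
        profiles[cid] = {
            "dropped_calls": event["dropped_calls"],
            "monthly_bill": billing.get(cid, {}).get("monthly_bill"),
            "sentiment": None,
        }
    for mention in social_mentions:
        cid = handle_to_customer.get(mention["handle"])
        if cid in profiles:
            profiles[cid]["sentiment"] = mention["sentiment"]
    return profiles


profiles = build_profiles(network_events, billing, social_mentions, handle_to_customer)

# A pattern visible only in the integrated view: network trouble plus
# negative public sentiment for the same customer.
at_risk = [
    cid for cid, p in profiles.items()
    if p["dropped_calls"] > 2 and p["sentiment"] == "negative"
]
print(at_risk)
```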
Latency and historical analytics tradeoffs: The latency that is associated with the data can have a huge impact on how you analyze the data and how you respond to the insights gained from the analytics. The perception often is that when you increase the data gathering speed or fine-tune the hardware and software, you can move from historical analytics to real time. However, historical analytics often cannot be performed in real time for a variety of reasons, including the lack of access to critical data in a synchronized manner at the right time, tools that cannot perform analytics in real time, and required dynamic model changes that are not part of existing tools for historical data. This is partly because real-time analytics introduces extra complications, such as the need for logic and models to change dynamically as new insights are discovered. In addition, real-time analytics can be more expensive than historical analytics, so you must consider return on investment to justify the additional expense.
Veracity and data governance: As mentioned earlier, veracity represents both the credibility of the data source and the suitability of the data for the target audience. Governance deals with issues such as how to cleanse and use the data, how to ensure data security while still enabling users to gain valuable insight from the data, and how to determine the source of truth when multiple sources supply the same data. In most environments, data is a mixture of clean trusted data and dirty untrustworthy data. One key challenge is how to apply governance in such an environment.
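One simple way to pick a source of truth is to rank sources by trust and take each field from the most trusted source that supplies a value. The Python sketch below illustrates that idea only. The source names, the ranking, and the sample records are hypothetical, and real governance tools apply much richer rules.

```python
# Hypothetical trust ranking: a lower number means a more trusted source.
SOURCE_TRUST = {"billing_system": 0, "crm": 1, "social_media": 2}


def source_of_truth(records, field):
    """Return (value, source) for a field from the most trusted source that has it."""
    candidates = [r for r in records if r.get(field)]
    if not candidates:
        return None, None
    # Unknown sources rank below every known source.
    best = min(
        candidates,
        key=lambda r: SOURCE_TRUST.get(r["source"], len(SOURCE_TRUST)),
    )
    return best[field], best["source"]


# Hypothetical records about the same customer from three sources.
records = [
    {"source": "social_media", "email": "al1ce@example.com", "city": "Austin"},
    {"source": "crm", "email": "alice@example.com", "city": ""},
    {"source": "billing_system", "email": "", "city": ""},
]

email = source_of_truth(records, "email")
city = source_of_truth(records, "city")
print(email)
print(city)
```

Returning the source alongside the value keeps the lineage visible, so a consumer can tell when a field came from a less trusted source because nothing better was available.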
This article provided a set of market drivers for big data and the emergence of the Advanced Analytics Platform in response to the market drivers. Big data led to significant requirements in scaling the analytics architectures to cover high velocity, high volume, high variety as well as high veracity. No single tool can deal with these diverse requirements. As a result, a number of hybrid architectures are evolving. AAP grew through field experiments and provides the necessary components to deal with big data. A brief description of the platform and its components, including stream processing, predictive modeling, analytics engine, discovery, HDFS, visualization, data integration, and governance was provided.
In the next article, we will illustrate the use of the architecture through a number of use cases. These use cases are based on fielded systems. We will stress the nonfunctional requirements and how the architecture is able to meet these requirements.
Resources
- Big Data, The Management Revolution (Andrew McAfee and Erik Brynjolfsson, Harvard Business Review, October 2012): Read more about the importance of analyzing big data to business success.
- Big Data Analytics – Disruptive Technologies for Changing the Game (Arvind Sathi, MC Press, October 2012): In this practitioner's view of big data analytics, discover how big data changes analytics architecture.
- What Data Says About Us (Fortune, 24 September 2012, page 163): Read more about the world that can be measured.
- Top 10 Largest Databases in the World (Compare Business Products, 17 March 2010): Discover which organizations have the biggest collections of data.
- Statshot: How Mobile Data Traffic Will Grow by 2016 (Om Malik, gigaom, 23 August 2012): See how the growth of connected devices pushes the demand for mobile data.
- Turn Ad Inspired by 'Mad Men' (Kate Maddox, 16 July 2012): Read about a cloud-computing ad that mimics an award-winning advertising drama.
- Can't Buy Love Online? 'Likes' for Sale (Ben Grubb, stuff.co.nz, 24 August 2012): Find out which is better: a genuine 'Like' or a purchased 'Like' that is placed by a third party.
- The developerWorks Business analytics topic: Find how-to information, tools, and updates to help you improve outcome and control risk.
- The developerWorks Big data content area: Learn more about big data. Find technical documentation, how-to articles, education, downloads, product information, and more.