Analytics are at the core of any enterprise big data deployment. Relational databases are still the best technology for running transactional applications — certainly crucial for most enterprises — but when it comes to data analysis, relational databases are showing signs of stress. Enterprise adoption of Apache Hadoop (or Hadoop-like big data systems) reflects a focus on performing analytics rather than on merely storing transactions.
To implement Hadoop or Hadoop-like systems with analytics successfully, an enterprise must address a set of readiness issues that fit into four categories:
- Security: Preventing data theft and controlling access
- Support: Documentation and consulting
- Analytics: The minimum analytics feature set that enterprises require
- Integration: Integration with legacy or third-party products for the purpose of data migration or data exchange
Using these four categories as bases for comparison, this article makes a case for enterprise adoption of commercial Hadoop products, such as InfoSphere BigInsights for big data analytics, rather than of open source "vanilla" Hadoop installations.
Preventing data theft and controlling access
Security issues are a common concern in Hadoop deployments. By design, Hadoop stores and processes unstructured data that originates from multiple sources. Access control, data entitlement, and ownership problems can result. IT managers need to control access to the data entering the system and the data going out. The fact that Hadoop (or Hadoop-like) environments include data with various classifications and sensitivity levels can exacerbate access-control problems. The ultimate risks are data theft and inappropriate data access or disclosure.
Data theft is an endemic problem at the enterprise level. Attacks on corporate IT systems are common. These issues have been addressed in traditional relational systems. But implementing solutions for big data systems is a different matter, given the new set of technologies in play. By default, most big data systems do not encrypt data at rest, an issue that must be addressed as a first step. Cluster administration is also required for the clusters that hold the data. Again, relational systems have overcome similar problems. But given that cluster-administration tools aren't yet available for Hadoop-like systems, unwanted direct access to data files or data-node processes is possible.
Furthermore, the merging of multiple datasets for analysis creates a new dataset that can require separate access controls. Roles that were applied to the data sources must now be defined for the combination of data sources. Clear boundaries must be defined for roles on either a technical or a functional basis. Neither option is perfect. Establishing roles on a functional basis can enable snooping into data here and there, but it's easier for administrators to implement when the datasets are merged. Using a technical basis can secure the original data nodes but can create access problems when the nodes are merged. The built-in access-control and security features in Hadoop Distributed File System (HDFS) cannot address this dilemma. Some companies that use Hadoop are building new environments that store merged datasets, or they are protecting access to the merged data through customized firewalls.
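To make the merged-dataset problem concrete, here is a minimal sketch of one conservative rule an administrator might apply. The rule is an assumption for illustration: the merged dataset inherits the highest sensitivity of its sources, and only roles cleared for every source may read it. It is not a feature of HDFS or of any product named in this article. The `Dataset` class, the numeric sensitivity scale, and the role names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    sensitivity: int          # hypothetical scale: higher means more sensitive
    allowed_roles: frozenset  # roles permitted to read this dataset


def merge_policy(name, sources):
    """Derive a conservative access policy for a dataset built from several sources."""
    if not sources:
        raise ValueError("at least one source dataset is required")
    # Treat the merged data as being as sensitive as its most sensitive input.
    sensitivity = max(source.sensitivity for source in sources)
    # Only roles that were cleared for every source may read the combination.
    roles = frozenset.intersection(*(source.allowed_roles for source in sources))
    return Dataset(name, sensitivity, roles)


def can_read(role, dataset):
    """Return True if the given role may read the dataset."""
    return role in dataset.allowed_roles


clickstream = Dataset("clickstream", 1, frozenset({"analyst", "marketing", "auditor"}))
payments = Dataset("payments", 3, frozenset({"analyst", "finance", "auditor"}))
merged = merge_policy("clicks_with_payments", [clickstream, payments])

print(merged.sensitivity, sorted(merged.allowed_roles))
```

Under this rule the marketing and finance roles each lose access to the combination, which illustrates the access problems described above: a policy that protects the sources can lock out users who legitimately need the merged view.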
Products such as InfoSphere Guardium® Data Security (see Resources) can come to the rescue to ensure data security in Hadoop-based systems. InfoSphere Guardium Data Security automates the entire compliance-auditing process in heterogeneous environments through features such as auto-discovery of sensitive data, automated compliance reporting, and data-level access control.
Documentation and consulting
Lack of documentation is another common enterprise concern. Roles and specifications change, and consultants and employees leave. Unless roles and specifications are well-documented, many efforts must start from square one when a change occurs. This is a major issue with open source Apache Hadoop. In contrast, structured Hadoop-based products designed for enterprises, such as IBM InfoSphere BigInsights, can resolve this by providing structured documentation and enterprise-level support. BigInsights adds these advantages to the fact that every development designed for the open source version of Hadoop works with BigInsights, too — because BigInsights is built on Apache Hadoop.
By deploying a product such as InfoSphere BigInsights, an enterprise gains the benefit of the external support provided. For business reasons, large enterprises usually keep a support team only for core IT functions. Complex deployments are almost impossible for such teams to carry out, given their level of technical expertise. Some small companies specialize in and succeed with helping larger companies carry out complex Hadoop deployments. But small companies can't be relied upon for long-term support because they might not exist for the long term.
The structured consulting and support that a major vendor provides address these issues. A standard Hadoop version can be deployed, tracked, and supported to meet enterprise needs and expectations. External consultants can also assume the roles of full-time employees — but with the right skill set. And they can apply experience and best practices acquired from a range of industries. This is an especially important benefit, given that big data is still a new field with a dearth of expertise. Consulting for big data can also serve the training needs of in-house teams and be used to augment employees' skill sets. Consultant support can be used for extension projects and for regular maintenance.
Creating business value through analytics
Big data deployments are all about maximizing information gain. Apache Hadoop provides the technical prowess and infrastructure for handling the three V's of data: volume, variety, and velocity. But accumulating and handling all of that data has no point unless the data can be analyzed. Data can come in from multiple data sources: flat files, databases, packaged applications, enterprise resource planning (ERP) or customer relationship management (CRM) systems, or as streams. Managing the data and storing it, which Hadoop is adept at, come first. But data management and storage do not in themselves provide any business value. Business value comes from analyzing the data. (This is where relational databases are failing. They can store large data volumes but can't process them efficiently in real time.)
To analyze data stored in Hadoop, applications designed for that purpose must be built on top of Hadoop. They can be statistical data visualization tools or analytics tools. If they are not built from scratch, software such as IBM SPSS, SAS, or R must be linked to Hadoop through APIs. Even Google, which invented MapReduce, now uses it only to collect and organize data. For analysis Google uses Dremel, a scalable query system for analysis of read-only nested data.
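For readers unfamiliar with the MapReduce model mentioned above, the following is a minimal single-process sketch of its three stages (map, shuffle, reduce) in plain Python. It does not use the Hadoop API, and the record format, function names, and sample data are assumptions made for illustration. It shows the kind of collecting and organizing work that MapReduce handles well: counting incoming records per source system.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def source_mapper(line):
    # Hypothetical record format: "<source system>,<event type>"
    source, _event = line.split(",")
    yield source, 1


def count_reducer(_key, values):
    return sum(values)


records = ["crm,order", "erp,invoice", "crm,complaint", "stream,click", "crm,order"]
counts = reduce_phase(shuffle(map_phase(records, source_mapper)), count_reducer)
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network. The sketch only shows the data flow, and the output is organized data rather than analysis.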
Enterprises — even those that aren't large-scale Internet companies dealing with petabytes of data — still have ample use cases for analytics, including:
- Risk analysis in financial services
- Fraud detection
- Programmatic split-second trading
- Understanding customer behavior for insurance purposes
- Understanding customer behavior to improve credit risk management
- Analyzing vendor performance in high-speed services businesses or for optimizing related services
- Healthcare analytics
- Manufacturing and monitoring of smart products, such as those embedded with radio-frequency ID (RFID) tags (for example, in courier services or inventory systems)
- Cost management
- Sensor data analysis
- Customer transaction analysis for marketing purposes (for example, in the telecom industry, which frequently offers call and data packages based upon prevailing customer trends)
- Marketing campaigns conducted through social media
Traditional data-analysis or business intelligence tools can't analyze the volumes of data used for these purposes. The software you use must be able not only to perform large-scale analysis but also to drill down to details to work out the action required for whatever business purpose the analysis serves. This capability, getting the actionable information nuggets, is the Holy Grail of analytics. It's also where most big data analytics fails. You can do one or the other: The more large-scale analysis you do, the less you can drill down into details, and vice versa.
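The two operations contrasted above can be sketched on a toy scale. The example below is an illustration under assumed data, not a depiction of any product's internals: `roll_up` produces the large-scale summary, and `drill_down` returns the detail behind one summary line. The transaction tuples, field order, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions: (region, customer_id, amount in whole currency units).
TRANSACTIONS = [
    ("north", "c1", 120),
    ("north", "c2", 80),
    ("south", "c3", 40),
    ("north", "c1", 300),
    ("south", "c4", 35),
]


def roll_up(transactions):
    """Large-scale view: total amount per region."""
    totals = defaultdict(int)
    for region, _customer, amount in transactions:
        totals[region] += amount
    return dict(totals)


def drill_down(transactions, region):
    """Detailed view: per-customer totals inside one region, largest first."""
    totals = defaultdict(int)
    for txn_region, customer, amount in transactions:
        if txn_region == region:
            totals[customer] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


summary = roll_up(TRANSACTIONS)
top_region = max(summary, key=summary.get)
details = drill_down(TRANSACTIONS, top_region)

print(summary)
print(top_region, details)
```

With five rows, both views are trivial to compute. The difficulty described above arises when the drill-down has to rescan billions of records for every question the summary raises.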
InfoSphere BigInsights enables large-scale analytics and deep insight. Using its included Hadoop implementation, it keeps exploratory analysis of large volumes in mind and enables multi-structured data insight not previously possible. It supports built-in data compression and features such as the JSON Query Language (JAQL) to support easy manipulation and analysis of semi-structured JSON data. On top of it all, it features MapReduce-based text and machine-learning analytics. This is of critical importance because when trying to get insight from large-scale data, it is often impossible to know what exactly one is looking for. Machine learning is useful for discovering and forecasting patterns and trends and for extracting a statistical model, if any, from unstructured data.
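To show what manipulating semi-structured JSON data involves, here is a small sketch in Python. It is not JAQL and does not use any BigInsights API. It only illustrates the filter-and-transform style of query that such a language supports, applied to records whose fields may be missing or nested. The record layout and field names are assumptions.

```python
import json

# Hypothetical input: one JSON document per line, with optional and nested fields.
raw_lines = [
    '{"user": "a", "device": {"os": "android"}, '
    '"events": [{"type": "click"}, {"type": "buy", "amount": 30}]}',
    '{"user": "b", "events": [{"type": "click"}]}',
    '{"user": "c", "device": {"os": "ios"}, "events": []}',
]


def purchases(lines):
    """Filter nested events down to purchases and flatten them into uniform rows."""
    for line in lines:
        record = json.loads(line)
        # Tolerate records that lack a "device" object altogether.
        os_name = record.get("device", {}).get("os", "unknown")
        for event in record.get("events", []):
            if event.get("type") == "buy":
                yield {
                    "user": record["user"],
                    "os": os_name,
                    "amount": event.get("amount", 0),
                }


result = list(purchases(raw_lines))
print(result)
```

The defaults passed to `get` are the important part of the design: semi-structured sources rarely guarantee that every record has every field, so the query has to decide what a missing value means.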
Integration with legacy and third-party systems
For practical reasons, advanced applications such as ERP software can't currently be built on top of Hadoop. Instead, data from third-party systems must be integrated seamlessly with Hadoop-like systems. The most common way of bringing in web-based data is through SOAP. For other applications, specialized connectors are required that are mostly built with Java™, .NET, or C++. You can either develop these custom integration programs or use a product such as IBM Netezza. In addition to providing a large library of parallelized advanced and predictive algorithms, Netezza enables you to create custom analytics in a number of programming languages (including C, C++, Java, Perl, Python, and R). It enables the integration of SPSS® or of analytic software from companies such as SAS, Revolution Analytics (for Enterprise R), Fuzzy Logix, and Zementis. Its programmatic interface also enables integration with virtually any ERP system that has connectors for C and Java (such as SAP's JCo Java connector).
InfoSphere BigInsights goes a step further in the category of third-party integration by supporting Cloudera's distribution of Hadoop in addition to IBM's. Cloudera support is important because Cloudera has a large customer base. Now those customers can easily use BigInsights tools.
For data streams from multiple sources, BigInsights can connect directly to DB2®, Netezza, and PureData™. It also comes with BigIndex, a MapReduce facility that builds indices for search-based analytic applications.
Hadoop that leverages integrated analytics capabilities is the ideal stack for enterprise use. Vanilla Hadoop, which can't easily take advantage of analytics applications, offers no business value in and of itself. And developing analytic and cross-application features and support from scratch to support vanilla Hadoop is a mammoth, time-consuming, and probably prohibitively expensive task. Enterprise Hadoop products such as InfoSphere BigInsights solve the technical issues associated with deployment, make consulting easily available and sustainable, and feature seamless integration with a large number of legacy and contemporary systems. Enterprise Hadoop includes leading-edge analytics tools for gaining insights from data itself, and for merging it with Internet-based and sensor data to glean hidden nuggets of actionable information.
Resources
- Visit the Apache Hadoop project website.
- Explore Hadoop on developerWorks and discover a wealth of articles and other resources on Apache Hadoop and related technologies.
- Check out the Big Data Glossary to read about 60 recent innovations in big data technology.
- Hadoop: The Definitive Guide explains how to build and maintain distributed systems with the Hadoop framework.
- "MapReduce and Parallel DBMSs: Friends or Foes?" describes how MapReduce systems complement relational databases.
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- InfoSphere Guardium ensures the privacy and integrity of trusted information in your data center, reducing costs by automating the entire compliance auditing process in heterogeneous environments.
- PureData System is optimized for delivering data services to today's demanding applications. PureData System for Hadoop is the newest member of the IBM PureSystem family.
- Netezza Analytics is an embedded, purpose-built, advanced analytics platform.
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Download InfoSphere Streams, available as a native software installation or as a VMware image.
- Use InfoSphere Streams on IBM SmartCloud Enterprise.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Ask questions and get answers in the InfoSphere BigInsights forum.
- Ask questions and get answers in the InfoSphere Streams forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.