Leverage the benefits of enterprise Hadoop

Why commercial Hadoop implementations are optimal for enterprise deployments

MapReduce implementations are the technology of choice for enterprises that want to analyze big data at rest. Businesses have a choice between purely open source MapReduce implementations — most notably, Apache Hadoop — and commercial implementations. Here, the authors argue the case that enterprise requirements are better served by Hadoop-based products, such as InfoSphere® BigInsights™, than by "vanilla" Hadoop.


Areeb Kamran (areeb.cs@gmail.com), ERP Consultant

Areeb Kamran holds a graduate degree in computer systems. He has been working for a Fortune 500 multinational for the past three years as an ERP consultant with a primary focus on materials management and supply chain. He is also actively involved in academic research in machine learning and its application in business reporting, forecasting, and analytics.



Salman Ul Haq (salman@tunacode.com), CEO, TunaCode

Salman Ul Haq is a technology entrepreneur from Pakistan. He is the CEO and co-founder of TunaCode, a technology startup focused on delivering GPU computing solutions to the scientific community as well as commercial companies that require massive compute power. He is also the co-founder and chief editor of ProgrammerFish, a leading technology blog read by thousands daily. His main role includes writing and managing the blog and interacting with clients, readers, and subscribers.



16 July 2013


Analytics are at the core of any enterprise big data deployment. Relational databases are still the best technology for running transactional applications — certainly crucial for most enterprises — but when it comes to data analysis, relational databases are showing signs of stress. Enterprise adoption of Apache Hadoop (or Hadoop-like big data systems) reflects a focus on performing analytics rather than on merely storing transactions.

To implement Hadoop or Hadoop-like systems with analytics successfully, an enterprise must address a set of readiness issues that fit into four categories:

  • Security: Preventing data theft and controlling access
  • Support: Documentation and consulting
  • Analytics: The minimum analytics feature set that enterprises require
  • Integration: Compatibility with legacy or third-party products for data migration or data exchange

Using these four categories as the basis for comparison, this article makes the case that for big data analytics, enterprises are better served by commercial Hadoop products, such as InfoSphere BigInsights, than by open source "vanilla" Hadoop installations.

InfoSphere BigInsights

InfoSphere BigInsights is the IBM distribution of Hadoop. It includes core Hadoop (the Hadoop Distributed File System and MapReduce) and other services from the Hadoop ecosystem, such as Apache Pig, Hive, and ZooKeeper. It adds operational-excellence features such as compression optimized for big data, workload management, scheduling capabilities, and an application development and deployment ecosystem.

Preventing data theft and controlling access

Security issues are a common concern in Hadoop deployments. By design, Hadoop stores and processes unstructured data that originates from multiple sources. Access control, data entitlement, and ownership problems can result. IT managers need to control access to the data entering the system and the data going out. The fact that a Hadoop (or Hadoop-like) environment includes data with various classifications and sensitivity levels can exacerbate access-control problems. The ultimate risks are data theft and inappropriate data access or disclosure.

Data theft is an endemic problem at the enterprise level, and attacks on corporate IT systems are common. Traditional relational systems have long addressed these issues, but implementing comparable solutions for big data systems is a different matter, given the new set of technologies in play. By default, most big data systems do not encrypt data at rest, an issue that must be addressed as a first step. The clusters that hold the data also need to be administered. Again, relational systems have overcome similar problems; but because mature cluster-administration tools aren't yet available for Hadoop-like systems, unwanted direct access to data files or data-node processes remains possible.
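
Because stock Hadoop of this era does not encrypt data at rest, one stopgap is to encrypt sensitive records on the client before they land in HDFS. The following is a minimal sketch of that idea using standard Java cryptography and the HDFS FileSystem API; the /secure path is a placeholder, and the ad hoc key generation stands in for what should be a proper key-management system.

import java.security.SecureRandom;

import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptedHdfsWriter {
    public static void main(String[] args) throws Exception {
        // In production the key would come from a key-management system,
        // never be generated ad hoc like this.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));

        Configuration conf = new Configuration(); // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);

        FSDataOutputStream raw = fs.create(new Path("/secure/records.enc"));
        raw.write(iv); // store the IV alongside the ciphertext
        try (CipherOutputStream out = new CipherOutputStream(raw, cipher)) {
            out.write("sensitive record data".getBytes("UTF-8"));
        }
        fs.close();
    }
}

Decryption on read is the mirror image: read the IV from the head of the file, then wrap the input stream in a CipherInputStream initialized for decryption.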

Furthermore, merging multiple datasets for analysis creates a new dataset that can require its own access controls. Roles that were applied to the individual data sources must now be defined for the combination. Clear role boundaries must be drawn on either a technical or a functional basis, and neither option is perfect. Establishing roles on a functional basis is easier for administrators to implement when datasets are merged, but it can allow users to snoop into data beyond their remit. Establishing roles on a technical basis can secure the original data nodes but create access problems once the data is combined. The built-in access-control and security features in Hadoop Distributed File System (HDFS) cannot resolve this dilemma on their own. Some companies that use Hadoop build new environments to store merged datasets, or they protect access to the merged data through customized firewalls.
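
To make the limitation concrete, here is a minimal sketch of what HDFS's built-in POSIX-style controls offer for a merged dataset: a directory owned by a role defined for the combination of sources. The /merged path, the etl-service user, and the fraud-analysts group are hypothetical; note that this gives only coarse user/group/other permissions, not the fine-grained, data-level policies the text describes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class MergedDatasetAcl {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path merged = new Path("/merged/claims_x_policies");

        fs.mkdirs(merged);
        // Assign the directory to a role defined for the *combination*
        // of sources, not to either original data owner.
        fs.setOwner(merged, "etl-service", "fraud-analysts");
        // rwx for the owner, r-x for the analyst group, nothing for others.
        fs.setPermission(merged, new FsPermission(
                FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
    }
}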

Products such as InfoSphere Guardium® Data Security can close this gap in Hadoop-based systems. InfoSphere Guardium Data Security automates the entire compliance-auditing process in heterogeneous environments through features such as auto-discovery of sensitive data, automated compliance reporting, and data-level access control.


Documentation and consulting

Lack of documentation is another common enterprise concern. Roles and specifications change, and consultants and employees leave. Unless roles and specifications are well documented, many efforts must start from square one when a change occurs. This is a major issue with open source Apache Hadoop. In contrast, structured Hadoop-based products designed for enterprises, such as IBM InfoSphere BigInsights, resolve it by providing structured documentation and enterprise-level support. And because BigInsights is built on Apache Hadoop, everything developed for the open source version of Hadoop works with BigInsights, too.

By deploying a product such as InfoSphere BigInsights, an enterprise gains the benefit of external support. For business reasons, large enterprises usually keep a support team only for core IT functions; complex deployments are almost impossible for such teams to carry out, given their level of technical expertise. Some small companies specialize in helping larger companies carry out complex Hadoop deployments, and succeed at it. But small firms can't be relied on for long-term support, because they might not exist for the long term.

The structured consulting and support that a major vendor provides address these issues. A standard Hadoop version can be deployed, tracked, and supported to meet enterprise needs and expectations. External consultants can also assume the roles of full-time employees — but with the right skill set. And they can apply experience and best practices acquired from a range of industries. This is an especially important benefit, given that big data is still a new field with a dearth of expertise. Consulting for big data can also serve the training needs of in-house teams and be used to augment employees' skill sets. Consultant support can be used for extension projects and for regular maintenance.


Creating business value through analytics

Big data deployments are all about maximizing information gain. Apache Hadoop provides the technical prowess and infrastructure for handling the three V's of data: volume, variety, and velocity. But accumulating and handling all of that data is pointless unless the data can be analyzed. Data can come in from multiple sources: flat files, databases, packaged applications, enterprise resource planning (ERP) or customer relationship management (CRM) systems, or streams. Managing and storing the data, which Hadoop is adept at, comes first. But data management and storage do not in themselves provide any business value; business value comes from analyzing the data. (This is where relational databases fall short: they can store large data volumes but can't process them efficiently in real time.)

To analyze data stored in Hadoop, applications designed for that purpose, such as statistical, data visualization, or analytics tools, must be built on top of Hadoop. If they are not built from scratch, existing software such as IBM SPSS, SAS, or R must be linked to Hadoop through APIs. Even Google, which invented MapReduce, now uses it only to collect and organize data; for analysis, Google uses Dremel, a scalable query system for analysis of read-only nested data.
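
One widely used way to link external tools to Hadoop-resident data is through Hive's JDBC driver, which lets an application issue SQL over data in HDFS while Hive compiles the query into MapReduce jobs behind the scenes. The sketch below assumes a HiveServer2 endpoint and a clickstream table that are purely illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server.example.com:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits " +
                 "FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}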

Enterprises — even those that aren't large-scale Internet companies dealing with petabytes of data — still have ample use cases for analytics, including:

  • Risk analysis in financial services
  • Fraud detection
  • Programmatic split-second trading
  • Understanding customer behavior for insurance purposes
  • Understanding customer behavior to improve credit risk management
  • Analyzing vendor performance in high-speed services businesses and optimizing related services
  • Healthcare analytics
  • Manufacturing and monitoring of smart products embedded with radio-frequency ID (RFID) tags, as in courier services or inventory systems
  • Cost management
  • Sensor data analysis
  • Customer transaction analysis for marketing purposes (for example, in the telecom industry, which frequently offers call and data packages based upon prevailing customer trends)
  • Marketing campaigns conducted through social media

Traditional data-analysis and business intelligence tools can't handle the volumes of data used for these purposes. The software you use must not only perform large-scale analysis but also be capable of drilling down to the details needed to decide what action the analysis calls for. This capability, getting to the actionable nuggets of information, is the Holy Grail of analytics. It's also where most big data analytics fails: typically you can do one or the other, so the more large-scale analysis you do, the less you can drill down into details, and vice versa.

InfoSphere BigInsights enables both large-scale analytics and deep insight. Its included Hadoop implementation is designed with exploratory analysis of large volumes in mind and enables insight into multi-structured data that was not previously possible. It supports built-in data compression and features such as the JSON Query Language (JAQL) for easy manipulation and analysis of semi-structured JSON data. On top of it all, it features MapReduce-based text and machine-learning analytics. This is critically important because, when trying to gain insight from large-scale data, it is often impossible to know exactly what one is looking for. Machine learning is useful for discovering and forecasting patterns and trends and for extracting a statistical model, if any exists, from unstructured data.
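
BigInsights' text-analytics engine is its own technology, but the underlying MapReduce pattern it builds on can be shown with plain Hadoop. The following is a minimal term-frequency job, the simplest form of MapReduce-based text analytics: mappers tokenize lines and emit (term, 1) pairs, and reducers sum the counts per term. Class names and paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermFrequency {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text term = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Emit (term, 1) for every whitespace-delimited token.
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                term.set(tok.nextToken().toLowerCase());
                ctx.write(term, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            // Sum all partial counts for this term.
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term frequency");
        job.setJarByClass(TermFrequency.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // same logic works map-side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}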


Integration with legacy and third-party systems

PureData System for Hadoop

PureData System for Hadoop is a purpose-built, standards-based, expert integrated system that architecturally integrates IBM InfoSphere BigInsights. It optimizes Hadoop data services for big data analytics and online archiving with appliance simplicity. You gain enterprise Hadoop capabilities with easy-to-use analytics and visualization tools for business analysts and data scientists, rich developer tools, powerful analytic functions, exceptional administration and management capabilities, and the latest versions of Hadoop and associated projects. It also provides enhanced big data tools for monitoring, development, and integration with a wide range of enterprise systems.

For practical reasons, advanced applications such as ERP software can't currently be built on top of Hadoop. Instead, data from third-party systems must be integrated seamlessly with Hadoop-like systems. The most common way of bringing in web-based data is through SOAP. Other applications require specialized connectors, mostly built in Java™, .NET, or C++. You can either develop these custom integration programs yourself or use a product such as IBM Netezza. In addition to providing a large library of parallelized advanced and predictive algorithms, Netezza enables you to create custom analytics in a number of programming languages (including C, C++, Java, Perl, Python, and R). It enables integration of SPSS® and of analytic software from companies such as SAS, Revolution Analytics (for Enterprise R), Fuzzy Logix, and Zementis. Its programmatic interface also enables integration with virtually any ERP system that has connectors for C and Java (such as SAP's JCo Java connector).
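
The connector pattern reduces to two steps: call the third-party service, then land the response in Hadoop for later analysis. Here is a hedged sketch in Java: it POSTs a SOAP request and streams the response into HDFS. The endpoint, SOAPAction header, and GetOrders operation are invented placeholders, not a real vendor API.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SoapToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical SOAP envelope for an imagined GetOrders operation.
        String envelope =
            "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">"
          + "<soapenv:Body><GetOrders xmlns=\"urn:example:orders\"/></soapenv:Body>"
          + "</soapenv:Envelope>";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://erp.example.com/services/orders").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setRequestProperty("SOAPAction", "urn:example:orders:GetOrders");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(envelope.getBytes("UTF-8"));
        }

        // Stream the SOAP response straight into an HDFS staging file.
        FileSystem fs = FileSystem.get(new Configuration());
        try (InputStream in = conn.getInputStream();
             FSDataOutputStream hdfsOut = fs.create(new Path("/staging/orders.xml"))) {
            IOUtils.copyBytes(in, hdfsOut, 4096, false);
        }
    }
}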

InfoSphere BigInsights goes a step further in the category of third-party integration by supporting Cloudera's distribution of Hadoop in addition to IBM's. Cloudera support is important because Cloudera has a large customer base. Now those customers can easily use BigInsights tools.

For data streams from multiple sources, BigInsights can connect directly to DB2®, Netezza, and PureData™. It also comes with BigIndex, a MapReduce facility that builds indices for search-based analytic applications.
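BigInsights ships its own connectors for these sources, but the essence of what such a connector does can be approximated with generic JDBC, the kind of work tools like Sqoop also automate. The driver class and URL below follow IBM's documented DB2 JDBC conventions; the SAMPLE database, ORDERS table, and credentials are placeholders.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Db2ToHdfs {
    public static void main(String[] args) throws Exception {
        // IBM's type-4 JDBC driver for DB2.
        Class.forName("com.ibm.db2.jcc.DB2Driver");
        FileSystem fs = FileSystem.get(new Configuration());

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:db2://db2host.example.com:50000/SAMPLE", "dbuser", "dbpass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT CUSTOMER_ID, TOTAL FROM ORDERS");
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                 fs.create(new Path("/staging/orders.csv")), "UTF-8"))) {
            // Write each relational row out as one CSV line in HDFS.
            while (rs.next()) {
                out.write(rs.getString(1) + "," + rs.getBigDecimal(2));
                out.newLine();
            }
        }
    }
}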


Conclusion

Hadoop that leverages integrated analytics capabilities is the ideal stack for enterprise use. Vanilla Hadoop, which can't easily take advantage of analytics applications, offers no business value in and of itself. And developing analytics features, cross-application integration, and support from scratch for vanilla Hadoop is a mammoth, time-consuming, and probably prohibitively expensive task. Enterprise Hadoop products such as InfoSphere BigInsights solve the technical issues associated with deployment, make consulting easily available and sustainable, and integrate seamlessly with a large number of legacy and contemporary systems. Enterprise Hadoop includes leading-edge analytics tools for gaining insights from enterprise data and for merging it with Internet-based and sensor data to glean hidden nuggets of actionable information.
