Every day, we create 2.5
quintillion bytes of data, so much that 90% of the data in the world
today has been created in the last two years alone. This data comes
from everywhere: sensors used to gather climate information,
posts to social media sites, digital pictures and videos posted
online, transaction records of online purchases, and cell phone
GPS signals, to name a few. This data is big data.
Apache Hadoop is an
open-source software project that enables the distributed processing of
large data sets across clusters of commodity servers. It is designed
to scale up from a single server to thousands of machines, with a
very high degree of fault tolerance. Rather than relying on high-end
hardware, the resiliency of these clusters comes from the software’s
ability to detect and handle failures at the application layer.
A key part of
Hadoop is the Hadoop Distributed File System (HDFS). HDFS is a
storage system in which the input data is partitioned
across several servers. With the massive amount of data proliferating
across the Web, companies like IBM, Google, Yahoo and many others are
building new technologies to sort through it all. Core to that movement is
MapReduce, the framework that understands and
assigns work to the nodes in a cluster. In other words, it splits a
large job into smaller pieces, operates on those pieces in parallel,
and then pieces the results together to form the big picture again,
an approach that has proven extremely successful.
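The "operate on smaller bits, then piece the results together" idea can be illustrated with a word count. The following is only a minimal single-process sketch in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and on a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk stands in for one partition of the input data.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Because the map step looks at one chunk at a time and the reduce step looks at one key at a time, both can in principle be spread across many machines, which is the property the framework exploits.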
Data sources turning into
Insight sources
Media/entertainment: The
media/entertainment industry moved to digital recording, production,
and delivery in the past five years and is now collecting large
amounts of rich content and user viewing behaviors.
Life sciences: Low-cost
gene sequencing (<$1,000) can generate tens of terabytes of
information that must be analyzed to look for genetic variations and
potential treatment effectiveness.
Transportation, logistics,
retail, utilities, and telecommunications: Sensor data is being
generated at an accelerating rate from fleet GPS transceivers,
RFID tag readers, smart meters, and cell phones (call data records
[CDRs]); that data is used to optimize operations and drive
operational business intelligence (BI) to realize immediate business
opportunities.
Social Media: Doing
analytics over social media means analyzing the huge volume of data
flowing through the internet on Facebook, Twitter, blogs,
forums, and similar sites, in order to understand at a deep level what
attributes are being associated with your brand and whether they reflect
the goals that you set for yourself. It not only provides a way
to view your brand image, it also provides a platform where you can
perform competitive analysis and influence analysis and gauge
virality from this huge data set. Many organizations now
require sophisticated sentiment analysis to identify public
opinion about their company, market, or the economy as a whole.
Log Analysis: Almost 30% of the
data in an organization consists of logs. Multiple applications, servers
and networks generate tremendous amounts of data. These log files can
grow to be hundreds of gigabytes in size. Log files contain important
information about the performance of the network or servers,
about security threats or breaches, and data that can be
useful for auditing. Mining can be performed to detect
patterns in the logs that help in further analysis such as
root cause analysis, prediction, and auditing.
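As a rough illustration of mining logs for patterns, the sketch below counts log lines by severity and finds which hosts produce the most errors. The log layout ("timestamp host LEVEL message") and the summarize function are assumptions made for this example only; real log formats vary widely, and at the scale discussed here the same counting would be distributed rather than run in one process.

```python
import re
from collections import Counter

# Assumed (hypothetical) log line layout: "<timestamp> <host> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\S+) (\S+) (INFO|WARN|ERROR) (.*)$")


def summarize(lines):
    """Return (count per severity level, error count per host)."""
    levels = Counter()
    errors_by_host = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        _timestamp, host, level, _message = match.groups()
        levels[level] += 1
        if level == "ERROR":
            errors_by_host[host] += 1
    return levels, errors_by_host


if __name__ == "__main__":
    sample = [
        "2012-02-02T10:00:00 web01 INFO started",
        "2012-02-02T10:00:01 web01 ERROR disk full",
        "2012-02-02T10:00:02 db01 ERROR timeout",
        "2012-02-02T10:00:03 web01 ERROR disk full",
    ]
    levels, errors_by_host = summarize(sample)
    print(levels)
    print(errors_by_host.most_common(1))
```

A host that dominates the error count is a natural starting point for root cause analysis.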
Healthcare: The healthcare
industry is quickly moving to electronic medical records and images,
which it wants to use for short-term public health monitoring and
long-term epidemiological research programs.
Smart instrumentation: The
use of intelligent meters in "smart grid" energy systems
that shift from a monthly meter read to an "every 15 minutes"
meter read can translate into a multi-thousandfold increase in data
generated.
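The size of that increase can be checked with simple arithmetic. The sketch below assumes a 30-day month (an assumption for illustration; the helper name reads_per_month is also hypothetical) and counts how many reads one meter produces at a given interval.

```python
MINUTES_PER_DAY = 60 * 24


def reads_per_month(interval_minutes, days_in_month=30):
    """Number of meter reads in one month at the given read interval.

    days_in_month defaults to 30, an assumption made for this example.
    """
    return (MINUTES_PER_DAY * days_in_month) // interval_minutes


if __name__ == "__main__":
    monthly_reads = 1  # one read per month under the old scheme
    smart_reads = reads_per_month(15)
    print(smart_reads // monthly_reads)  # growth factor per meter
```

Under these assumptions one meter goes from 1 read to 2,880 reads a month, which is consistent with the "multi-thousandfold" figure above.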
Ad Targeting: The amount
of data that is analyzed to decide which ads should display on your
favorite web site is staggering. Large advertising networks also have
to collect data on millions of clicks and provide useful information
to their clients about how users are responding to those ads.
Analyzing this data to determine how to best serve ads is another
ideal application for Hadoop. Advertising networks like Acknowledge
use Hadoop to analyze the millions of clicks and determine which ads
to display to the user.
Financial Analysis: The
finance industry is another segment that generates large volumes of
data. Hadoop can help solve financial problems by analyzing large
sets of transactions or stock prices. It can be used to spot trends
that might suggest fraud. Consider, for instance, the potential of
correlating point-of-sale data (available to a credit card issuer)
with web behavior analysis (either on the bank's site or externally),
and cross-referencing it with data from other financial institutions or
service providers such as First Data or SWIFT. It also helps find ways to
improve the bottom line or discover trends in the stock market to
help investors pick better investments. It can be used for
technical analysis of various stocks and for many other use cases in
the financial sector.
Video surveillance: Video
surveillance is still transitioning from CCTV to IPTV cameras and
recording systems that organizations want to analyze for behavioral
patterns (security and service enhancement).
Email Analytics:
Organizations with large email archives require efficient, flexible
storage, search and query capabilities across headers, subjects, bodies
and attachments. Other benefits, such as classifying mail and
extracting sentiment from customer and client emails, support
many business objectives.
Hadoop: the Big Answer to
the Big Questions of the Big Data!
Does it provide more
useful information? For example, a major retailer might implement a
digital video system throughout its stores, not only to monitor theft
but also to feed a Big Data system that analyzes the flow of shoppers
through the store at different times of the day, week, and year,
including demographic information such as gender and age. It could
also compare flows in different regions with core customer
demographics. This makes it easier for the retailer to tune
layouts and promotion spaces on a store-by-store basis.
Does it improve the
fidelity of the information? For example, IDC spoke to several earth
science and medical epidemiological research teams using Big Data
systems to monitor and assess the quality of data being collected
from remote sensor systems; they are using Big Data not just to look
for patterns but to identify and eliminate false data caused by
malfunctions, user error, or temporary environmental anomalies.
Does it improve the
timeliness of the response? For example, several private and
government energy providers and financial organizations around the
world are deploying Big Data systems to reduce the time to detect
usage patterns or insurance fraud from months to days.
While it is pretty easy to
see that Hadoop is a powerful suite of tools, it is important to
evaluate where it might be useful and the kinds of tasks that
would benefit from it.
IBM InfoSphere BigInsights
builds on open source Apache Hadoop with unique IBM innovations
including a sophisticated text analytics module, IBM BigSheets for
data exploration, and a variety of performance, reliability, security
and administrative features. BigInsights is able to ingest and
analyze data in its native format, without imposing a
schema/structure, enabling fast ad-hoc analysis.
Computing over large amounts of
data is one aspect of big data; crunching unstructured data to
get insights provides more value to the enterprise. Eighty percent of
the world’s data is unstructured, and most businesses don’t even
attempt to use this data to their advantage. Imagine if you could
afford to keep all the data generated by your business. Imagine if
you had a way to analyze that data. The IBM BigInsights Text Analytics
component makes information extraction more scalable and easier to use.
The component is built around AQL, a declarative rule language with a
familiar SQL-like syntax. AQL replaces multiple obscure languages
typically used to build annotators. Because AQL is a declarative
language, rule developers can focus on what to extract, allowing its
cost-based optimizer to determine the most efficient execution plan
for the annotator.
BigSheets is another
component of BigInsights Enterprise Edition: an insight engine
for line-of-business professionals that lets them get insights
from web-scale data (really large data sets). The "BigSheets"
name came from the idea that users can use a "spreadsheet
metaphor" in a browser to analyze large sets of data. In
essence, it provides a big data worksheet, hence the name.
By building on top of the
Hadoop infrastructure, BigSheets is able to process large amounts of
data quickly and efficiently.
Conclusion
Addressing all three
dimensions of the big data challenge (Volume, Variety and
Velocity) takes due diligence: with it, we can identify the aspects of
an application that are best crunched by Hadoop. Hadoop is not going
to replace your database, but your database isn’t likely to replace
Hadoop either. It is the combination of internal and external, and
structured and unstructured data from unrelated sources that has the
potential to truly revolutionize the industry.
Big data is more than a
challenge; it is an opportunity to find insight in new and emerging
types of data, to make your business more agile, and to answer
questions that, in the past, were beyond reach. Until now, there was
no practical way to harvest this opportunity. Today, IBM’s platform
for big data uses technologies like Hadoop to open the door to a
world of possibilities.
Forrester Research, Inc.
views Hadoop as "the open source heart of Big Data",
regarding it as "the nucleus of the next-generation EDW
[enterprise data warehouse] in the cloud," and has published its
first-ever report on the market, The Forrester Wave (tm): Enterprise
Hadoop Solutions (February 2, 2012). This report evaluates 13 vendors
against 15 criteria and places IBM in the Leaders category.
References
Villars R.L., Olofson C.W., Eastwood M.,
“Big Data: What It Is and Why You Should Care”, IDC,
June 2011.
The Forrester Wave (tm):
Enterprise Hadoop Solutions report (February 2, 2012)