Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and cell phone GPS signals, to name a few. This data is big data.
Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
A core component of Hadoop is the Hadoop Distributed File System (HDFS), a storage system in which the input data is partitioned across several servers. With the massive amount of data proliferating across the Web, companies like IBM, Google, Yahoo and many others are building new technologies to sort through it all. Core to that movement is something called MapReduce, the framework that understands and assigns work to the nodes in a cluster. In other words, the work is done on smaller pieces of the data and the partial results are then pieced together to form the big picture again, an approach that has proven extremely successful.
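To make the "operate on smaller pieces, then piece the results together" idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in plain Python. It does not use the Hadoop API; the function names (partition, map_phase, shuffle, reduce_phase, word_count) and the word-counting task are illustrative assumptions, and the "nodes" are simulated by list chunks.

```python
from collections import defaultdict


def partition(records, num_nodes):
    """Split the input into chunks, one per simulated node (HDFS-style partitioning)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in this node's chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key across every node's output."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_nodes=3):
    chunks = partition(lines, num_nodes)
    # On a real cluster each chunk would be mapped in parallel on a different node.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data is big", "hadoop processes big data", "data is everywhere"]
    print(word_count(sample))
```

Each chunk is mapped independently, which is what lets a real cluster spread the work over many machines; only the shuffle and reduce steps need to see results from more than one chunk.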
Data sources turning into Insight sources
Media/entertainment: The media/entertainment industry moved to digital recording, production, and delivery in the past five years and is now collecting large amounts of rich content and user viewing behaviors.
Life sciences: Low-cost gene sequencing (<$1,000) can generate tens of terabytes of information that must be analyzed to look for genetic variations and potential treatment effectiveness.
Transportation, logistics, retail, utilities, and telecommunications: Sensor data is being generated at an accelerating rate from fleet GPS transceivers, RFID tag readers, smart meters, and cell phones (call data records [CDRs]); that data is used to optimize operations and drive operational business intelligence (BI) to realize immediate business opportunities.
Social Media: Analytics over social media means analyzing the huge volume of data flowing through the internet on Facebook, Twitter, blogs, forums and similar sites, in order to understand at a deep level what attributes are being associated with your brand and whether they reflect the goals you have set for yourself. It not only gives you a way to view your brand image, it also provides a platform for competitive analysis, influence analysis and measuring virality across this huge data set. Many organizations now require sophisticated sentiment analysis to identify public opinion about their company, market, or the economy as a whole.
Log Analysis: Almost 30% of the data in an organization consists of logs. Multiple applications, servers and networks generate tremendous amounts of log data, and individual log files can grow to hundreds of gigabytes in size. Log files contain important information about the performance of networks and servers, about security threats and breaches, and about activity that is useful for auditing. Mining the logs can reveal patterns that help in further analysis such as root cause analysis, prediction and auditing.
Healthcare: The healthcare industry is quickly moving to electronic medical records and images, which it wants to use for short-term public health monitoring and long-term epidemiological research programs.
Smart instrumentation: The use of intelligent meters in "smart grid" energy systems that shift from a monthly meter read to an "every 15 minute" meter read can translate into a multi-thousandfold increase in data generated.
Ad Targeting: The amount of data that is analyzed to decide which ads should display on your favorite web site is staggering. Large advertising networks also have to collect data on millions of clicks and provide useful information to their clients about how users are responding to those ads. Analyzing this data to determine how to best serve ads is another ideal application for Hadoop. Advertising networks like Acknowledge use Hadoop to analyze the millions of clicks and determine which ads to display to the user.
Financial Analysis: The finance industry is another segment that generates large volumes of data. Hadoop can help solve financial problems by analyzing large sets of transactions or stock prices. It can be used to spot trends that might suggest fraud. Consider, for instance, the potential of correlating point-of-sale data (available to a credit card issuer) with web behavior analysis (either on the bank's site or externally), and cross-examining it with data from other financial institutions or service providers such as First Data or SWIFT. It can also help find ways to improve the bottom line, discover trends in the stock market to help investors pick better investments, perform technical analysis of various stocks, and support many other use cases in the financial sector.
Video surveillance: Video surveillance is still transitioning from CCTV to IPTV cameras and recording systems that organizations want to analyze for behavioral patterns (security and service enhancement).
Email Analytics: Organizations with large email archives require efficient, flexible storage, search and query capabilities across headers, subjects, bodies and attachments. Other benefits, such as classifying mail and extracting sentiment from customer and client emails, serve many business objectives.
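The "multi-thousandfold" figure quoted for smart instrumentation above can be checked with simple arithmetic. The sketch below assumes a 30-day month and one record per meter read; the function name reads_per_month is illustrative and not taken from any product.

```python
MINUTES_PER_DAY = 24 * 60


def reads_per_month(interval_minutes, days_in_month=30):
    """Number of meter reads in one month at the given read interval."""
    return (MINUTES_PER_DAY // interval_minutes) * days_in_month


if __name__ == "__main__":
    monthly = 1  # the traditional single read per month
    every_15_min = reads_per_month(15)  # 96 reads per day * 30 days
    print(f"Reads per meter per month: {every_15_min}")
    print(f"Increase over a monthly read: {every_15_min // monthly}x")
```

Under these assumptions a single meter goes from 1 read to 2,880 reads per month, before multiplying by the number of meters on the grid.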
Hadoop: the Big Answer to the Big Questions of the Big Data!
Does it provide more useful information? For example, a major retailer might implement a digital video system throughout its stores, not only to monitor theft but also to feed a Big Data system that analyzes the flow of shoppers through the store, including demographic information such as gender and age, at different times of the day, week, and year. It could also compare flows in different regions with core customer demographics. This makes it easier for the retailer to tune layouts and promotion spaces on a store-by-store basis.
Does it improve the fidelity of the information? For example, IDC spoke to several earth science and medical epidemiological research teams using Big Data systems to monitor and assess the quality of data being collected from remote sensor systems; they are using Big Data not just to look for patterns but to identify and eliminate false data caused by malfunctions, user error, or temporary environmental anomalies.
Does it improve the timeliness of the response? For example, several private and government energy providers and financial organizations around the world are deploying Big Data systems to reduce the time to detect usage patterns or insurance fraud from months to days.
While it is easy to see that Hadoop is a powerful suite of tools, it is important to evaluate where it might be useful and which kinds of tasks would benefit from it.
IBM InfoSphere BigInsights builds on open source Apache Hadoop with unique IBM innovations, including a sophisticated text analytics module, IBM BigSheets for data exploration, and a variety of performance, reliability, security and administrative features. BigInsights is able to ingest and analyze data in its native format, without imposing a schema or structure, enabling fast ad-hoc analysis.
Computing over large amounts of data is one aspect of big data; crunching unstructured data to get insights provides even more value to the enterprise. Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business. Imagine if you had a way to analyze that data. The IBM BigInsights Text Analytics component makes information extraction more scalable and easier to use. The component is built around AQL, a declarative rule language with a familiar SQL-like syntax. AQL replaces the multiple obscure languages typically used to build annotators. Because AQL is a declarative language, rule developers can focus on what to extract and leave its cost-based optimizer to determine the most efficient execution plan for the annotator.
BigSheets is another component of BigInsights Enterprise Edition: an insight engine for line-of-business professionals that lets you get insights from web-scale data (really large data sets). The name comes from the idea that users can apply a "spreadsheet metaphor" in a browser to analyze large sets of data. In essence, it provides a big data worksheet, hence "BigSheets". By building on top of the Hadoop infrastructure, BigSheets is able to process large amounts of data quickly and efficiently.
The big data challenge has three dimensions: volume, variety and velocity. With due diligence, we can identify the aspects of an application that are best crunched by Hadoop. Hadoop is not going to replace your database, but your database isn’t likely to replace Hadoop either. It is the combination of internal and external, and structured and unstructured, data from unrelated sources that has the potential to truly revolutionize the industry.
Big data is more than a challenge; it is an opportunity to find insight in new and emerging types of data, to make your business more agile, and to answer questions that, in the past, were beyond reach. Until now, there was no practical way to harvest this opportunity. Today, IBM’s platform for big data uses technologies like Hadoop to open the door to a world of possibilities.
Forrester Research, Inc. views Hadoop as "the open source heart of Big Data", regarding it as "the nucleus of the next-generation EDW [enterprise data warehouse] in the cloud," and has published its first-ever report on the subject, The Forrester Wave (tm): Enterprise Hadoop Solutions (February 2, 2012). The report evaluates 13 vendors against 15 criteria and places IBM in the Leaders category.
Villars, R. L., Olofson, C. W., and Eastwood, M., “Big Data: What It Is and Why You Should Care”, IDC white paper, June 2011.
The Forrester Wave (tm): Enterprise Hadoop Solutions report (February 2, 2012)