Big Data For Dummies
JeanFrancoisPuget 2700028FGP Visits (13025)
Seems like I'm on a series on buzzwords. Today's pick is Big Data. Not bad for a buzzword isn't it? I'll even add to the mix other buzzwords such as NoSQL and Hadoop/MapReduce. I won't review all big data, but I will focus on few things that most people take for granted. I will also discuss an interesting computer science theorem (well, interesting to me at least).
Why am I writing about it? For two reasons mainly. First, Big Data and Analytics are converging. See for instance Big
There are probably as many definitions of Big Data as there are experts on the topic, but here is one thing most agree on. We speak of big data when the amount of data that an application deals with exceeds a petabyte, ie one million of gigabytes. This is huge, more than 60 times the US national debt (in dollars) if that makes sense to you. But big data is much more than merely amassing vast amount of data. In fact people have settled on three use cases for big data (they are not mutually exclusive):
As we can see, the aim of big data is to harness all sorts of data. As Paul Taylor, a colleague of mine, puts it, big data is about all your data. This is a significant extension to the corporate data that was mainly stored in relational databases until recently. Handling all kind of data spurred the development of new IT technologies and tools, which is what big data refers to.
A quite simple idea led to the best known tool in this area, namely Hadoop. Instead of bringing data to the computer system that will analyze it, Hadoop sends the computation to where the data is. Said differently, Hadoop is a tool to distribute computation across a distributed system storing the data to be analyzed. Hadoop also contains the distributed file system that you can use to store the data. Hadoop is an open
From a database perspective, Big Data means moving away from relational databases. This is captured by the NoSQL (Not Only SQL) acronym for databases.
SQL refers to relational databases. These databases where developed to ensure consistency of data through well defined transactions. More formally, the set of properties that guarantee that database transactions are processed reliably is know as ACID (Atomicity, Consistency, Isolation, Durability) in computer science.
The advent of the web introduced the need for distributed data bases at a large scale. Think of the data stored for Google search, Amazon, Ebay, or Facebook. Given the number of requests these sites get, one needs to replicate the data on many servers, then distribute queries to these servers. Can we then guarantee that the data stays consistent between all these servers while at the same time ensuring 24x7 responses to queries? The answer is that we cannot.
This was first conjectured by Eric Brewer in 2000, then proved formally by Seth Gilbert and Nancy Lynch of MIT. Brewer stated the problem as follows.
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
This became known as the CAP theorem. Since any distributed system can only have two of the three features, looking at which pair is supported by a given system is a nice way to analyze it. For instance, a relational database, even a distributed one, will ensure consistency and availability, but isn't robust vis a vis partition fault. This post by Nathan Hurst provides a classification based on CAP of current NoSQL systems.
The CAP theorem is kind of fascinating, because it captures the very reasons why the distributed systems that power massive webscale systems such as Google, Amazon, Ebay, or social networks work in practice. This is well documented in The
There are many more tools available in big data beyond Hadoop and NoSQL databases. Readers willing to know more about these and other big data tools can go to the Big Data University site.
Edited on May 10 to add a link to Eric Brewer's article, found via Florian Bahr