These days, every tech conference you attend and every tech article you read bombards you with the term "Big Data". Whether you are a business analyst or a developer, it is important to get your fundamentals right: to be able to evaluate whether you have a big data problem to solve in the first place, to know how to go about solving it, and to understand the core terminology of the underlying technologies. To this end, you could start with this free ebook from IBM, "Understanding Big Data". The book gives you a good basic introduction to Big Data from both the business angle and the technical angle.
At the outset, there is an interesting slice of history in the executive letter from Anjul Bhambri, which traces the evolution of data management technologies from the beginnings of SQL, through warehouses, data marts and XML, to the now commonplace big data problems and solutions. Later in the book, there is also a quick look back at the innovations that IBM has brought to the industry, from the first magnetic hard disk in 1956 to the Scalable Parallel Systems in 1993 to Watson's Question Answering technology in 2011.
Part I is for the business reader and covers Big Data use cases such as:
- Log Analytics (how IBM has helped a client analyze 1 TB of log data each day with less than 5 minutes of latency),
- Fraud Detection (how a modern-day fraud detection ecosystem was worked out for a large credit card issuer, reducing its processing time from 3 weeks to a couple of hours) and
- Sentiment Analysis (how Cognos Consumer Insights, a point solution that runs on BigInsights, has helped a client discover and understand the sentiment surrounding its new product launch). There is also a personal "sentiment"-related anecdote.
The book also compares data stored in a warehouse with data stored in Hadoop and sets the context for each. This comparison is at the core of understanding the big data problem and is the foundation for understanding all the other related technologies.
Part II is for the technical reader. The first chapter introduces Hadoop, HDFS, the concepts of MapReduce and the technologies of the surrounding ecosystem: Pig, Hive, Jaql, Flume, HBase, Oozie, Lucene and Avro.
The next chapter deals in depth with some of the key differentiating features of the enterprise edition of IBM BigInsights, such as:
- Integrated graphical installation, configuration and administration
- Security
- Text Analytics
- GPFS-SNC - how you gain performance improvements and high availability when you store data here
- IBM LZO Compression - "how you can compress your data with a high-performance algorithm"
- Improved Workload scheduling
- Adaptive MapReduce
- Large scale indexing
- Data discovery and visualization using BigSheets for the business user.
And a peek into the upcoming features
- Machine Learning Analytics
The last chapters deal with IBM InfoSphere Streams and cover analytics in motion: analyzing data in real time with very low latency. Several customer use cases are discussed, from industries such as Financial Services, Health and Life Sciences (University of Ontario is working on a smarter neonatal critical care center), Telecommunications (Globe Telecom) and Defence (the TerraEchos use case).
To summarize, in the book's own words: "IBM's goal here isn't to get you a running Hadoop cluster - that's something we do along the path; rather it's to give you a new way to gain insight into vast amounts of data that you haven't easily been able to tap into before; that is, until a technology like Hadoop got teamed with an analytics leader like IBM." You can start reading from whichever chapter interests you. An end-to-end read may take a couple of sittings, but by the end you will have a broader picture of what constitutes a Big Data problem and its solution. Are there other books that are good for beginners who want to start reading about Big Data technologies? Do let us know.