Amazon Web Services
Topic: Data warehousing, scripting, flow management, application development
Environment/software: EC2, Amazon Elastic MapReduce, Apache Hive, Hadoop, Ruby, Python, Amazon S3, Java
Level/type: Intermediate to advanced/technical
I recently discovered a tool from IBM's jStart group of emerging-technology specialists, designed to help users implement what the group likes to call DIY (do-it-yourself) analytics -- tools and technologies that let anyone mine data without a lot of coding or overhead.
The tool that impressed me was called BigSheets, an extension of the mashup paradigm that:
- Allows you to collect and integrate tons of structured and unstructured data (petabytes of it).
- Lets you use your favorite unstructured information management architecture to prep all that disparate data into a usable form.
- Then lets you define the context in which you want to slice-and-dice and visualize that data.
In other words, BigSheets brings the tasks of collecting, converting, and parsing Big Data into Big Business Knowledge to the everyday user.
So, with turning Big Data into Big Knowledge on my mind, let me highlight two instructional gems from Amazon Web Services.
Getting started with Hive on Amazon Elastic MapReduce (video tutorial)
MapReduce is a software framework for processing very large data sets by splitting distributable problems across many connected computers. Its advantage in the cloud is that it lets the user perform distributed processing of the Map function (which chops the input into smaller sub-problems and distributes them to nodes for processing) and the Reduce function (which recombines the solved sub-problems into the final answer).
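The Map and Reduce functions described above can be sketched in miniature in plain Python -- a word count, the canonical MapReduce example (the sample input is made up for illustration):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: chop the input into key/value pairs -- one (word, 1) per word.
    for word in line.split():
        yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: recombine the per-word sub-results into the final counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big knowledge", "big sheets"]
mapped = [pair for line in lines for pair in mapper(line)]
print(reducer(mapped))  # {'big': 3, 'data': 1, 'knowledge': 1, 'sheets': 1}
```

In a real cluster the mapped pairs are shuffled across many machines before reduction; the logic, however, stays exactly this simple.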
Amazon Elastic MapReduce is a web service that lets anyone easily process vast amounts of data. The service uses a hosted Hadoop framework running on EC2 and Amazon S3. And because it is a cloud service, you can instantly provision the capacity you need for web indexing, data mining, or any other Big Data analysis.
Hive is an Apache data warehouse infrastructure built atop Hadoop that gives the user tools to quickly summarize data and perform ad hoc querying and analysis of large data sets. It provides a "structured" container for your data and a simple, SQL-like query language called Hive QL. In your data-mining tasks, Hive takes care of the simple analyses; MapReduce plugs in to enable more sophisticated analysis.
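To make the division of labor concrete: a Hive QL aggregation (the table and column names below are hypothetical) is compiled by Hive into MapReduce jobs, and the computation it performs is equivalent to an ordinary group-and-count, sketched here in Python:

```python
# An indicative Hive QL query (hypothetical table and columns):
#   SELECT page, COUNT(*) FROM access_log GROUP BY page;
# The aggregation Hive runs for a query like this boils down to:
from collections import Counter

access_log = [  # stand-in rows for a Hive table
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]
hits_per_page = Counter(row["page"] for row in access_log)
print(hits_per_page)  # Counter({'/home': 2, '/docs': 1})
```

The point of Hive is that you write the one-line query and it generates the distributed equivalent of this loop for you.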
This video demonstrates how to use Apache Hive to operate a data warehouse with Amazon Elastic MapReduce. It also walks through developing a Hive script using an interactive job flow, deploying that script to Amazon S3, and running job flows to execute the script in batch mode.
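For a sense of what the batch-mode run looks like programmatically, here is a hedged sketch of the job-flow request using boto3 (which postdates the video; the bucket, script name, release label, and instance sizes are all assumptions for illustration):

```python
# Sketch: submitting a Hive script stored in S3 as an Elastic MapReduce
# job flow step. All names below are hypothetical.
job_flow = {
    "Name": "hive-batch-run",
    "ReleaseLabel": "emr-5.36.0",        # assumed EMR release
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,              # provision only the capacity you need
    },
    "Steps": [{
        "Name": "run-hive-script",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/scripts/report.q"],
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
# import boto3
# boto3.client("emr").run_job_flow(**job_flow)  # submits the job flow
print(job_flow["Steps"][0]["Name"])  # run-hive-script
```

The shape mirrors the video's workflow: the script lives in S3, and the job flow definition tells the cluster which step to execute in batch mode.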
- Familiarizes developers with two tools they can use to implement large-scale data mining on a cloud.
- Provides real-world examples of how to deploy these tools.
How to Create and Debug an Amazon Elastic MapReduce Job Flow (tutorial)
Learn to use Elastic MapReduce to develop, debug, and run job flows composed of multiple steps. Job flows are user-defined tasks that run as multiple instances on EC2 and that Elastic MapReduce coordinates; each job flow can consist of one or more steps (such as a MapReduce algorithm implemented as a Java program).
This is excellent instruction on how to craft mapping and reduction algorithms for job flows. The article also shows you how to install Ruby and the Elastic MapReduce command line interface, and how to configure credentials (your credentials are used to calculate the signature value for every request you make).
- Introduces developers to the concepts needed for building mapping and reduction algorithms.
- Delivers instructions on aggregating MapReduce algorithms into stepped job flows.
- Demonstrates how Elastic MapReduce works to coordinate job flow steps.
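The mapper and reducer that a job flow step runs follow a fixed contract: the mapper emits tab-separated key/value lines, Hadoop sorts them by key between phases, and the reducer sums each key's run of values. A minimal Python pair in that style (function names and sample input are illustrative, not from the tutorial):

```python
from itertools import groupby

def map_records(lines):
    # Mapper: emit one tab-separated "word\t1" line per word, the format
    # Hadoop streaming expects on stdout.
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reduce_records(sorted_pairs):
    # Reducer: the framework delivers mapper output sorted by key, so
    # consecutive lines with the same word can be summed with groupby.
    for word, group in groupby(sorted_pairs, key=lambda p: p.split("\t")[0]):
        total = sum(int(p.split("\t")[1]) for p in group)
        yield word + "\t" + str(total)

sample = ["big data big knowledge", "data"]
mapped = sorted(map_records(sample))  # Hadoop sorts between the two phases
for line in reduce_records(mapped):
    print(line)
```

In a real step the two functions live in separate scripts reading standard input on different machines; the sort between them is what Elastic MapReduce coordinates for you.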
- The original video: "Getting started with Hive on Amazon Elastic MapReduce."
- The original article: "How to Create and Debug an Amazon Elastic MapReduce Job Flow".
- Home of Apache Hive.
- More about Amazon Elastic MapReduce.
- Elastic MapReduce Getting Started Guide | Developer Guide | API Reference | Technical FAQ.
- For more on MapReduce functionality, try these articles: "Distributed data processing with Hadoop: Getting started" and "Distributed data processing with Hadoop: Going further."
- The "Data mining with WEKA" series also covers the use of MapReduce in do-it-yourself analytics.