Business intelligence on the cheap with Apache Hadoop and Dojo, Part 1: Crunch your existing data using Apache Hadoop

Feed a web-based reporting application

Understanding your business is always important. Your company can be as agile as you want it to be, but if you do not know the right moves to make, you are driving with your eyes closed. Business intelligence solutions can be prohibitively expensive, and they often require you to retrofit your data to work with their systems. Open source technologies, however, make it easier than ever to create your own business intelligence reports. In this article, the first of a two-part series, learn how to crunch your existing data using Apache Hadoop and turn it into data that can be easily fed to a web-based reporting application.

Michael Galpin, Software architect, eBay

Michael Galpin is an architect at eBay and a frequent contributor to developerWorks. He has spoken at various technical conferences, including JavaOne, EclipseCon, and AjaxWorld. To get a preview of what he is working on, follow @michaelg on Twitter.



17 August 2010

Prerequisites

In this article you will use Apache Hadoop to process large amounts of unwieldy data. For this article, Apache Hadoop 0.20 was used. To use Hadoop, you will need a Java development kit, and this article used JDK 1.6.0_20. Hadoop has some prerequisites of its own, such as having SSH and RSYNC installed. For its full list of requirements, please consult the Hadoop documentation. See Resources for links to these tools.

Apache Hadoop and business intelligence

No matter what kind of business you have, the importance of understanding your customers and how they interact with your software cannot be overstated. For start-ups and young companies, you need to understand what is working and what is not working so that you can iterate quickly and respond to your customers. This is true for more established companies as well, though for them it might be more of a matter of fine tuning the business, or testing new ideas. Whatever the case, there are several things that you will need to do to understand the behavior of your users.

Start by understanding your users

Sometimes what you need to understand about your users is completely obvious; at other times it can be the most difficult thing to figure out. It might be a behavior, such as which parts of your web pages users click on. Or you might need to know whether users respond better to more or less of the same type of data on a page, such as the number of search results shown or the amount of detail displayed about an item. These aren't the only things that could be important to understand about your users, though. Maybe you would like to know where they are located. In this article, you will examine a case where you want to know what kind of web browser your users are using. If nothing else, this kind of information can be very useful to engineering teams, as it lets them optimize their development efforts.

Emit the necessary data

Once you know what you want to measure and analyze, you need to make sure your application is emitting the data needed to quantify your users' behavior and information. You may be lucky and already be emitting this data. For example, perhaps everything you need is already being recorded to a database as part of some kind of transaction in your system, or perhaps it is already being written to an application or system log. Often neither will be the case, and you will need to modify either your system configuration or your application to log or record the information you need.

For highly interactive web applications, this could even include writing JavaScript to make Ajax calls or to dynamically drop beacons (typically 1x1 images with event-specific data appended to the query string of their URLs) that capture how the user is interacting with your web application. In the example in this article, you want to capture the user agents of the browsers that are being used to access your web application. You might already be capturing this information. If not, it is usually straightforward to start. For example, if you are using the Apache web server, you just need to add a couple of lines to your httpd.conf file. Listing 1 shows a possible example of this.

Listing 1. Logging user agents with Apache web server
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" common
CustomLog /dev/logs/apache_access_log common

The configuration in Listing 1 is a fairly typical one for Apache. It logs a lot of useful information, like the user's IP address, the resource being requested (page, image, and so on), the referer (the page the user was on when he or she sent the request), and of course the user agent of his or her browser. You can tweak the LogFormat to log more or less.

Note: Emitting the necessary data can seem straightforward, but there are pitfalls. For example, if you need to capture a lot of user behavior, then you could wind up adding a lot of logging statements to your application. This can have some undesirable side effects. It may increase the amount of time to handle requests. For a web application, that means pages that are slower and less responsive, which is never a good thing. It may also mean that you need more application servers to handle the same number of requests, so there is an obvious cost here. You may want to consider using a logging system that is non-blocking, so that your end user never waits for you to log data. You may also want to only do detailed logging on a random fraction of requests and extrapolate from there. The increased logging may also require significantly more storage space for those logs. You may want to invest in separate hardware for logging. Again this is an obvious cost, but the added insight into your business brings significant value.
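To make the sampling and non-blocking ideas concrete, here is a minimal, hypothetical sketch of a servlet filter that logs the User-Agent header for roughly one percent of requests and performs the actual write on a background thread. The class name, sample rate, and use of java.util.logging are illustrative assumptions, not part of the article's sample code:

// Hypothetical example: sample and log the User-Agent header asynchronously.
// Class name and sample rate are illustrative; adapt to your own application.
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.logging.Logger;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class SampledUserAgentFilter implements Filter {
    private static final Logger LOG = Logger.getLogger("useragent");
    private static final double SAMPLE_RATE = 0.01; // log roughly 1% of requests
    private ExecutorService executor;

    public void init(FilterConfig config) {
        executor = Executors.newSingleThreadExecutor();
    }

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        if (req instanceof HttpServletRequest && Math.random() < SAMPLE_RATE) {
            final String ua = ((HttpServletRequest) req).getHeader("User-Agent");
            // The request thread never waits on the log write.
            executor.submit(new Runnable() {
                public void run() {
                    LOG.info("user-agent: " + ua);
                }
            });
        }
        chain.doFilter(req, resp);
    }

    public void destroy() {
        executor.shutdown();
    }
}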

Crunch the data

Just getting all of the data persisted somewhere useful is the relatively easy step in this process. Now you need to process this data and turn it into something more useful. To get a good idea of how non-trivial this has traditionally been, just do a web search for business intelligence. Not only are there a lot of results, but there are a lot of paid results. There are many very expensive software packages out there that are designed to help you with this step (and the last step, reporting).

Even these costly software suites usually require integration efforts that can be quite complex (of course, the vendors usually have a professional services division that can help you with this; just keep your checkbook out). This is where Hadoop comes in. It is well suited to crunching large amounts of data: just grab a terabyte of log files and a cluster of commodity servers, and get ready to roll. This step is the focus of this article. In this example, you will write a Hadoop map/reduce job that turns Apache web server access log files into a small data set of useful information that can be easily consumed by a web application that creates interactive reports. This brings us to the last step of the process.

Creating reports

Once you have crunched all of the data, you may be left with very valuable information in a concise format, but it's probably sitting in a database or maybe as an XML file on a filer somewhere. That may be fine for you, but chances are that this data needs to be put into the hands of business analysts and executives in your company. They are going to expect a report of some sort that will be both interactive and visually appealing. If your report enables its end users to slice and dice the data in different ways, then not only will this be more useful to users, but it means that you won't have to go back and create yet another report that is based on the same data. Of course, everyone loves eye candy as well. In Part 2 of this series, I will show you how you can use the Dojo toolkit to do this. For now, we will concentrate on crunching data with Hadoop. Let's start processing those Apache logs.


Analyzing access logs

Hadoop is based on the map/reduce paradigm. The idea is to take some unwieldy set of data, transform it into just the data that you are interested in (this is the map step), and then aggregate the result (this is the reduce step). In this example you want to start with Apache access logs and turn them into a dataset that just contains how many requests you are getting from the various browsers. So the input to the process is a log file (or maybe multiple log files). Listing 2 shows a sampling of what the content might look like.

Listing 2. Sample input content
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/pro_timeEdition.jpg 
HTTP/1.1" 304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Windows; U;
Windows NT 6.1; it; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 ( .NET CLR 3.5.30729)"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/welogo.gif HTTP/1.1" 304 -
 "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; Intel Mac 
OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.458.1 Safari/534.3"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/madeonamac.gif HTTP/1.1" 
304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/4.0 (compatible; MSIE 
8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; 
MS-RTC LM 8)"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/bullet.gif HTTP/1.1" 304 -
 "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; Intel 
 Mac OS X 10_6_3; en-us) AppleWebKit/534.1+ (KHTML, like Gecko) Version/5.0 
 Safari/533.16"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/valid-xhtml10.png HTTP/1.1"
304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; PPC 
Mac OS X 10.5; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4 GTB7.0"

This content could be in a single, very large log file, or there could be multiple log files that are the input to the system. Hadoop will run the map part on chunks of this data. The result will be an intermediate set of files that contain the output of the map step and are typically much smaller. Listing 3 shows this sample output.

Listing 3. Intermediate output created by map step
FIREFOX 1
CHROME 1
IE 1
SAFARI 1
FIREFOX 1

As you can tell from Listing 3, the sample greatly simplifies the problem by only worrying about four flavors of browser: Microsoft Internet Explorer®, Mozilla Firefox®, Apple Safari®, and Google Chrome®. It doesn't even worry about browser versions (don't do this at home, especially with Internet Explorer). The reduce step is run on these files, aggregating the results into the final output. Listing 4 shows what the final output will look like.

Listing 4. Final output created by reduce step
{"IE":"8678891",
"FIREFOX":"4677201",
"CHROME":"2011558",
"SAFARI":"733549"}

The data shown in Listing 4 has not just been aggregated by the reduce step; it has also been formatted as JSON. This makes it easy for your web-based reporting application to consume. Of course, you could output rawer data instead and have a server-side application access it and provide it to your web application. Neither Hadoop nor Dojo places any restrictions or requirements on this step.
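For example, if you choose to have a server-side component hand the crunched data to the browser, something as simple as the following hypothetical servlet would do. The file path and class name here are illustrative assumptions, not part of the article's sample code:

// Hypothetical servlet that streams the JSON produced by the Hadoop job.
// REPORT_FILE is an assumption; point it at wherever your job writes its output.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BrowserReportServlet extends HttpServlet {
    private static final String REPORT_FILE = "/data/reports/browser-usage.json";

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("application/json");
        InputStream in = new FileInputStream(new File(REPORT_FILE));
        try {
            // Copy the report file straight through to the response.
            OutputStream out = resp.getOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            in.close();
        }
    }
}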

I hope this all seems simple and straightforward. Hadoop, however, is much more than just a framework that encapsulates the map/reduce paradigm. It implements this paradigm on a distributed basis. Those log files can be huge, but the work will be split up among the machines (nodes) in your Hadoop cluster. These clusters typically scale horizontally, so if you want to process more data or process it faster, just add more machines. Configuring your cluster and optimizing your configuration (like how to split the data so that it can be sent to the various machines) is a large topic on its own. We won't go into detail on this topic in this article, but will instead focus on the map, reduce, and formatting output parts. Let's start with the map phase.

Step 1: The map phase

The Hadoop runtime will split up the data (log files) that needs to be processed and give each node in your cluster a chunk of data. This data needs to have the map function applied to it. Listing 5 shows how you specify the map function for your log analyzer sample.

Listing 5. The Mapper for the access log sample
public class LogMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    // Matches the quoted fields of the combined log format; group 6 is the user agent.
    private static final Pattern regex =
            Pattern.compile("(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"");
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = regex.matcher(value.toString());
        if (m.matches()){
            String ua = m.group(6).toLowerCase();
            Agents agent = IE; // default; the Agents enum constants are statically imported
            // Check Chrome before Safari: Chrome's user agent string also contains "safari".
            if (ua.contains("chrome")){
                agent = CHROME;
            } else if (ua.contains("safari")){
                agent = SAFARI;
            } else if (ua.contains("firefox")){
                agent = FIREFOX;
            }
            Text agentStr = new Text(agent.name());
            context.write(agentStr, one);
        }
    }
}

Hadoop makes extensive use of Java™ generics to provide a type-safe way to write the map and reduce functions. Listing 5 is an example of this. Notice that you extend the Hadoop framework class org.apache.hadoop.mapreduce.Mapper. This class takes a (key,value) pair as an input and maps that input to a new (key,value) pair. It is parameterized based on the types of the input and output (key,value) pairs. So in this case it is accepting a (key,value) pair of type (Object, Text) and mapping it to a (key,value) pair of type (Text,IntWritable). There is a single method to implement called map (which is parameterized based on the types of the (key,value) pair that are the input).

For the sample in Listing 5, a single line from the log file (as seen in Listing 2) will be the input Text object that is passed into the map method. You run a regular expression against it to extract the user agent string. Then you use string matching to map the user agent string to an Enum called Agents. Finally, this data is written to the Context object (whose write method is parameterized based on the types of the (key,value) pair that are produced by map). That write statement will produce a single line like the ones in Listing 3. Now you are ready for the reduce phase.
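The Agents enum itself never appears in the listings (it is presumably in the downloadable source). A minimal version, assuming its constants are statically imported into LogMapper so that the bare IE, CHROME, SAFARI, and FIREFOX references in Listing 5 resolve, might look like this:

// Minimal sketch of the Agents enum referenced in Listing 5; the version in the
// article's download bundle may differ. Its constants are assumed to be
// statically imported into LogMapper.
public enum Agents {
    IE,      // also used as the default when no other match is found
    FIREFOX,
    CHROME,
    SAFARI
}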

Step 2: The reduce phase

The Hadoop framework will take the output of the map phase and write it to intermediate files that will serve as the input for the reduce phase. This is where the data is aggregated into something more useful. Listing 6 shows how you specify the reduce function.

Listing 6. The Reducer for the access log sample
public class AgentSumReducer extends 
    Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, 
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

This time you are extending the org.apache.hadoop.mapreduce.Reducer class and implementing its reduce method. As you can see in Listing 6, generics are once again leveraged to make all of this code type safe. However, you must make sure that the types produced by the Mapper are the same types consumed by the Reducer. The reduce method is passed a key along with an Iterable of all the values the Mapper emitted for that key; the Iterable's element type matches the Mapper's output value type. All you are doing here is adding up those values. Then you write the key and its total to the Context object, just as you did inside the Mapper. Now you are ready to format the output.

Step 3: Formatting the output

This step is actually optional. You could just use the raw output from Hadoop (a key and value on each line, separated by a tab). You know, however, that you want to make this data available to the web application for reporting, so you are going to format it as JSON, as you saw in Listing 4. In Listing 7, you can see how this is specified.

Listing 7. The OutputFormat for the access log sample
public class JsonOutputFormat extends TextOutputFormat<Text, IntWritable> {
    @Override
    public RecordWriter<Text, IntWritable> getRecordWriter(
            TaskAttemptContext context) throws IOException,
                  InterruptedException {
        Configuration conf = context.getConfiguration();
        Path path = getOutputPath(context);
        FileSystem fs = path.getFileSystem(conf);
        // Write the JSON to a file in the output directory named after the job.
        FSDataOutputStream out =
                fs.create(new Path(path, context.getJobName()));
        return new JsonRecordWriter(out);
    }

    private static class JsonRecordWriter extends
          LineRecordWriter<Text,IntWritable> {
        boolean firstRecord = true;

        public JsonRecordWriter(DataOutputStream out)
                throws IOException {
            super(out);
            out.writeBytes("{");   // open the JSON object
        }

        @Override
        public synchronized void write(Text key, IntWritable value)
                throws IOException {
            if (!firstRecord){
                out.writeBytes(",\r\n");   // separate every record after the first
            }
            firstRecord = false;
            out.writeBytes("\"" + key.toString() + "\":\"" +
                    value.toString() + "\"");
        }

        @Override
        public synchronized void close(TaskAttemptContext context)
                throws IOException {
            out.writeBytes("}");   // close the JSON object
            super.close(context);  // closes the underlying stream
        }
    }
}

The code in Listing 7 might look a little complicated, but it's actually straightforward. You are extending the class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat, a utility class for outputting data as text (note the use of generics once again; make sure they match the types of the (key,value) pairs being produced by your Reducer class). The only thing you need to do is implement the getRecordWriter method, which returns an org.apache.hadoop.mapreduce.RecordWriter (parameterized again) instance. You are returning an instance of JsonRecordWriter, an inner class that takes each aggregated (key,value) pair produced by the Reducer and writes it as one entry of the JSON object shown in Listing 4. This produces the JSON data that will now be ready to be consumed by the Dojo-based code in the reporting application.
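One piece the listings do not show is the driver that wires the Mapper, Reducer, and OutputFormat into a job and runs it. A minimal sketch of such a driver, using the standard Hadoop 0.20 Job API (the class name and the use of command-line arguments for the input and output paths are assumptions for illustration), might look like this:

// Hypothetical driver class that wires the Mapper, Reducer, and OutputFormat together.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AccessLogAnalyzer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The JsonOutputFormat in Listing 7 names its output file after the job name.
        Job job = new Job(conf, "access-log-analyzer");
        job.setJarByClass(AccessLogAnalyzer.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(AgentSumReducer.class);
        job.setOutputFormatClass(JsonOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the directory of access logs; args[1] is where the JSON goes.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would typically package these classes into a JAR and launch the job with the hadoop jar command, passing the directory of access logs and an output directory as the two arguments.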

Stay tuned for part two of this series to see how to make interactive reports using Dojo that consume the intelligent data that has been produced here by Hadoop.


Conclusion

This article has shown you a simple example of crunching big data with Hadoop. It is an easy way to process large amounts of data that can provide valuable insight into your business. In this article, you used Hadoop's map/reduce capabilities directly. If you start using Hadoop for many use cases, you will definitely want to look at higher-level frameworks that are built on top of Hadoop and make it easier to write map/reduce jobs. Two such excellent open source frameworks are Pig and Hive, both of which are also Apache projects. Both use a more declarative syntax and require much less programming. Pig is a data-flow language developed by members of the core Hadoop team at Yahoo®, while Hive is more SQL-like and was developed at Facebook. You can use either of these, or plain map/reduce jobs, to produce output that can be consumed by a web application.


Download

Description: Article source code
Name: AccessLogAnalyzer.zip
Size: 6KB

Resources

Get products and technologies

  • Download IBM InfoSphere BigInsights Basic Edition at no charge and build a solution that turns large, complex volumes of data into insight by combining Apache Hadoop with unique technologies and capabilities from IBM.
  • Get Apache Hadoop. This article used version 0.20.
  • Download the Dojo toolkit.
  • Get the Java SDK. JDK 1.6.0_20 was used in this article.
  • Take a look at the Pig and Hive frameworks that are built on top of Hadoop.
  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
