Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Business intelligence on the cheap with Apache Hadoop and Dojo, Part 1: Crunch your existing data using Apache Hadoop

Feed a web-based reporting application

Michael Galpin, Software architect, eBay
Michael_Galpin
Michael Galpin is an architect at eBay and is a frequent contributor to developerWorks. He has spoken at various technical conferences, including JavaOne, EclipseCon and AjaxWorld. To get a preview of what he is working on, follow @michaelg on Twitter.

Summary:  Understanding your business is always important. Your company can be as agile as you want it to be, but if you do not know the right moves to make, you are driving with your eyes closed. Business intelligence solutions can be prohibitively expensive, and they often require you to retrofit your data to work with their systems. Open source technologies, however, make it easier than ever to create your own business intelligence reports. In this article, the first of a two-part series, learn how to crunch your existing data using Apache Hadoop and turn it into data that can be easily fed to a web-based reporting application.

View more content in this series

Date:  17 Aug 2010
Level:  Intermediate PDF:  A4 and Letter (232 KB | 11 pages)Get Adobe® Reader®
Also available in:   Chinese  Korean  Russian  Japanese

Activity:  17416 views
Comments:  

Prerequisites

In this article you will use Apache Hadoop to process large amounts of unwieldy data. For this article, Apache Hadoop 0.20 was used. To use Hadoop, you will need a Java development kit, and this article used JDK 1.6.0_20. Hadoop has some prerequisites of its own, such as having SSH and RSYNC installed. For its full list of requirements, please consult the Hadoop documentation. See Resources for links to these tools.

Apache Hadoop and business intelligence

No matter what kind of business you have, the importance of understanding your customers and how they interact with your software cannot be overstated. For start-ups and young companies, you need to understand what is working and what is not working so that you can iterate quickly and respond to your customers. This is true for more established companies as well, though for them it might be more of a matter of fine tuning the business, or testing new ideas. Whatever the case, there are several things that you will need to do to understand the behavior of your users.

Start by understanding your users

Develop skills on this topic

This content is part of a progressive knowledge path for advancing your skills. See Using NoSQL and analyzing big data

Sometimes understanding your users is completely obvious. At other times this can be the most difficult thing to figure out. This can be behavior, like what part of your web pages your users are clicking on. Or you might need to know if your users respond better to having more or less of the same type of data on a page. This could be the number of search results on a page or the amount of detail to show about an item. These aren't the only things that could be important to understand about your users, though. Maybe you would like to know what their location is. In this article, you will examine a case where you want to know what kind of web browser your users are using. If nothing else, this kind of information can be very useful to engineering teams, as it lets them optimize their development efforts.

Emit the necessary data

Once you know what you want to measure and analyze, you need to make sure your application is emitting the data needed to quantify your user's behavior and/or information. Now, you may be lucky and you are already emitting this data. For example, perhaps everything you need is already being emitted as part of some kind of transaction in your system and is being recorded to a database. Or perhaps it is already being written to an application or the system logs. Often neither of these will be the case, and you will need to either modify your system configuration or modify your application to log or record the information you need somehow.

For highly interactive web applications this could even include writing JavaScript to make Ajax calls or dynamically dropping beacons (typically 1x1 images with some event specific added to the query string of its URL) to capture how the user is interacting with your web application. In the example in this article, you want to capture the user agents of the browsers that are being used to access your web application. You might already be capturing this information. If not, then it is usually straightforward. For example, if you are using the Apache web server, then you just need to add a couple of lines to your httpd.conf file. Listing 1 shows a possible example of this.


Listing 1. Logging user agents with Apache web server

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" common
CustomLog /dev/logs/apache_access_log common

The configuration in Listing 1 is a fairly typical one for Apache. It logs a lot of useful information, like the user's IP address, the resource being requested (page, image, and so on), the referer (the page the user was on when he or she sent the request), and of course the user agent of his or her browser. You can tweak the LogFormat to log more or less.

Note: Emitting the necessary data can seem straightforward, but there are pitfalls. For example, if you need to capture a lot of user behavior, then you could wind up adding a lot of logging statements to your application. This can have some undesirable side effects. It may increase the amount of time to handle requests. For a web application, that means pages that are slower and less responsive, which is never a good thing. It may also mean that you need more application servers to handle the same number of requests, so there is an obvious cost here. You may want to consider using a logging system that is non-blocking, so that your end user never waits for you to log data. You may also want to only do detailed logging on a random fraction of requests and extrapolate from there. The increased logging may also require significantly more storage space for those logs. You may want to invest in separate hardware for logging. Again this is an obvious cost, but the added insight into your business brings significant value.

Crunch the data

Just getting all of the data persisted somewhere useful is the relatively easy step in this process. Now you need to process this data and turn it into something more useful. To get a good idea of how non-trivial this has traditionally been, just do a web search for business intelligence. Not only are there a lot of results, but there are a lot of paid results. There are many very expensive software packages out there that are designed to help you with this step (and the last step, reporting).

Even these costly software suites usually require integration efforts that can be quite complex (of course the vendors usually have a professional services division that can help you with this, just keep your checkbook out.) This is where Hadoop comes in. It is well suited for crunching large amounts of data. Just grab a terabyte of log files, a cluster of commodity servers, and get ready to roll. This is the step that is the focus of this article. In this example, you will write a Hadoop map/reduce job to turn Apache web server access log files into a small data set of useful information that can be easily consumed by a web application that creates interactive reports. This brings us to the last step of the process.

Creating reports

Once you have crunched all of the data, you may be left with very valuable information in a concise format, but it's probably sitting in a database or maybe as an XML file on a filer somewhere. That may be fine for you, but chances are that this data needs to be put into the hands of business analysts and executives in your company. They are going to expect a report of some sort that will be both interactive and visually appealing. If your report enables its end users to slice and dice the data in different ways, then not only will this be more useful to users, but it means that you won't have to go back and create yet another report that is based on the same data. Of course, everyone loves eye candy as well. In Part 2 of this series, I will show you how you can use the Dojo toolkit to do this. For now, we will concentrate on crunching data with Hadoop. Let's start processing those Apache logs.


Analyzing access logs

Hadoop is based on the map/reduce paradigm. The idea is to take some unwieldy set of data, transform it into just the data that you are interested in (this is the map step), and then aggregate the result (this is the reduce step). In this example you want to start with Apache access logs and turn this into a dataset that just contains how many requests you are getting from the various browsers. So the input to the process is a log file (or maybe multiple log files.) Listing 2 shows a sampling of what the content might look like.


Listing 2. Sample input content

127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/pro_timeEdition.jpg 
HTTP/1.1" 304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Windows; U;
Windows NT 6.1; it; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 ( .NET CLR 3.5.30729)"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/welogo.gif HTTP/1.1" 304 -
 "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; Intel Mac 
OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.458.1 Safari/534.3"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/madeonamac.gif HTTP/1.1" 
304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/4.0 (compatible; MSIE 
8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; 
MS-RTC LM 8)"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/bullet.gif HTTP/1.1" 304 -
 "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; Intel 
 Mac OS X 10_6_3; en-us) AppleWebKit/534.1+ (KHTML, like Gecko) Version/5.0 
 Safari/533.16"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/valid-xhtml10.png HTTP/1.1"
304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; PPC 
Mac OS X 10.5; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4 GTB7.0"

This content could be in a single, very large log file, or there could be multiple log files that are the input to the system. Hadoop will run the map part on chunks of this data. The result will be an intermediate set of files that contain the output of the map step and are typically much smaller. Listing 3 shows this sample output.


Listing 3. Intermediate output created by map step

FIREFOX 1
CHROME 1
IE 1
SAFARI 1
FIREFOX 1

As you can tell from Listing 3, in the sample you will greatly simplify the problem by only worrying about four flavors of browsers: Microsoft Internet Explorer®, Mozilla Firefox®, Apple Safari®, and Google Chrome®. And you're not even worrying about the browser versions (don't do this at home, especially with Internet Explorer.) The reduce step is run on these files, aggregating the results into the final output. Listing 4 shows what the final output will look like.


Listing 4. Final output created by reduce step

{"IE":"8678891",
"FIREFOX":"4677201",
"CHROME":"2011558",
"SAFARI":"733549"}

The data shown in Listing 4 has not just been aggregated by the reduce step, but it has also been formatted into JSON. This will make it easy to use by your web-based reporting application. Of course, you can output into more raw data and have a server-side application access and provide it to your web application. Neither Hadoop nor Dojo place any restrictions or requirements on this step.

I hope this all seems simple and straightforward. Hadoop, however, is much more than just a framework that encapsulates the map/reduce paradigm. It implements this paradigm on a distributed basis. Those log files can be huge, but the work will be split up among the machines (nodes) in your Hadoop cluster. These clusters typically scale horizontally, so if you want to process more data or process it faster, just add more machines. Configuring your cluster and optimizing your configuration (like how to split the data so that it can be sent to the various machines) is a large topic on its own. We won't go into detail on this topic in this article, but will instead focus on the map, reduce, and formatting output parts. Let's start with the map phase.

Step 1: The map phase

The Hadoop runtime will split up the data (log files) that needs to be processed and give each node in your cluster a chunk of data. This data needs to have the map function applied to it. Listing 5 shows how you specify the map function for your log analyzer sample.


Listing 5. The Mapper for the access log sample

public class LogMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
      private static final Pattern regex = 
          Pattern.compile("(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"");    
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = regex.matcher(value.toString());
        if (m.matches()){
            String ua = m.group(6).toLowerCase();
            Agents agent = IE; // default
            if (ua.contains("chrome")){
                agent = CHROME;
            } else if (ua.contains("safari")){
                agent = SAFARI;
            } else if (ua.contains("firefox")){
                agent = FIREFOX;
            }
            Text agentStr = new Text(agent.name());
            context.write(agentStr, one);
        }
    }
}

Hadoop makes extensive use of Java™ generics to provide a type-safe way to write the map and reduce functions. Listing 5 is an example of this. Notice that you extend the Hadoop framework class org.apache.hadoop.mapreduce.Mapper. This class takes a (key,value) pair as an input and maps that input to a new (key,value) pair. It is parameterized based on the types of the input and output (key,value) pairs. So in this case it is accepting a (key,value) pair of type (Object, Text) and mapping it to a (key,value) pair of type (Text,IntWritable). There is a single method to implement called map (which is parameterized based on the types of the (key,value) pair that are the input).

For the sample in Listing 5, a single line from the log file (as seen in Listing 2) will be the input Text object that is passed into the map method. You run a regular expression against it to extract the user agent string. Then you use string matching to map the user agent string to an Enum called Agents. Finally, this data is written to the Context object (whose write method is parameterized based on the types of the (key,value) pair that are produced by map). That write statement will produce a single line like the ones in Listing 3. Now you are ready for the reduce phase.

Step 2: The reduce phase

The Hadoop framework will take the output of the map phase and write it to intermediate files that will serve as the input for the reduce phase. This is where the data is aggregated into something more useful. Listing 6 shows how you specify the reduce function.


Listing 6. The Reducer for the access log sample

public class AgentSumReducer extends 
    Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, 
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

This time you are extending the org.apache.hadoop.mapreduce.Reducer class and implementing its reduce method. As you can see in Listing 6, generics are once again leveraged to make all of this code type safe. However, you must make sure that types produced by the Mapper are the same types that are being consumed by the Reducer. The reduce method is passed an Iterable whose type will be the type of the (key, value) pairs that are input to it. All you are doing here is adding up those values. Then you write them to the Context object, just as you did inside the Mapper. Now you are ready to format the output.

Step 3: Formatting the output

This step is actually optional. You could just use the raw output from Hadoop (a name and value on each line, separated by a space). You know, however, that you want to make this data available to the web application for reporting, so you are going to format it as JSON, as you saw in Listing 4. In Listing 7, you can see how this is specified.


Listing 7. The OutputFormat for the access log sample

public class JsonOutputFormat extends TextOutputFormat<Text, IntWritable> {
    @Override
    public RecordWriter<Text, IntWritable> getRecordWriter(
            TaskAttemptContext context) throws IOException, 
                  InterruptedException {
        Configuration conf = context.getConfiguration();
        Path path = getOutputPath(context);
        FileSystem fs = path.getFileSystem(conf);
        FSDataOutputStream out = 
                fs.create(new Path(path,context.getJobName()));
        return new JsonRecordWriter(out);
    }

    private static class JsonRecordWriter extends 
          LineRecordWriter<Text,IntWritable>{
        boolean firstRecord = true;
        @Override
        public synchronized void close(TaskAttemptContext context)
                throws IOException {
            out.writeChar('{');
            super.close(null);
        }

        @Override
        public synchronized void write(Text key, IntWritable value)
                throws IOException {
            if (!firstRecord){
                out.writeChars(",\r\n");
                firstRecord = false;
            }
            out.writeChars("\"" + key.toString() + "\":\""+
                    value.toString()+"\"");
        }

        public JsonRecordWriter(DataOutputStream out) 
                throws IOException{
            super(out);
            out.writeChar('}');
        }
    }
}

The code in Listing 7 might look a little complicated, but it's actually straightforward. You are extending the class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat, a utility class for outputting data as text (note the use of generics once again—make sure they match the types of the (key,value) pairs being produced by your Reducer class). The only thing you need to do is implement the getRecordWriter method, which will return an org.apache.hadoop.mapreduce.RecordWriter (parameterized again) instance. You are returning an instance of JsonRecordWriter, an inner class that will get a line of data from Listing 3 and produce a line of data in Listing 4. This will produce the JSON data that will now be ready to be consumed by the Dojo-based code in the reporting application.

Stay tuned for part two of this series to see how to make interactive reports using Dojo that consume the intelligent data that has been produced here by Hadoop.


Conclusion

This article has shown you a simple example of crunching big data with Hadoop. This is an easy way to process large amounts of data that can provide you valuable insight into your business. In this article, you have used Hadoop's map/reduce capabilities directly. If you start using Hadoop for many use cases, you will definitely want to look at higher order frameworks that are built on top of Hadoop and make it easier to write map/reduce jobs. Two such excellent, open source frameworks are Pig and Hive, both of which are also Apache projects. Both use a more declarative syntax with much less programming. Pig is a data-flow language, developed by members of the core Hadoop team at Yahoo®, while Facebook is more SQL-like and was developed at Facebook. You can use either of these, or the more base map/reduce jobs, to produce output that can be consumed by a web application.



Download

DescriptionNameSizeDownload method
Article source codeAccessLogAnalyzer.zip6KBHTTP

Information about download methods


Resources

Learn

Get products and technologies

  • Get Apache Hadoop. This article used version 0.20.

  • Download the Dojo toolkit.

  • Get the Java SDK. JDK 1.6.0_17 was used in this article.

  • Take a look at the Pig and Hive frameworks that are built on top of Hadoop.

  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

About the author

Michael_Galpin

Michael Galpin is an architect at eBay and is a frequent contributor to developerWorks. He has spoken at various technical conferences, including JavaOne, EclipseCon and AjaxWorld. To get a preview of what he is working on, follow @michaelg on Twitter.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Web development
ArticleID=508387
ArticleTitle=Business intelligence on the cheap with Apache Hadoop and Dojo, Part 1: Crunch your existing data using Apache Hadoop
publish-date=08172010

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.

Special offers