Apache Hadoop and business intelligence
No matter what kind of business you have, the importance of understanding your customers and how they interact with your software cannot be overstated. If you run a start-up or a young company, you need to understand what is working and what is not so that you can iterate quickly and respond to your customers. This is true for more established companies as well, though for them it might be more a matter of fine-tuning the business or testing new ideas. Whatever the case, there are several things that you will need to do to understand the behavior of your users.
Start by understanding your users
Sometimes understanding your users is completely obvious. At other times this can be the most difficult thing to figure out. This can be behavior, like what part of your web pages your users are clicking on. Or you might need to know if your users respond better to having more or less of the same type of data on a page. This could be the number of search results on a page or the amount of detail to show about an item. These aren't the only things that could be important to understand about your users, though. Maybe you would like to know what their location is. In this article, you will examine a case where you want to know what kind of web browser your users are using. If nothing else, this kind of information can be very useful to engineering teams, as it lets them optimize their development efforts.
Once you know what you want to measure and analyze, you need to make sure your application is emitting the data needed to quantify your users' behavior or characteristics. You may be lucky and already be emitting this data. For example, perhaps everything you need is already being emitted as part of some kind of transaction in your system and is being recorded to a database. Or perhaps it is already being written to an application log or the system logs. Often neither of these will be the case, and you will need to modify either your system configuration or your application to log or record the information you need.
For highly interactive web applications, this could even include writing JavaScript to make Ajax calls or dynamically dropping beacons (typically 1x1 images with event-specific data added to the query string of the image URL) to capture how the user is interacting with your web application. In the example in this article, you want to capture the user agents of the browsers that are being used to access your web application. You might already be capturing this information. If not, it is usually straightforward to add. For example, if you are using the Apache web server, then you just need to add a couple of lines to your httpd.conf file. Listing 1 shows a possible example of this.
Listing 1. Logging user agents with Apache web server
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" common
CustomLog /dev/logs/apache_access_log common
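The configuration in Listing 1 is a fairly typical one for Apache. It logs a lot of useful information, like the user's IP address, the resource being requested (page, image, and so on), the referer (the page the user was on when he or she sent the request), and of course the user agent of his or her browser. Although the format is given the nickname common here, this particular field list, with the referer and user agent at the end, is the one the Apache documentation calls the combined log format. You can tweak the LogFormat to log more or less.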
Note: Emitting the necessary data can seem straightforward, but there are pitfalls. For example, if you need to capture a lot of user behavior, then you could wind up adding a lot of logging statements to your application. This can have some undesirable side effects. It may increase the amount of time to handle requests. For a web application, that means pages that are slower and less responsive, which is never a good thing. It may also mean that you need more application servers to handle the same number of requests, so there is an obvious cost here. You may want to consider using a logging system that is non-blocking, so that your end user never waits for you to log data. You may also want to only do detailed logging on a random fraction of requests and extrapolate from there. The increased logging may also require significantly more storage space for those logs. You may want to invest in separate hardware for logging. Again this is an obvious cost, but the added insight into your business brings significant value.
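The sampling idea in the note above can be sketched in a few lines of Java. This is only an illustration and is not part of the article's sample code: the class name SampledLogger, its methods, and the 10% rate are all hypothetical. It picks a random fraction of requests for detailed logging and scales the sampled count back up to estimate the total.

```java
import java.util.Random;

// Hypothetical helper (not from the article's download): picks a random
// fraction of requests for detailed logging and extrapolates from the sample.
class SampledLogger {
    private final double sampleRate; // fraction of requests to log, in (0, 1]
    private final Random random;
    private long sampledCount = 0;

    SampledLogger(double sampleRate, Random random) {
        if (sampleRate <= 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be in (0, 1]");
        }
        this.sampleRate = sampleRate;
        this.random = random;
    }

    // Returns true when the current request should get a detailed log entry.
    boolean shouldLog() {
        boolean sampled = random.nextDouble() < sampleRate;
        if (sampled) {
            sampledCount++;
        }
        return sampled;
    }

    long sampledCount() {
        return sampledCount;
    }

    // Scales the sampled count back up to estimate the number of requests seen.
    long estimateTotal() {
        return Math.round(sampledCount / sampleRate);
    }

    public static void main(String[] args) {
        SampledLogger logger = new SampledLogger(0.1, new Random());
        for (int request = 0; request < 100000; request++) {
            if (logger.shouldLog()) {
                // A real application would write its detailed log entry here.
            }
        }
        System.out.println("sampled " + logger.sampledCount()
                + " requests, estimated total " + logger.estimateTotal());
    }
}
```

The counter in this sketch is not thread-safe. A real web application handles requests on many threads, so it would need an atomic counter or per-thread state.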
Just getting all of the data persisted somewhere useful is the relatively easy step in this process. Now you need to process this data and turn it into something more useful. To get a good idea of how non-trivial this has traditionally been, just do a web search for business intelligence. Not only are there a lot of results, but there are a lot of paid results. There are many very expensive software packages out there that are designed to help you with this step (and the last step, reporting).
Even these costly software suites usually require integration efforts that can be quite complex (of course, the vendors usually have a professional services division that can help you with this; just keep your checkbook out). This is where Hadoop comes in. It is well suited for crunching large amounts of data. Just grab a terabyte of log files and a cluster of commodity servers, and get ready to roll. This is the step that is the focus of this article. In this example, you will write a Hadoop map/reduce job to turn Apache web server access log files into a small data set of useful information that can be easily consumed by a web application that creates interactive reports. This brings us to the last step of the process.
Once you have crunched all of the data, you may be left with very valuable information in a concise format, but it's probably sitting in a database or maybe as an XML file on a filer somewhere. That may be fine for you, but chances are that this data needs to be put into the hands of business analysts and executives in your company. They are going to expect a report of some sort that will be both interactive and visually appealing. If your report enables its end users to slice and dice the data in different ways, then not only will this be more useful to users, but it means that you won't have to go back and create yet another report that is based on the same data. Of course, everyone loves eye candy as well. In Part 2 of this series, I will show you how you can use the Dojo toolkit to do this. For now, we will concentrate on crunching data with Hadoop. Let's start processing those Apache logs.
Hadoop is based on the map/reduce paradigm. The idea is to take some unwieldy set of data, transform it into just the data that you are interested in (this is the map step), and then aggregate the result (this is the reduce step). In this example, you want to start with Apache access logs and turn them into a dataset that just contains how many requests you are getting from the various browsers. So the input to the process is a log file (or maybe multiple log files). Listing 2 shows a sampling of what the content might look like.
Listing 2. Sample input content
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/pro_timeEdition.jpg HTTP/1.1" 304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Windows; U; Windows NT 6.1; it; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 ( .NET CLR 3.5.30729)"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/welogo.gif HTTP/1.1" 304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.458.1 Safari/534.3"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/madeonamac.gif HTTP/1.1" 304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/bullet.gif HTTP/1.1" 304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/534.1+ (KHTML, like Gecko) Version/5.0 Safari/533.16"
127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "GET /MAMP/images/valid-xhtml10.png HTTP/1.1" 304 - "http://localhost:8888/MAMP/?language=English" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4 GTB7.0"
This content could be in a single, very large log file, or there could be multiple log files that are the input to the system. Hadoop will run the map part on chunks of this data. The result will be an intermediate set of files that contain the output of the map step and are typically much smaller. Listing 3 shows this sample output.
Listing 3. Intermediate output created by map step
FIREFOX 1
CHROME 1
IE 1
SAFARI 1
FIREFOX 1
As you can tell from Listing 3, the sample greatly simplifies the problem by only worrying about four flavors of browsers: Microsoft Internet Explorer®, Mozilla Firefox®, Apple Safari®, and Google Chrome®. And you're not even worrying about the browser versions (don't do this at home, especially with Internet Explorer). The reduce step is run on these files, aggregating the results into the final output. Listing 4 shows what the final output will look like.
Listing 4. Final output created by reduce step
{"IE":"8678891",
"FIREFOX":"4677201",
"CHROME":"2011558",
"SAFARI":"733549"}
The data shown in Listing 4 has not just been aggregated by the reduce step, but it has also been formatted into JSON. This will make it easy for your web-based reporting application to use. Of course, you can instead output the data in a rawer form and have a server-side application read it and provide it to your web application. Neither Hadoop nor Dojo places any restrictions or requirements on this step.
I hope this all seems simple and straightforward. Hadoop, however, is much more than just a framework that encapsulates the map/reduce paradigm. It implements this paradigm on a distributed basis. Those log files can be huge, but the work will be split up among the machines (nodes) in your Hadoop cluster. These clusters typically scale horizontally, so if you want to process more data or process it faster, just add more machines. Configuring your cluster and optimizing your configuration (like how to split the data so that it can be sent to the various machines) is a large topic on its own. We won't go into detail on this topic in this article, but will instead focus on the map, reduce, and formatting output parts. Let's start with the map phase.
The Hadoop runtime will split up the data (log files) that needs to be processed and give each node in your cluster a chunk of data. This data needs to have the map function applied to it. Listing 5 shows how you specify the map function for your log analyzer sample.
Listing 5. The Mapper for the access log sample
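import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Agents is the sample's enum of browser families (IE, FIREFOX, CHROME,
// SAFARI); it is assumed here to be a top-level enum in the same package.
public class LogMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    // Splits a log line on its double quotes; group 6 is the last quoted
    // field, which is the user agent.
    private static final Pattern regex =
        Pattern.compile("(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"");

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = regex.matcher(value.toString());
        if (m.matches()) {
            String ua = m.group(6).toLowerCase();
            Agents agent = Agents.IE; // default
            // Check Chrome before Safari: Chrome user agents also
            // contain the word "Safari".
            if (ua.contains("chrome")) {
                agent = Agents.CHROME;
            } else if (ua.contains("safari")) {
                agent = Agents.SAFARI;
            } else if (ua.contains("firefox")) {
                agent = Agents.FIREFOX;
            }
            // Emit (browser name, 1) for this request.
            Text agentStr = new Text(agent.name());
            context.write(agentStr, one);
        }
    }
}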
Hadoop makes extensive use of Java™ generics to provide a type-safe way to write the map and reduce functions. Listing 5 is an example of this. Notice that you extend the Hadoop framework class org.apache.hadoop.mapreduce.Mapper. This class takes a (key,value) pair as an input and maps that input to a new (key,value) pair. It is parameterized based on the types of the input and output (key,value) pairs. So in this case it is accepting a (key,value) pair of type (Object,Text) and mapping it to a (key,value) pair of type (Text,IntWritable). There is a single method to implement called map (which is parameterized based on the types of the (key,value) pair that are the input).
For the sample in Listing 5, a single line from the log file (as seen in Listing 2) will be the input Text object that is passed into the map method. You run a regular expression against it to extract the user agent string. Then you use string matching to map the user agent string to an Enum called Agents. Finally, this data is written to the Context object (whose write method is parameterized based on the types of the (key,value) pair that are produced by map). That write statement will produce a single line like the ones in Listing 3. Now you are ready for the reduce phase.
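The extraction and matching steps described above can be tried outside of Hadoop. The following is a minimal sketch that reuses the regular expression and the matching order from Listing 5 in plain Java; the class name UserAgentClassifier, the nested Agents enum, and the choice to return null for lines that do not match are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of the logic inside the map method (no Hadoop needed).
class UserAgentClassifier {
    // Stand-in for the sample's Agents enum, declared here so the sketch
    // is self-contained.
    enum Agents { IE, FIREFOX, CHROME, SAFARI }

    // Same expression as Listing 5: group 6 is the last quoted field.
    private static final Pattern LOG_LINE =
        Pattern.compile("(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"(.*?)\"");

    // Returns the browser family for one access log line, or null when the
    // line does not have the expected shape (an assumption of this sketch;
    // the mapper simply emits nothing in that case).
    static Agents classify(String logLine) {
        Matcher m = LOG_LINE.matcher(logLine);
        if (!m.matches()) {
            return null;
        }
        String ua = m.group(6).toLowerCase();
        // Chrome must be tested first because its user agent string also
        // contains "Safari".
        if (ua.contains("chrome")) {
            return Agents.CHROME;
        } else if (ua.contains("safari")) {
            return Agents.SAFARI;
        } else if (ua.contains("firefox")) {
            return Agents.FIREFOX;
        }
        return Agents.IE; // default, as in Listing 5
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [11/Jul/2010:15:25:29 -0700] "
            + "\"GET /MAMP/images/welogo.gif HTTP/1.1\" 304 - "
            + "\"http://localhost:8888/MAMP/?language=English\" "
            + "\"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) "
            + "AppleWebKit/534.3 (KHTML, like Gecko) "
            + "Chrome/6.0.458.1 Safari/534.3\"";
        // This is the Chrome line from Listing 2.
        System.out.println(classify(line) + " 1");
    }
}
```

The order of the checks matters: if the Safari test came first, every Chrome request would be counted as Safari.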
The Hadoop framework will take the output of the map phase and write it to intermediate files that will serve as the input for the reduce phase. Before calling the reducer, the framework sorts this output and groups the values by key. The reduce phase is where the data is aggregated into something more useful. Listing 6 shows how you specify the reduce function.
Listing 6. The Reducer for the access log sample
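import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AgentSumReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    // Reused for every key to avoid allocating a new object per call.
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // Add up all of the counts that were emitted for this browser.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}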
This time you are extending the org.apache.hadoop.mapreduce.Reducer class and implementing its reduce method. As you can see in Listing 6, generics are once again leveraged to make all of this code type safe. However, you must make sure that the types produced by the Mapper are the same types that are consumed by the Reducer. The reduce method is passed a key and an Iterable over all of the values that were emitted for that key; the Iterable's element type is the value type of the input (key,value) pairs, IntWritable in this case. All you are doing here is adding up those values. Then you write the key and the sum to the Context object, just as you did inside the Mapper. Now you are ready to format the output.
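To see what the reduce phase works with, the grouping and summing can be imitated in plain Java. This is a sketch under stated assumptions, not Hadoop code: the class ShuffleAndReduceSketch and its methods are hypothetical, the input is the space-separated form shown in Listing 3, and a sorted map stands in for the framework's sort-and-group step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of what happens between map and reduce, followed by
// the same summing loop as Listing 6.
class ShuffleAndReduceSketch {
    // Groups lines such as "FIREFOX 1" by key. The TreeMap keeps keys
    // sorted, standing in for the framework's sort step.
    static Map<String, List<Integer>> groupByKey(List<String> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : mapOutput) {
            String[] parts = line.trim().split("\\s+");
            grouped.computeIfAbsent(parts[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(parts[1]));
        }
        return grouped;
    }

    // Mirrors the body of reduce: add up every value seen for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The intermediate output from Listing 3.
        List<String> mapOutput = Arrays.asList(
            "FIREFOX 1", "CHROME 1", "IE 1", "SAFARI 1", "FIREFOX 1");
        for (Map.Entry<String, List<Integer>> entry
                : groupByKey(mapOutput).entrySet()) {
            System.out.println(entry.getKey() + " " + reduce(entry.getValue()));
        }
    }
}
```

For the five lines of Listing 3, the two FIREFOX entries end up in the same group, so that key's sum is 2 and the other three keys each sum to 1.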
This step is actually optional. You could just use the raw output from Hadoop (a name and value on each line, separated by a tab by default). You know, however, that you want to make this data available to the web application for reporting, so you are going to format it as JSON, as you saw in Listing 4. In Listing 7, you can see how this is specified.
Listing 7. The OutputFormat for the access log sample
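import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class JsonOutputFormat extends TextOutputFormat<Text, IntWritable> {
    @Override
    public RecordWriter<Text, IntWritable> getRecordWriter(
            TaskAttemptContext context) throws IOException,
            InterruptedException {
        Configuration conf = context.getConfiguration();
        Path path = getOutputPath(context);
        FileSystem fs = path.getFileSystem(conf);
        // Write a single file, named after the job, in the output directory.
        FSDataOutputStream out =
            fs.create(new Path(path, context.getJobName()));
        return new JsonRecordWriter(out);
    }

    private static class JsonRecordWriter extends
            LineRecordWriter<Text, IntWritable> {
        boolean firstRecord = true;

        public JsonRecordWriter(DataOutputStream out)
                throws IOException {
            super(out);
            // Open the JSON object. writeBytes emits one byte per character,
            // which is enough for the ASCII names and digits written here;
            // writeChar and writeChars would emit two bytes per character.
            out.writeBytes("{");
        }

        @Override
        public synchronized void write(Text key, IntWritable value)
                throws IOException {
            // Separate records with a comma, but not before the first one.
            if (!firstRecord) {
                out.writeBytes(",\r\n");
            }
            firstRecord = false;
            out.writeBytes("\"" + key.toString() + "\":\"" +
                value.toString() + "\"");
        }

        @Override
        public synchronized void close(TaskAttemptContext context)
                throws IOException {
            // Close the JSON object, then let the superclass close the stream.
            out.writeBytes("}");
            super.close(context);
        }
    }
}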
The code in Listing 7 might look a little complicated, but it's actually straightforward. You are extending the class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat, a utility class for outputting data as text (note the use of generics once again; make sure they match the types of the (key,value) pairs being produced by your Reducer class). The only thing you need to do is override the getRecordWriter method, which returns an org.apache.hadoop.mapreduce.RecordWriter (parameterized again) instance. You are returning an instance of JsonRecordWriter, a nested class that receives each aggregated (key,value) pair written by the Reducer and produces a line of data like those in Listing 4. This will produce the JSON data that will now be ready to be consumed by the Dojo-based code in the reporting application.
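The output that JsonRecordWriter aims for, an opening brace, comma-separated records, and a closing brace, can be checked without a Hadoop cluster. The sketch below applies the same first-record logic to an in-memory map; the class JsonCountsFormatter and its toJson method are hypothetical names, and the code assumes that keys are plain names that need no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the JSON layout shown in Listing 4.
class JsonCountsFormatter {
    // Assumes keys contain no quotes or other characters that would need
    // JSON escaping, which holds for the browser names in this sample.
    static String toJson(Map<String, Integer> counts) {
        StringBuilder json = new StringBuilder("{");
        boolean firstRecord = true;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Every record except the first is preceded by a separator.
            if (!firstRecord) {
                json.append(",\r\n");
            }
            firstRecord = false;
            // Values are written as quoted strings, matching Listing 4.
            json.append('"').append(entry.getKey()).append("\":\"")
                .append(entry.getValue()).append('"');
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        // Counts taken from Listing 4; LinkedHashMap preserves this order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("IE", 8678891);
        counts.put("FIREFOX", 4677201);
        counts.put("CHROME", 2011558);
        counts.put("SAFARI", 733549);
        System.out.println(toJson(counts));
    }
}
```

The flag has to be cleared after the first record regardless of whether a separator was written; otherwise no commas are ever emitted.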
Stay tuned for part two of this series to see how to make interactive reports using Dojo that consume the intelligent data that has been produced here by Hadoop.
This article has shown you a simple example of crunching big data with Hadoop. This is an easy way to process large amounts of data that can provide you valuable insight into your business. In this article, you have used Hadoop's map/reduce capabilities directly. If you start using Hadoop for many use cases, you will definitely want to look at higher-level frameworks that are built on top of Hadoop and make it easier to write map/reduce jobs. Two such excellent, open source frameworks are Pig and Hive, both of which are also Apache projects. Both use a more declarative syntax with much less programming. Pig is a data-flow language, developed by members of the core Hadoop team at Yahoo®, while Hive is more SQL-like and was developed at Facebook. You can use either of these, or the lower-level map/reduce jobs, to produce output that can be consumed by a web application.
| Description | Name | Size | Download method |
|---|---|---|---|
| Article source code | AccessLogAnalyzer.zip | 6KB | HTTP |
Learn
- "Distributed data processing with Hadoop" (M. Tim Jones, developerWorks, 2010): Check out this article series for a more detailed exploration of Hadoop.
- "Deriving new business insights with Big Data" (Stephen Watt, developerWorks, June 2010): See how Hadoop can provide insights into your business.
- "Distributed computing with Linux and Hadoop" (Ken Mann and M. Tim Jones, developerWorks, December 2008): Learn more about the inner workings of Hadoop.
- IBM InfoSphere BigInsights Basic Edition -- IBM's Hadoop distribution -- is an integrated, tested and pre-configured, no-charge download for anyone who wants to experiment with and learn about Hadoop.
- Find free courses on Hadoop fundamentals, stream computing, text analytics, and more at Big Data University.
- "Writing a custom Dojo application" (Wendi Nusbickel and Melissa Betancourt, developerWorks, December 2008): Find out much more about Dojo.
- "Develop HTML widgets with Dojo" (Igor Kusakov, developerWorks, October 2006): Explore Dojo's extensibility.
Get products and technologies
- Download IBM InfoSphere BigInsights Basic Edition at no charge and build a solution that turns large, complex volumes of data into insight by combining Apache Hadoop with unique technologies and capabilities from IBM.
- Get Apache Hadoop. This article used version 0.20.
- Download the Dojo toolkit.
- Get the Java SDK. JDK 1.6.0_17 was used in this article.
- Take a look at the Pig and Hive frameworks that are built on top of Hadoop.

Michael Galpin is an architect at eBay and is a frequent contributor to developerWorks. He has spoken at various technical conferences, including JavaOne, EclipseCon and AjaxWorld. To get a preview of what he is working on, follow @michaelg on Twitter.




