©Copyright International Business Machines Corporation 2001. All rights reserved.
"...every click you make, every .gif you take, we'll be watching you..." -- Anonymous, and still in hiding.
There is a wealth of information gathered by the Web server (httpd) and stored in the log file. In this article we'll look at how to use the logs to improve the performance and usability of the web site. Unfortunately, many sites just archive or delete the logs. I'll describe how to load the logs into a database and then to query the database to see how people navigate the site and how site performance can be improved.
I chose a Linux clustered database as a proof point of Linux scalability and also to show that all the old Pentium machines in the office could be put to good use. The idea here is to make a pile of old hardware accomplish the work of a newer, faster machine. A database server that runs on a cluster addresses one aspect of clustered computing. Most of the Linux supercomputer clusters focus on numeric problems. This cluster addresses the commercial requirements of data storage and manipulation. In a cluster, all the computers work on the same query at the same time. The key question: as you add more computers, how much faster do you get the answer? Inter-node communication, data distribution, and load balancing are critical to cluster performance.
To demonstrate the load balancing issue I used the Parallel Virtual Machine (PVM) software common to Beowulf clusters. I used the IBM Informix(r) Extended Parallel Server™ (XPS) database for testing the cluster database scalability.
- Understanding the Web logs and loading them into the database
- Using Structured Query Language (SQL) to answer questions about the Web site
- Measuring the database cluster performance as more hardware is added
Understanding the Web logs and loading them into the database
The log files for the Apache Web server are found in the /var/log/httpd directory (all file names and locations are based on RedHat 6.2). You'll find a number of logs, the most current one being access_log. The default format of the file is:
192.168.1.142 - - [14/Nov/2000:16:27:21 -0500] "GET /time.html HTTP/1.1" 304 - |
The first field is the IP address of the client. The "-" which comes next is the rfc1413 identity check, which is rarely used because of the overhead associated with using the identity check capability. Next is any active .htaccess user id authentication--in this case, there is none. Enclosed in square brackets ([ ]) is a time stamp of the access. The request method, in this case a GET, was for the url /time.html using HTTP/1.1. The 304 is the Status Code for the access as documented in the RFC for HTTP. The final "-" is the number of bytes sent -- none in this case. The 304 return code says the object has not been modified so no bytes are sent.
The default Web server log provides lots of information, but it can mask many individual browsers operating from within the same Internet Service Provider (ISP). Cookies are the solution.
Consider the following three lines from an access_log file:
206.175.175.226 - - [15/Nov/2000:12:23:36 -0500] "GET / HTTP/1.0" 200 1233 206.175.175.226 - - [15/Nov/2000:12:24:15 -0500] "GET / HTTP/1.0" 200 1233 206.175.175.226 - - [15/Nov/2000:12:25:27 -0500] "GET / HTTP/1.0" 200 1233 |
It looks like one browser, identified by a single IP address, has accessed the "/", or root index, page three times. However, if logging is enabled with the following in the Apache httpd.conf configuration file, the above hits tell a very different story
# add user tracking
CookieTracking on
CustomLog logs/clickstream "%{cookie}n %r %t"
|
With logging enabled, we can look in the file called "clickstream" (as configured above) for the following log information:
206.175.175.226.1017974309016881 GET / HTTP/1.0 [15/Nov/2000:12:23:36 -0500] 206.175.175.226.1028974309055577 GET / HTTP/1.0 [15/Nov/2000:12:24:15 -0500] 206.175.175.226.1017974309127275 GET / HTTP/1.0 [15/Nov/2000:12:25:27 -0500] |
There were three completely different browsers accessing the site. This tracking is accomplished using a unique identifier (non-persistent cookie) for each browser. The number following the IP address is a user-tracking cookie that is sent by the server to the browser. Each time the browser makes a request, the cookie is sent back to the server by the browser.
Because the default installation does not have user tracking enabled, the examples in this article do not use CookieTracking. You can jump in with your current logs and do this analysis. You will want to turn on CookieTracking so all those clicks from "that large ISP" will become individual streams instead of one very busy user.
A note on privacy. A number of companies have been caught with their "hand in the cookie jar," so to speak. A quick Internet search yields 124,000 pages that contain the string "privacy" that also contain the string "cookies". This is a topic that has attracted some attention. Develop a privacy policy, post it to your site, and live by it.
Now that we have a handle on the log file, we need to load it into a database to run queries against the data. The log file strings are not friendly to a database high-speed loader so we will massage them into a delimited string. Dust off your favorite Practical Extraction and Report Language. The code you can download is in Perl. A quick disclaimer: my Perl code is not "good Perl style." The distinguishing features are: it works, and it has at least one comment line.
For the following input:
158.58.240.58 - - [04/Jul/2000:23:59:42 -0500] "GET / HTTP/1.1" 200 47507 203.127.32.40 - - [04/Jul/2000:23:59:25 -0500] "GET /jp/product/images/prod_top.gif \ HTTP/1.0" 304 - 203.127.32.40 - - [04/Jul/2000:23:59:24 -0500] "GET /jp/product/down.html \ HTTP/1.0" 200 12390 211.45.44.33 - - [04/Jul/2000:23:59:19 -0500] "GET /kr/images/top_banner.gif \ HTTP/1.1" 200 6719 211.45.44.33 - - [04/Jul/2000:23:59:24 -0500] "GET /kr/images/waytowin.gif \ HTTP/1.1" 200 71488 211.45.44.33 - - [04/Jul/2000:23:59:25 -0500] "GET /kr/train/new.gif \ HTTP/1.1" 200 416 |
The output from the Perl program looks like this:
158.58.240.58|-|-|2000-07-04 23:59:42||/|200|47507|-|1| 203.127.32.40|-|-|2000-07-04 23:59:25|jp|/jp/product/images/prod_top.gif|304||gif|4| 203.127.32.40|-|-|2000-07-04 23:59:24|jp|/jp/product/down.html|200|12390|html|3| 211.45.44.33|-|-|2000-07-04 23:59:19|kr|/kr/images/top_banner.gif|200|6719|gif|3| 211.45.44.33|-|-|2000-07-04 23:59:24|kr|/kr/images/waytowin.gif|200|71488|gif|3| 211.45.44.33|-|-|2000-07-04 23:59:25|kr|/kr/train/new.gif|200|416|gif|3| |
The last two fields, added by the Perl script are: 1) the object type, stripped from the end of the requested URL, and the 2) the link depth, which is the number of "/" characters in the url.
To perform the analysis we'll use a relational database running on a cluster. Relational databases are the most prevalent database architecture. Data is presented to the user in table format. There are a number of relational databases available for Linux, including DB2(r) Universal Database and IBM-Informix XPS Developers Kit. My examples use the XPS Developers Kit. XPS has implemented its own message-passing layer rather than using the more typical Beowulf implementations (for example, PVM, MPI). Feel free to use the database of your choice.
If you want to test a cluster using the XPS Developers Kit, you can get a copy by contacting Mohan Natraj (World Wide Marketing Manager for XPS) at mogs@us.ibm.com. Mohan's phone number is (503) 525-7419.
The installation procedure is well-documented in the Installation Guide and the release notes/README file. To make life easier, download the easyinst.tar file which automates the installation process. By reading the shell script you will see the exact steps to run to do the installation. By reading the installation documentation, you'll understand why the steps are needed.
Relational databases use Structured Query Language (SQL) to work with the data. A tutorial on Informix SQL is available if you need it. We need both data definition language (DDL) to create tables and data manipulation language (DML) to query the data in the tables.
First let's create a table to store the data. The SQL to do this looks like:
-- comment lines explain the code and any unique XPSisms
-- create a database called weblog
-- the "in workslice.1" specifies the storage for the database
create database weblog in workslice.1;
-- a raw table is a non-logging table used for data warehousing
-- this is similar to a DB2 CREATE TABLE statement
-- with the option NOT LOGGED INITIALLY
create raw table webfact (
ip char(15) ,
foo char(1) ,
bar char(1) ,
timestamp datetime year to second,
location char(16) ,
url varchar(255) ,
http_code char(3) ,
bytecount int,
objtype char(6),
linkdepth int
)
-- the "fragment" clause specifies where to
-- put the data. In our case the "workslice"
-- is a disk set across the cluster
fragment by hash(ip) in workslice;
|
To run this SQL from a file:
dbaccess -e -weblogDDL.sql |
Data loaders come in many different flavors. The XPS loader takes the relational table concept and maps it to Linux files and pipes. After the mapping has been defined with an external table you can use SQL to load, query, and unload data. The DDL for the external table is:
create external table e_webfact
-- the "sameas" saves re-typing the above DDL
sameas webfact
using (format "delimited",
REJECTFILE "/tmp/webfact.barf",
MAXERRORS 50000,
datafiles ("disk:1:/home/informix/work/dat/weblog.unl"));
|
Loading from the flat file of data defined in the "datafiles" clause is as simple as an insert-select. This is our first example of a DML statement.
-- now load insert into webfact select e_webfact.* from e_webfact; |
With data in the database we can now run queries to answer questions about the Web site.
Using SQL to answer questions about the Web site
Let's answer the following questions about the Web site with SQL:
- What are the most popular objects?
- Which objects consume the greatest site bandwidth?
- How many errors are generated and what is their nature?
- What is the distribution of activity over time?
- What links are bookmarked, bypassing the home page?
This is only a sample of the questions that can be answered. The logs used for the examples are samples from www.informix.com before the latest web site revisions.
1) What are the most popular objects?
The most popular objects belong on the home page, or a link to them belongs on the home page, to make navigation and access easier. The query is an easy one:
select
url,
count(*) hits
from webfact
group by 1
order by 2 desc;
|
What is the most popular link at Informix for one sample log file? The envelope please:
url / hits 242 url /informix/images/dot_clr.gif hits 208 url /informix/images/blackbar.gif hits 181 url /informix/absolut.css hits 170 |
2) Which objects consume the most bandwidth?
The objects with the largest value of hits * object size consume the most site bandwidth. Why do we care? If a large 24-bit color image is eating most of the bandwidth, you might want to reduce the size or quality of the image to improve performance.
The query we use to find out which objects are consuming the most bandwidth is more complex and creates a temporary table on the fly:
-- use the maximum memory available
set pdqpriority 100;
-- if another query is updating the table don't block
set isolation to dirty read;
-- drop the temporary table if it exists
drop table foo;
select
url,
max( bytecount) mbytes,
count(*) hits
from webfact
group by 1
-- create a temporary table, called foo, on the fly to store the
query output into temp foo;
select
url,
mbytes * hits totalbytes
from foo
order by 2 desc;
|
A sample from the Informix site yields:
url /answers/english/docs/220sdk/5194.pdf totalbytes 186935606 url /answers/english/docs/912ius/4818.pdf totalbytes 144750480 url /answers/english/docs/gn7382/4372.pdf totalbytes 106828470 url /answers/english/docs/73ids/4354.pdf totalbytes 56389410 |
3) How many errors are generated and what is their nature?
The first query below is a good quality-check of the site. The second query provides a list of urls and their errors in descending order, which gives a prioritized list of what to fix.
-- show summary of errors
select
http_code,
count(*) hits
from webfact
group by 1
order by 2 desc;
|
| http_code | class="c11">(count(*)) | |
| 200 | 1478884 | ok |
| 304 | 452508 | use cached copy |
| 404 | 163149 | not found |
| 302 | 55476 | moved temporarily |
| 206 | 42576 | partial content |
| 500 | 7157 | internal server err |
| 401 | 4166 | unauthorized |
| 123 | ? | |
| 403 | 25 | forbidden |
| 503 | 17 | service unavailable |
-- now look at the detail
select
url,
http_code,
count(*) hits
from webfact
group by 1,2
order by 3 desc;
|
I'll spare you the output from this one. It is only interesting if it comes from your own site.
4) What is the distribution of activity over time?
The international community requires that this site is available 7x24. The distribution of hits over time illustrates the demand for high availability in the Web environment.
select
-- get the hour of the access
extend ( timestamp , hour to hour ) hour_x_hit,
count(*) hits
from webfact
group by 1
order by 1 asc;
|
A graphical front end can make all the difference when presenting information. This graph, based on the above query, shows that the site is 7x24:
Figure 1. HTTP Hits by Time
5) What links are bookmarked, bypassing the home page?
If you build it they will come. And maybe even return for another visit. Browser bookmarks (or "favorites" if you use "that" browser) are your visitors' votes about the best content on the site. This content should be easily accessible. Putting links to the best content on the home page makes the home page more of a portal than an annoyance to avoid. Any customization or new information that is presented on the home page should be propagated to the popular pages to reach the returning bookmark audience.
The bookmark SQL is more complicated. We want the first object selected for a unique session. The UserTrack directive would be very helpful, but since this isn't available in the data we'll do the best we can. The comments in the code explain the stages of the query.
set isolation to dirty read;
set pdqpriority 90;
set explain on;
-- table to store bookmark results
--drop table bkmarks;
create raw table bkmarks (
ip char (15),
tstamp datetime year to second,
linkdepth int,
url varchar(255)
)
fragment by hash ( url ) in workslice;
select
distinct ip,
min( timestamp) tstamp
from webfact
-- filter for html pages, not .gif, .jpg ...
where objtype matches 'ht*'
group by 1
-- create a temporary table for the results
into temp foo;
-- now put a summary into the bookmarks table
insert into bkmarks
select
w.ip,
w.timestamp,
w.linkdepth,
w.url
-- notice that two tables are combined in the query
from webfact w,
foo f
-- the where clause allows the specification of how to
-- match-up the two tables and to do any desired filtering
where
w.ip = f.ip
and w.timestamp = f.tstamp
and objtype matches 'ht*'
;
select url, count(*) bookmark_count from bkmarks
group by 1
order by 2 desc
|
All the analysis so far can be performed on any database. Now we'll look at the server used for these examples and how it can be clustered.
Computer clusters use multiple computers to work on the same problem. There are two main reasons to cluster computers: performance and availability. Correspondingly, there are two dominant architectures for clusters: shared nothing architecture or a shared disk architecture. A shared nothing architecture is well suited to both tasks of performance and availability, a shared disk architecture lends itself more to addressing availability issues with limited scalability.
"Shared huh?" you might ask. The "sharing" refers to how the CPU and disk are allocated among the computers. The cluster of computers must define how they will work on the same task at the same time. In a shared nothing architecture, the computers all are assigned the problem to be solved and then work individually on data they control until they can return a result set to the user who launched the query. Each computer operates on the disk, memory, CPU, and data assigned to them. Conversely, a shared disk (or distributed lock manager) cluster uses a lock mechanism to arbitrate the cluster computers requests to a common disk repository of data. There is much debate about which architecture is better. I lean very strongly in favor of shared-nothing. An example of a shared-nothing supercomputer is the Search for Extraterrestrial Intelligence (SETI@home). Shared nothing databases are commercially available from IBM and Teradata. A distributed lock manager database (Cache Fusion) is available from Oracle. See Blair Adamache's article, Clustering for Scalability, for more on this topic.
Many Linux clusters are called Beowulf clusters. Rather than attempt a definition I offer you a quote from the FAQ (see http://www.beowulf.org/ for more information):
1. What's a Beowulf? [1999-05-13]
It is a kind of high-performance massively parallel computer built primarily out of commodity hardware components, running a free-software operating system like Linux or FreeBSD, and interconnected by a private high-speed network.
2. Where can I get the Beowulf software? [1999-05-13]
There is no software package called "Beowulf." There are, however, several pieces of software many people have found useful for building Beowulfs. None of them are essential. They include MPICH, LAM, PVM, the Linux kernel, and the channel-bonding patch to the Linux kernel.
A Beowulf Example - Load Balancing Issues
Before we jump into a database cluster lets take a look at a simple two-node Beowulf cluster using Parallel Virtual Machine (PVM) with an X-windows interface (XPVM.) The two nodes are asymmetric: one is a 133 MHz Pentium, the second is a 500 MHz Pentium III. The compute task is trivial: gzcat a compressed file and direct the output to /dev/null. The display shows an interesting facet of cluster computing: there is no load balancing inherent in the cluster.
Figure 2. Cluster computing
The bar chart shows the "piecrust" computer (133 MHz) still working with an elapsed time of 5.7 seconds, while the "poobah" server (500 MHz) has already finished. If the final result is dependent on the result set from each node, the throughput is throttled by the slowest computer. The same issue arises if the amount of work assigned to a symmetrical cluster is skewed towards a subset of the nodes.
My cluster for the click stream analysis was a very low budget affair. I used all the old Pentium 233 MHz machines in the office I could scrounge, IBM-Informix XPS for Linux, and a 100 megabit ethernet switch. I purloined one 366 MHz machine, but as explained above, it was limited by the rest of the herd. The illustration shows two of the machines, the monitor, and the ethernet switch.
Figure 3. Physically clustered machines with penguins

I focused the cluster testing on performance scalability. How much additional performance is achieved by adding additional processors? The perfect case is linear (100%) scalability. There are some confounding variables. Moving from one processor to two processors introduces the communications layer between the nodes, sending the queries to the nodes (function shipping), and consolidating the result set to the requestor node. This tends to reduce scalability. This is a one-time penalty when moving from a single node to multiple nodes. Working to increase scalability is "super-linear speedup." As nodes are added more Random Access Memory (RAM) is available for in-memory data structures. RAM is so much faster than disk that 2 * n nodes might result in job times shorter than T / 2. Why? When the total data is much larger than the available RAM, the database will use disk space to store temporary results -- much like virtual memory paging in an operating system. As nodes are added to the cluster, their RAM becomes available to process the query. If enough nodes are added to allow all the data structures to reside in the RAM of all the machines, the swapping is avoided. Things speed up quite a bit when this happens.
The graph below shows run times for two, three, four, and five processors. The results are nearly linear. True 100% scalability would produce straight lines with constant slope. There is a mixed query workload and not all queries provide the same scalability. These tests were performed with a 100 megabit switch. The test case with five computers shows a slowdown on some queries. The fifth computer had no local data disk attached, so some of the SQL caused enough traffic to overload the 100 megabit switch.
Figure 4. Test cycle of coservers
It is easy to test the impact of network speed on cluster performance. I substituted a 10 megabit hub for the switch and re-ran the same suite of tests. Here we see ethernet at its worst. We have near linear slowdown instead of speed-up. The more packets flying around the wire, the more collisions and the slower we go. The Beowulf FAQ refers to a channel-bonding patch to make multiple Network Interface Cards (NICs) function as a single high-speed interface. This graph dramatically demonstrates why a high-speed connection is critical to the cluster.
Figure 5. Test cycle of 2, 3 and 4 nodes
Web log analysis can make significant performance and usability improvements to your Web site. Linux database clusters have tremendous potential. Please do try this at home - although your mileage may vary!
As you tune for performance (see Performance Guide), remember that there is now an additional major resource to measure and optimize: the normal CPU, disk, memory, and now interconnect. An easy way to get going on a two-node system without the expense of a 100 megabit switch (about $100) is a crossover cable (about $10). The cable will connect two machines to each other without a hub/switch.
Please let me know about your Linux database cluster experiences by contacting me at lurie@us.ibm.com (this is not a tech support email).
DB2 and IBM are trademarks or registered trademarks of the IBM Corporation in the United States and/or other countries.
Informix and Informix Extended Parallel Server are trademarks of Informix Corporation or its affiliates, one or more of which may be registered in the U.S. or other jurisdictions.
Other company, product, and service names may be trademarks or service marks of others.
IBM copyright and trademark information
Informix copyright and trademark information

Marty Lurie started his computer career generating chads while attempting to write Fortran on an IBM 1130. His day job is in Systems Engineering for IBM's Informix organization, but if pressed he will admit he mostly plays with computers. His favorite program is the one he wrote to connect his Nordic Track to his laptop (the laptop lost two pounds, and lowered its cholesterol by 20%). Marty is an IBM-certified DB2 DBA, IBM-certified Business Intelligence Solutions Professional, and an Informix-certified Professional. He can be reached at lurie@us.ibm.com.
Comments (Undergoing maintenance)





