Level: Introductory Marty Lurie (lurie@us.ibm.com), Systems Engineer, IBM
24 Oct 2001 Marty Lurie describes how to use SQL to analyze the logs of a Web server to understand how people use the site and performance issues. Using Linux clusters with a shared nothing database can improve the performance of that processing. Examples are shown using IBM-Informix XPS Developers Kit.
©Copyright International Business Machines
Corporation 2001. All rights reserved.
"...every click you make, every .gif you take,
we'll be watching you..." -- Anonymous, and still in
hiding.
There is a wealth of information gathered by the Web server
(httpd) and stored in the log file. In this article we'll look at
how to use the logs to improve the performance and usability of the
web site. Unfortunately, many sites just archive or delete the
logs. I'll describe how to load the logs into a database and then
to query the database to see how people navigate the site and how
site performance can be improved.
I chose a Linux clustered database as a proof point of Linux
scalability and also to show that all the old Pentium machines in
the office could be put to good use. The idea here is to make a
pile of old hardware accomplish the work of a newer, faster
machine. A database server that runs on a cluster addresses one
aspect of clustered computing. Most of the Linux supercomputer
clusters focus on numeric problems. This cluster addresses the
commercial requirements of data storage and manipulation. In a
cluster, all the computers work on the same query at the same time.
The key question: as you add more computers, how much faster do you
get the answer? Inter-node communication, data distribution, and
load balancing are critical to cluster performance.
To demonstrate the load balancing issue I used the Parallel
Virtual Machine (PVM) software common to Beowulf clusters. I used
the IBM Informix(r) Extended Parallel Server™ (XPS) database for
testing the cluster database scalability.
Let's tackle the following:
- Understanding the Web logs and loading them into the
database
- Using Structured Query Language (SQL) to answer questions about
the Web site
- Measuring the database cluster performance as more hardware is
added
Understanding the Web logs and loading them into the
database
The log files for the Apache Web server are found in the
/var/log/httpd directory (all file names and locations are based on
RedHat 6.2). You'll find a number of logs, the most current one
being access_log. The default format of the file is:
192.168.1.142 - - [14/Nov/2000:16:27:21 -0500] "GET /time.html HTTP/1.1" 304 -
|
The first field is the IP address of the client. The "-" which
comes next is the rfc1413 identity check, which is rarely used
because of the overhead associated with using the identity check
capability. Next is any active .htaccess user id authentication--in
this case, there is none. Enclosed in square brackets ([ ]) is a
time stamp of the access. The request method, in this case a GET,
was for the url /time.html using HTTP/1.1. The 304 is the Status
Code for the access as documented in the RFC for HTTP.
The final "-" is the number of bytes sent -- none in this case. The
304 return code says the object has not been modified so no bytes
are sent.
Please pass the cookies:
The default Web server log provides lots of information, but it
can mask many individual browsers operating from within the same
Internet Service Provider (ISP). Cookies are the solution.
Consider the following three lines from an access_log file:
206.175.175.226 - - [15/Nov/2000:12:23:36 -0500] "GET / HTTP/1.0" 200 1233
206.175.175.226 - - [15/Nov/2000:12:24:15 -0500] "GET / HTTP/1.0" 200 1233
206.175.175.226 - - [15/Nov/2000:12:25:27 -0500] "GET / HTTP/1.0" 200 1233
|
It looks like one browser, identified by a single IP address,
has accessed the "/", or root index, page three times. However, if
logging is enabled with the following in the Apache httpd.conf
configuration file, the above hits tell a very different story
# add user tracking
CookieTracking on
CustomLog logs/clickstream "%{cookie}n %r %t"
|
With logging enabled, we can look in the file called
"clickstream" (as configured above) for the following log
information:
206.175.175.226.1017974309016881 GET / HTTP/1.0 [15/Nov/2000:12:23:36 -0500]
206.175.175.226.1028974309055577 GET / HTTP/1.0 [15/Nov/2000:12:24:15 -0500]
206.175.175.226.1017974309127275 GET / HTTP/1.0 [15/Nov/2000:12:25:27 -0500]
|
There were three completely different browsers accessing the
site. This tracking is accomplished using a unique identifier
(non-persistent cookie) for each browser. The number following the
IP address is a user-tracking cookie that is sent by the server to
the browser. Each time the browser makes a request, the cookie is
sent back to the server by the browser.
Because the default installation does not have user tracking
enabled, the examples in this article do not use CookieTracking.
You can jump in with your current logs and do this analysis. You
will want to turn on CookieTracking so all those clicks from "that
large ISP" will become individual streams instead of one very busy
user.
A note on privacy. A number of companies have been caught with
their "hand in the cookie jar," so to speak. A quick Internet
search yields 124,000 pages that contain the string "privacy" that
also contain the string "cookies". This is a topic that has
attracted some attention. Develop a privacy policy, post it to your
site, and live by it.
Now that we have a handle on the log file, we need to load it
into a database to run queries against the data. The log file
strings are not friendly to a database high-speed loader so we will
massage them into a delimited string. Dust off your favorite
Practical Extraction and Report Language. The code you can download
is in Perl. A quick disclaimer: my Perl code is not "good Perl
style." The distinguishing features are: it works, and it has at
least one comment line.
For the following input:
158.58.240.58 - - [04/Jul/2000:23:59:42 -0500] "GET / HTTP/1.1" 200 47507
203.127.32.40 - - [04/Jul/2000:23:59:25 -0500] "GET /jp/product/images/prod_top.gif HTTP/1.0"
304 -
203.127.32.40 - - [04/Jul/2000:23:59:24 -0500] "GET /jp/product/down.html HTTP/1.0" 200 12390
211.45.44.33 - - [04/Jul/2000:23:59:19 -0500] "GET /kr/images/top_banner.gif HTTP/1.1" 200
6719
211.45.44.33 - - [04/Jul/2000:23:59:24 -0500] "GET /kr/images/waytowin.gif HTTP/1.1" 200
71488
211.45.44.33 - - [04/Jul/2000:23:59:25 -0500] "GET /kr/train/new.gif HTTP/1.1" 200 416
|
The output from the Perl program looks like this:
158.58.240.58|-|-|2000-07-04 23:59:42||/|200|47507|-|1|
203.127.32.40|-|-|2000-07-04 23:59:25|jp|/jp/product/images/prod_top.gif|304||gif|4|
203.127.32.40|-|-|2000-07-04 23:59:24|jp|/jp/product/down.html|200|12390|html|3|
211.45.44.33|-|-|2000-07-04 23:59:19|kr|/kr/images/top_banner.gif|200|6719|gif|3|
211.45.44.33|-|-|2000-07-04 23:59:24|kr|/kr/images/waytowin.gif|200|71488|gif|3|
211.45.44.33|-|-|2000-07-04 23:59:25|kr|/kr/train/new.gif|200|416|gif|3|
|
The last two fields, added by the Perl script are: 1) the object
type, stripped from the end of the requested URL, and the 2) the
link depth, which is the number of "/" characters in the url.
Loading up the Database
To perform the analysis we'll use a relational database running
on a cluster. Relational databases are the most prevalent database
architecture. Data is presented to the user in table format. There
are a number of relational databases available for Linux, including
DB2(r) Universal Database and IBM-Informix XPS Developers Kit. My
examples use the XPS Developers Kit. XPS has implemented its own
message-passing layer rather than using the more typical Beowulf
implementations (for example, PVM, MPI). Feel free to use the
database of your choice.
If you want to test a cluster using the XPS Developers Kit, you
can get a copy by contacting Mohan Natraj (World Wide Marketing
Manager for XPS) at mogs@us.ibm.com. Mohan's phone number
is (503) 525-7419.
The installation procedure is well-documented in the Installation
Guide and the release notes/README file. To make life easier,
download the easyinst.tar
file which automates the installation process. By reading the shell
script you will see the exact steps to run to do the installation.
By reading the installation documentation, you'll understand why
the steps are needed.
Relational databases use Structured Query Language (SQL) to work
with the data. A tutorial
on Informix SQL is available if you need it. We need both data
definition language (DDL) to create tables and data manipulation
language (DML) to query the data in the tables.
First let's create a table to store the data. The SQL to do this
looks like:
-- comment lines explain the code and any unique XPSisms
-- create a database called weblog
-- the "in workslice.1" specifies the storage for the database
create database weblog in workslice.1;
-- a raw table is a non-logging table used for data warehousing
-- this is similar to a DB2 CREATE TABLE statement
-- with the option NOT LOGGED INITIALLY
create raw table webfact (
ip char(15) ,
foo char(1) ,
bar char(1) ,
timestamp datetime year to second,
location char(16) ,
url varchar(255) ,
http_code char(3) ,
bytecount int,
objtype char(6),
linkdepth int
)
-- the "fragment" clause specifies where to
-- put the data. In our case the "workslice"
-- is a disk set across the cluster
fragment by hash(ip) in workslice;
|
To run this SQL from a file:
dbaccess -e -weblogDDL.sql
|
Data loaders come in many different flavors. The XPS loader
takes the relational table concept and maps it to Linux files and
pipes. After the mapping has been defined with an external
table you can use SQL to load, query, and unload data. The DDL
for the external table is:
create external table e_webfact
-- the "sameas" saves re-typing the above DDL
sameas webfact
using (format "delimited",
REJECTFILE "/tmp/webfact.barf",
MAXERRORS 50000,
datafiles ("disk:1:/home/informix/work/dat/weblog.unl"));
|
Loading from the flat file of data defined in the "datafiles"
clause is as simple as an insert-select. This is our first example
of a DML statement.
-- now load
insert into webfact select e_webfact.* from e_webfact;
|
With data in the database we can now run queries to answer
questions about the Web site.
Using SQL to answer questions about the Web site
Let's answer the following questions about the Web site with
SQL:
- What are the most popular objects?
- Which objects consume the greatest site bandwidth?
- How many errors are generated and what is their nature?
- What is the distribution of activity over time?
- What links are bookmarked, bypassing the home page?
This is only a sample of the questions that can be answered. The
logs used for the examples are samples from www.informix.com before
the latest web site revisions.
1) What are the most popular objects?
The most popular objects belong on the home page, or a link to
them belongs on the home page, to make navigation and access
easier. The query is an easy one:
select
url,
count(*) hits
from webfact
group by 1
order by 2 desc;
|
What is the most popular link at Informix for one sample log
file? The envelope please:
url /
hits 242
url /informix/images/dot_clr.gif
hits 208
url /informix/images/blackbar.gif
hits 181
url /informix/absolut.css
hits 170
|
2) Which objects consume the most bandwidth?
The objects with the largest value of hits *
object size consume the most site bandwidth. Why do we care?
If a large 24-bit color image is eating most of the bandwidth, you
might want to reduce the size or quality of the image to improve
performance.
The query we use to find out which objects are consuming the
most bandwidth is more complex and creates a temporary table on the
fly:
-- use the maximum memory available
set pdqpriority 100;
-- if another query is updating the table don't block
set isolation to dirty read;
-- drop the temporary table if it exists
drop table foo;
select
url,
max( bytecount) mbytes,
count(*) hits
from webfact
group by 1
-- create a temporary table, called foo, on the fly to store the
query output into temp foo;
select
url,
mbytes * hits totalbytes
from foo
order by 2 desc;
|
A sample from the Informix site yields:
url /answers/english/docs/220sdk/5194.pdf
totalbytes 186935606
url /answers/english/docs/912ius/4818.pdf
totalbytes 144750480
url /answers/english/docs/gn7382/4372.pdf
totalbytes 106828470
url /answers/english/docs/73ids/4354.pdf
totalbytes 56389410
|
3) How many errors are generated and what is their
nature?
The first query below is a good quality-check of the site. The
second query provides a list of urls and their errors in descending
order, which gives a prioritized list of what to fix.
-- show summary of errors
select
http_code,
count(*) hits
from webfact
group by 1
order by 2 desc;
|
| http_code | class="c11">(count(*)) | | | 200 | 1478884 | ok | | 304 | 452508 | use cached copy | | 404 | 163149 | not found | | 302 | 55476 | moved temporarily | | 206 | 42576 | partial content | | 500 | 7157 | internal server err | | 401 | 4166 | unauthorized | | 123 | ? | | 403 | 25 | forbidden | | 503 | 17 | service unavailable |
-- now look at the detail
select
url,
http_code,
count(*) hits
from webfact
group by 1,2
order by 3 desc;
|
I'll spare you the output from this one. It is only interesting
if it comes from your own site.
4) What is the distribution of activity over
time?
The international community requires that this site is available
7x24. The distribution of hits over time illustrates the demand for
high availability in the Web environment.
select
-- get the hour of the access
extend ( timestamp , hour to hour ) hour_x_hit,
count(*) hits
from webfact
group by 1
order by 1 asc;
|
A graphical front end can make all the difference when
presenting information. This graph, based on the above query, shows
that the site is 7x24:
Figure 1. HTTP Hits by Time
5) What links are bookmarked, bypassing the home
page?
If you build it they will come. And maybe even return for
another visit. Browser bookmarks (or "favorites" if you use "that"
browser) are your visitors' votes about the best content on the
site. This content should be easily accessible. Putting links to
the best content on the home page makes the home page more of a
portal than an annoyance to avoid. Any customization or new
information that is presented on the home page should be propagated
to the popular pages to reach the returning bookmark audience.
The bookmark SQL is more complicated. We want the first object
selected for a unique session. The UserTrack directive would be
very helpful, but since this isn't available in the data we'll do
the best we can. The comments in the code explain the stages of the
query.
set isolation to dirty read;
set pdqpriority 90;
set explain on;
-- table to store bookmark results
--drop table bkmarks;
create raw table bkmarks (
ip char (15),
tstamp datetime year to second,
linkdepth int,
url varchar(255)
)
fragment by hash ( url ) in workslice;
select
distinct ip,
min( timestamp) tstamp
from webfact
-- filter for html pages, not .gif, .jpg ...
where objtype matches 'ht*'
group by 1
-- create a temporary table for the results
into temp foo;
-- now put a summary into the bookmarks table
insert into bkmarks
select
w.ip,
w.timestamp,
w.linkdepth,
w.url
-- notice that two tables are combined in the query
from webfact w,
foo f
-- the where clause allows the specification of how to
-- match-up the two tables and to do any desired filtering
where
w.ip = f.ip
and w.timestamp = f.tstamp
and objtype matches 'ht*'
;
select url, count(*) bookmark_count from bkmarks
group by 1
order by 2 desc
|
All the analysis so far can be performed on any database. Now
we'll look at the server used for these examples and how it can be
clustered.
Linux Clusters
Computer clusters use multiple computers to work on the same
problem. There are two main reasons to cluster computers:
performance and availability. Correspondingly, there are two
dominant architectures for clusters: shared nothing architecture or
a shared disk architecture. A shared nothing architecture is well
suited to both tasks of performance and availability, a shared disk
architecture lends itself more to addressing availability issues
with limited scalability.
"Shared huh?" you might ask. The "sharing" refers to how the CPU
and disk are allocated among the computers. The cluster of
computers must define how they will work on the same task at the
same time. In a shared nothing architecture, the computers all are
assigned the problem to be solved and then work individually on
data they control until they can return a result set to the user
who launched the query. Each computer operates on the disk, memory,
CPU, and data assigned to them. Conversely, a shared disk (or
distributed lock manager) cluster uses a lock mechanism to
arbitrate the cluster computers requests to a common disk
repository of data. There is much debate about which architecture
is better. I lean very strongly in favor of shared-nothing. An
example of a shared-nothing supercomputer is the Search for
Extraterrestrial Intelligence (SETI@home). Shared nothing databases
are commercially available from IBM and Teradata. A distributed
lock manager database (Cache Fusion) is available from Oracle. See
Blair Adamache's article,
Clustering for Scalability, for more on this topic.
Many Linux clusters are called Beowulf clusters. Rather than
attempt a definition I offer you a quote from the FAQ (see http://www.beowulf.org/ for more
information):
1. What's a Beowulf? [1999-05-13]
It is a kind of high-performance massively parallel computer
built primarily out of commodity hardware components, running a
free-software operating system like Linux or FreeBSD, and
interconnected by a private high-speed network.
2. Where can I get the Beowulf software? [1999-05-13]
There is no software package called "Beowulf." There are,
however, several pieces of software many people have found useful
for building Beowulfs. None of them are essential. They include
MPICH, LAM, PVM, the Linux kernel, and the channel-bonding patch to
the Linux kernel.
A Beowulf Example - Load Balancing Issues
Before we jump into a database cluster lets take a look at a
simple two-node Beowulf cluster using Parallel Virtual Machine
(PVM) with an X-windows interface (XPVM.) The two nodes are
asymmetric: one is a 133 MHz Pentium, the second is a 500 MHz
Pentium III. The compute task is trivial: gzcat a compressed file
and direct the output to /dev/null. The display shows an
interesting facet of cluster computing: there is no load balancing
inherent in the cluster.
Figure 2. Cluster computing
The bar chart shows the "piecrust" computer (133 MHz) still
working with an elapsed time of 5.7 seconds, while the "poobah"
server (500 MHz) has already finished. If the final result is
dependent on the result set from each node, the throughput is
throttled by the slowest computer. The same issue arises if the
amount of work assigned to a symmetrical cluster is skewed towards
a subset of the nodes.
The Database Cluster
My cluster for the click stream analysis was a very low budget
affair. I used all the old Pentium 233 MHz machines in the office I
could scrounge, IBM-Informix XPS for Linux, and a 100 megabit
ethernet switch. I purloined one 366 MHz machine, but as explained
above, it was limited by the rest of the herd. The illustration
shows two of the machines, the monitor, and the ethernet
switch.
Figure 3. Physically clustered machines with penguins
Does it Scale?
I focused the cluster testing on performance scalability. How
much additional performance is achieved by adding additional
processors? The perfect case is linear (100%) scalability. There
are some confounding variables. Moving from one processor to two
processors introduces the communications layer between the nodes,
sending the queries to the nodes (function shipping), and
consolidating the result set to the requestor node. This tends to
reduce scalability. This is a one-time penalty when moving from a
single node to multiple nodes. Working to increase scalability is
"super-linear speedup." As nodes are added more Random Access
Memory (RAM) is available for in-memory data structures. RAM is so
much faster than disk that 2 * n nodes might result in job
times shorter than T / 2. Why? When the total data is much larger
than the available RAM, the database will use disk space to store
temporary results -- much like virtual memory paging in an
operating system. As nodes are added to the cluster, their RAM
becomes available to process the query. If enough nodes are added
to allow all the data structures to reside in the RAM of all the
machines, the swapping is avoided. Things speed up quite a bit when
this happens.
The graph below shows run times for two, three, four, and five
processors. The results are nearly linear. True 100% scalability
would produce straight lines with constant slope. There is a mixed
query workload and not all queries provide the same scalability.
These tests were performed with a 100 megabit switch. The test case
with five computers shows a slowdown on some queries. The fifth
computer had no local data disk attached, so some of the SQL caused
enough traffic to overload the 100 megabit switch.
Figure 4. Test cycle of coservers
It is easy to test the impact of network speed on cluster
performance. I substituted a 10 megabit hub for the switch and
re-ran the same suite of tests. Here we see ethernet at its worst.
We have near linear slowdown instead of speed-up. The more packets
flying around the wire, the more collisions and the slower we go.
The Beowulf FAQ refers to a channel-bonding patch to make multiple
Network Interface Cards (NICs) function as a single high-speed
interface. This graph dramatically demonstrates why a high-speed
connection is critical to the cluster.
Figure 5. Test cycle of 2, 3 and 4 nodes
Conclusions
Web log analysis can make significant performance and usability
improvements to your Web site. Linux database clusters have
tremendous potential. Please do try this at home - although
your mileage may vary!
As you tune for performance (see Performance
Guide), remember that there is now an additional major resource
to measure and optimize: the normal CPU, disk, memory, and now
interconnect. An easy way to get going on a two-node system without
the expense of a 100 megabit switch (about $100) is a crossover
cable (about $10). The cable will connect two machines to each
other without a hub/switch.
Please let me know about your Linux database cluster experiences
by contacting me at lurie@us.ibm.com (this is not a tech
support email).
DB2 and IBM are trademarks or registered trademarks
of the IBM Corporation in the United States and/or other
countries.
Informix and Informix Extended Parallel Server are trademarks of
Informix Corporation or its affiliates, one or more of which may be
registered in the U.S. or other jurisdictions.
Other company, product, and service names may be trademarks or
service marks of others.
IBM copyright
and trademark information
Informix
copyright and trademark information
About the author  | 
|  | Marty Lurie started his computer career generating chads while
attempting to write Fortran on an IBM 1130. His day job is in
Systems Engineering for IBM's Informix organization, but if pressed
he will admit he mostly plays with computers. His favorite program
is the one he wrote to connect his Nordic Track to his laptop (the
laptop lost two pounds, and lowered its cholesterol by 20%). Marty
is an IBM-certified DB2 DBA, IBM-certified Business Intelligence
Solutions Professional, and an Informix-certified Professional. He can be reached at lurie@us.ibm.com. |
Rate this page
|