Level: Intermediate
Martin Streicher (martin.streicher@gmail.com), Editor-in-Chief, McClatchy Interactive
04 Mar 2008
Stretch the capacity of your Web server farm with PHP and a reverse proxy, such as Varnish.
The history of the World Wide Web may be short, but its virtual landscape is already
littered with scads of digital dross. The tarnished logos of so many failed dot-coms
are strewn to and fro, discarded (or repossessed) servers sit idle, collecting dust,
and almost everyone from Silicon Valley to Silicon Alley has a tall tale to tell. "Why,
when I was just a lad, we didn't have fancy WYSIWYG editors. We hand-coded our HTML,
and we liked it! Ah, those were the days of baud, boy!"
Frequently used acronyms
CSS: Cascading Style Sheets
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
RAM: Random Access Memory
Thankfully, much has changed from those heady times of the mid-1990s. Designers have
fancy tools to create Web sites, as do developers. Scripting languages, including PHP,
are conveniences, and frameworks like CakePHP (see Resources) accelerate all stages of coding. Sites have also learned
how to scale to keep pace with demand. Need more bandwidth? Lease a bigger pipe.
Need to run faster? Crank up the clock cycles. Need to push more pages? Deploy more Web servers.
Yet more servers? Perhaps. If you have cash to burn.
In fact, you can scale a site many ways, and multiplying servers is but one (albeit
often practical and necessary) approach. Another technique reallocates existing servers
to defuse overwhelming incoming traffic. The kernel of the idea: Why generate a page
anew again and again? There are many cases in which a generated page can live for
seconds, even longer. The trick is to keep the page handy when the second, third, and
10,000th visitor visits its URL.
Here, I combine PHP with smart software called a reverse proxy to cache pages and save servers.
Why a reverse proxy?
Like your computer's memory cache — or like the PHP opcode cache
— a reverse proxy eliminates rework and hastens the delivery of oft-requested data.
Specifically, a reverse proxy intercedes between Web clients and your Web server to
capture each incoming HTTP request and its respective HTTP response. Then, given a
repertoire of requests and matching responses, the reverse proxy can act as if it were
the genuine Web server. In some instances, the reverse proxy may simply pass an
incoming request through to the Web server. But in other cases, a reverse proxy can
choose to process the request itself.
Think of a cached HTTP response as a form letter: The reverse proxy simply sends a form
letter in response to a particular request. The second and third (etc.) request for an
asset (page or image, for example) receives the same response as the original request.
(An example exchange is shown in Figure 1.)
Figure 1. Hypothetical reverse proxy shares same response with many clients
Figure 2 depicts the relationships between client, server, and the reverse proxy. The
Web client — Firefox or Safari, for example — connects to the
public-facing Web "server" on port 80. In fact, the "server" is actually a reverse
proxy. Only the proxy can connect to the actual Web server through port 2001. If the
proxy can't fulfill a request or isn't permitted to fulfill a request because of caching
rules (discussed momentarily), the proxy defers to the Web server. Again, depending on
mandates and the type of request, the proxy may cache the response and forward it to the client.
Figure 2. Relationship between Web client, reverse proxy, and Web server
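The form-letter idea can be sketched in a few lines of PHP. This is only a toy model, not how Varnish is built: the class name TinyProxy, its members, and the closure standing in for the real Web server are all hypothetical, and the sketch ignores expiry and HTTP headers entirely.

```php
<?php
// Toy model of a reverse proxy's core decision: answer from the cache,
// or ask the origin server and remember the answer.
// TinyProxy and its members are hypothetical names used for illustration.
final class TinyProxy
{
    /** @var array<string, string> cached response bodies, keyed by URL */
    private array $cache = [];

    /** @var callable stand-in for the real Web server: URL in, body out */
    private $origin;

    /** Number of requests that had to be forwarded to the origin */
    public int $originHits = 0;

    public function __construct(callable $origin)
    {
        $this->origin = $origin;
    }

    public function get(string $url): string
    {
        if (isset($this->cache[$url])) {
            return $this->cache[$url];      // cache hit: send the "form letter"
        }
        $this->originHits++;                // cache miss: the origin does the work
        $body = ($this->origin)($url);
        $this->cache[$url] = $body;
        return $body;
    }
}

$proxy = new TinyProxy(fn (string $url): string => "<html>page for $url</html>");
$proxy->get('/index.php');
$proxy->get('/index.php');
$proxy->get('/index.php');
echo $proxy->originHits, "\n"; // prints 1
```

Three requests reach the proxy, but the stand-in server generates the page only once.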
In addition to expediting delivery, the reverse-proxy caching scheme provides many
other benefits, including:
- The cache that the reverse proxy maintains and serves can unburden your Web
application. The computation required to generate a response is performed more
infrequently because common requests are fulfilled in the interim by the cache.
- The reverse proxy can unencumber your database server. Most Web applications depend on
a database, so each request requires one or more queries. Fewer requests translate to
fewer queries and a snappier database server. In fact, if the paucity of database connections or the laggardly response of your
database server is systemic, embedding one or more reverse proxies could beget widespread gain.
- The cache can countervail server load. Running less code and fewer queries improves throughput.
Memory is the best store for a cache because access time is (practically)
instantaneous and RAM is typically plentiful (or cheap to amass). However, the file
system can also act as a cache store. It's vastly more abundant and affordable than memory, albeit far slower to access.
Of course, assets can and do change rapidly on the Web, and cached assets eventually
become stale or out of date. Each request and response can specify its own "freshness
date," and the cache expires each datum on schedule, usually a few seconds after its capture.
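As a rough illustration of expiry, a cache can compare the age of a stored response with the lifetime the response was granted. The function name is_fresh and its parameters are hypothetical; a real proxy also accounts for details such as time spent in transit.

```php
<?php
// Hypothetical helper: is a cached entry still fresh?
// $storedAt and $now are UNIX timestamps; $maxAge is a lifetime in seconds.
function is_fresh(int $storedAt, int $maxAge, int $now): bool
{
    $age = $now - $storedAt;    // seconds the entry has sat in the cache
    return $age < $maxAge;      // fresh only while younger than its lifetime
}

var_dump(is_fresh(1000, 10, 1005)); // bool(true): 5 seconds old, 10 allowed
var_dump(is_fresh(1000, 10, 1020)); // bool(false): 20 seconds old, expired
```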
The contents of the cache are manipulated through HTTP headers, the
preamble of each HTTP request and response. Headers can set the expiry for an asset and
can subvert caching. (The complex, subtle, and sometimes contradictory caching
strategies of the client, server, proxy, and other agents are beyond the scope of this
article.) Indeed, HTTP cache control is part of the HTTP V1.1 protocol specification
(see Resources).
The cache-control header
The quote below provides a snippet of Section 13.1.3 of the HTTP 1.1 protocol
specification (see Resources). Note that the emphasis on the
word MUST isn't editorial commentary: The all caps is part of the specification.
"In some cases, a server or client might need to provide explicit directives to
the HTTP caches. We use the Cache-Control header for this
purpose. The Cache-Control header allows a client or server
to transmit a variety of directives in either requests or responses. The Cache-Control directives MUST be obeyed by all caching mechanisms
along the request/response chain, and MUST pass through [all] proxy [and] gateway
applications. As a general rule, if there is any apparent conflict between header
values, the most restrictive interpretation is applied (that is, the one that is most
likely to preserve semantic transparency)."
Hence, if a response contains Cache-Control: no-cache, a
reverse proxy or other form of intermediary cannot provide the cached response
to another request without revalidating the response with the originating server. As
another example, if a request contains Cache-Control: max-age=60, the client is
declaring its unwillingness to accept a response older than 60 seconds.
Here are some other helpful directives:
- The directive
public allows the response to be cached
everywhere possible. In contrast, private dictates that the
response can only be stored in the client's cache (typically, a browser's cache).
- The directive must-revalidate forces caches to revalidate a stale response with the
server before reusing it. A response remains viable if the server returns a 304
Not Modified response, indicating that the asset has not changed. Otherwise, the
server returns a new complete response, which must replace what was persisted. A
variant, the proxy-revalidate directive, requires only shared (public) caches to revalidate.
Note: All the cache-control directives can be found in Section 14.9 of the HTTP
1.1 specification (see Resources).
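To make validation concrete, here is a minimal sketch of the server's side of the exchange, assuming the validator is a Last-Modified date that the cache sends back in an If-Modified-Since request header. The function name validation_status is hypothetical.

```php
<?php
// Hypothetical helper: choose between 304 Not Modified and a full 200 response.
// $ifModifiedSince is the raw request header value (or null if absent);
// $lastModified is the asset's modification time as a UNIX timestamp.
function validation_status(?string $ifModifiedSince, int $lastModified): int
{
    if ($ifModifiedSince === null) {
        return 200;                 // nothing to validate against
    }
    $since = strtotime($ifModifiedSince);
    if ($since === false) {
        return 200;                 // unparseable date: send the full response
    }
    // Unchanged since the cached copy was taken: the cache may keep its copy
    return ($lastModified <= $since) ? 304 : 200;
}
```

In a real page, the header value would typically come from $_SERVER['HTTP_IF_MODIFIED_SINCE'], and a 304 result would be sent with http_response_code(304) and an empty body.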
Directives can also be combined, as in Cache-Control: public,max-age=30.
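A combined header is just a comma-separated list, so it is easy to take apart. The sketch below is a simplified parser under stated assumptions: the function name parse_cache_control is hypothetical, and quoted arguments that themselves contain commas are not handled.

```php
<?php
// Hypothetical helper: split a Cache-Control value into its directives.
// Returns an array such as ['public' => true, 'max-age' => 30].
function parse_cache_control(string $value): array
{
    $directives = [];
    foreach (explode(',', $value) as $part) {
        $part = trim($part);
        if ($part === '') {
            continue;                               // skip empty list members
        }
        $pieces = explode('=', $part, 2);
        $name = strtolower(trim($pieces[0]));       // directive names are case-insensitive
        // A directive without an argument is recorded as true
        $arg = isset($pieces[1]) ? trim($pieces[1], " \t\"") : true;
        // Numeric arguments such as max-age become integers
        $directives[$name] = (is_string($arg) && ctype_digit($arg)) ? (int) $arg : $arg;
    }
    return $directives;
}

print_r(parse_cache_control('public,max-age=30'));
```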
Why the meta-tag isn't enough
HTML provides two meta-tags, meta http-equiv="Expires" and meta http-equiv="Pragma",
to control how the browser caches content. The tags are easily embedded in the head
element of any page, much like the following.
<meta http-equiv="Expires"
content="Wed, 26 Dec 2007 19:50:49 GMT" />
<meta http-equiv="Pragma"
content="no-cache" />
However, neither tag is interpreted by proxies and caches because neither entity parses
content. To control the destiny of your pages, use the proper HTTP V1.1 headers.
Two other headers are used in tandem with Cache-Control to control retention:
- The
Expires header specifies a Greenwich Mean Time (GMT)
expiry. If the accompanying Cache-Control header permits
caching, Expires controls how long the asset can be cached.
However, if max-age (or its counterpart s-maxage for shared caches) is set, its value overrules Expires.
- The
Last-Modified header indicates when the asset last
changed — again, in GMT. As mentioned, this header is used very often to validate the contents of a cache.
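The precedence among these headers can be sketched as a small function. This is a simplification under stated assumptions: the function name freshness_lifetime is hypothetical, the directives are supplied as an already-parsed array, and the Expires lifetime is measured from the supplied clock value rather than from the response's Date header.

```php
<?php
// Hypothetical helper: how many seconds may a cache keep this response?
// $directives is a parsed Cache-Control header, e.g. ['max-age' => 30].
// Returns null when the response supplies no explicit lifetime.
function freshness_lifetime(array $directives, ?string $expires, int $now, bool $sharedCache = true): ?int
{
    if ($sharedCache && isset($directives['s-maxage'])) {
        return (int) $directives['s-maxage'];   // shared caches prefer s-maxage
    }
    if (isset($directives['max-age'])) {
        return (int) $directives['max-age'];    // max-age overrules Expires
    }
    if ($expires !== null) {
        $when = strtotime($expires);
        if ($when !== false) {
            return max(0, $when - $now);        // a past date means already stale
        }
    }
    return null;
}
```

For example, a response carrying both Cache-Control: max-age=30 and an Expires date an hour away yields a lifetime of 30 seconds.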
Hence, for purposes of caching assets that PHP generates, you must set one or more of
the headers Cache-Control, Expires, or Last-Modified. (To learn why
you cannot control a cache sufficiently from HTML, see "Why the
meta-tag isn't enough.")
Building and installing Varnish
To watch the HTTP headers in action, let's build, install, and run an HTTP reverse
proxy to cache the output of a small PHP application. Varnish is a relatively new, but
very capable, high-performance HTTP reverse proxy. (To learn more, read about Varnish's
construction at the Varnish wiki; see Resources.) Varnish also
provides monitors and a complete scripting language, Varnish Configuration Language
(VCL), to fine-tune behaviors. For instance, the code snippet below directs Varnish to
cache certain file types that typically represent static content:
sub vcl_recv {
    if (req.request == "GET" && req.url ~ "\.(gif|jpg|swf|css|js)$") {
        lookup;
    }
}
Like most open source packages, Varnish builds readily on several platforms, including
FreeBSD, Linux®, and Mac OS X. Varnish is also available in binary form for
several systems, if you prefer using a package manager, such as apt or port. This article is based on
Varnish V1.1.2, the latest release as of December 2007.
To begin, download the source code from the Varnish Web site (see Resources), unpack the compressed TAR file, and change to the newly
created varnish-1.1.2 directory. Next, run the scripts ./autogen.sh and ./configure, in
that order. (The assumptions of the ./configure script are usually reasonable. However,
to customize the build to suit the specifics of your system, type ./configure --help to see which options are tunable. For example, if
your system keeps locally built binaries in /opt/local rather than /usr/local, you'd
run ./configure --prefix=/opt/local.) Finally, type make && sudo make install.
Listing 1. Building and installing
$ ./autogen.sh
+ aclocal
+ glibtoolize --copy --force
+ autoheader
+ automake --add-missing --copy --foreign
+ autoconf
$ ./configure
checking build system type... powerpc-apple-darwin9.1.0
checking host system type... powerpc-apple-darwin9.1.0
checking target system type... powerpc-apple-darwin9.1.0
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
...
config.status: config.h is unchanged
config.status: executing depfiles commands
$ make && sudo make install
...
/usr/bin/install -c .libs/varnishd /usr/local/sbin/varnishd
...
After two or three minutes of compiling C code, the build
process copies a handful of binaries to your local utilities directory. (Notice that
varnishd is copied to /usr/local/sbin by convention because it's a system utility.)
$ ls /usr/local/*bin/v*
/usr/local/bin/varnishadm /usr/local/bin/varnishreplay
/usr/local/bin/varnishhist /usr/local/bin/varnishstat
/usr/local/bin/varnishlog /usr/local/bin/varnishtop
/usr/local/bin/varnishncsa /usr/local/sbin/varnishd
The file varnishd, as its name implies, is the Varnish daemon — a perennial
service that caches and serves content from memory. The other utilities listed above
control and monitor the operation of varnishd. For instance, varnishstat continuously
feeds you Varnish statistics. The file varnishadm lets you send administrative commands to varnishd while it's running.
Out of the box, varnishd does not cache a response with a cookie, nor does it honor the
Cache-Control directives private and no-cache. Luckily, a little VCL fixes the problem, as
shown in Listing 2. (Code provided by Jean-François Bustarret; see Resources.)
Listing 2. A fragment of VCL to conform with PHP
backend default {
    set backend.host = "127.0.0.1";
    set backend.port = "80";
}
sub vcl_recv {
    if (req.request != "GET" && req.request != "HEAD") {
        pipe;
    }
    if (req.http.Expect) {
        pipe;
    }
    if (req.http.Authenticate) {
        pass;
    }
    if (req.http.Cache-Control ~ "no-cache") {
        pass;
    }
    lookup;
}
sub vcl_fetch {
    if (!obj.valid) {
        error;
    }
    if (!obj.cacheable) {
        pass;
    }
    if (obj.http.Set-Cookie) {
        pass;
    }
    if (obj.http.Pragma ~ "no-cache"
        || obj.http.Cache-Control ~ "no-cache"
        || obj.http.Cache-Control ~ "private") {
        pass;
    }
    insert;
}
Let's march through the code:
- The section backend default specifies which server to connect to if no command-line
option (-b hostname:port) is given.
- The function vcl_recv() is called when the daemon receives a client request.
Conversely, vcl_fetch() is called when the requested object has been retrieved from the
actual Web server or the request to the Web server has failed. As written, vcl_fetch()
declines to cache a response that sets a cookie, that carries a Pragma or Cache-Control
header containing no-cache, or whose Cache-Control header contains private.
- In the code, the operation pass implies "pass through," or do nothing for this
individual request/response exchange. pipe also passes data through from client to
server untouched, but does so for every subsequent transaction between the client and
server. (pipe is a continuous uninterrupted pass until either end closes the
connection.) lookup tries to find the response in the cache, while insert adds a
response to the cache.
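For readers more at home in PHP than VCL, the header tests in vcl_fetch() can be mirrored roughly as follows. This is not Varnish code: the function name would_insert is hypothetical, the obj.valid and obj.cacheable checks are left out, and the substring match here ignores case, whereas the VCL match does not.

```php
<?php
// Rough PHP mirror of the header tests in vcl_fetch (Listing 2).
// $headers maps lowercase response header names to their values.
// Returns true where the VCL would reach "insert", false where it would "pass".
function would_insert(array $headers): bool
{
    if (isset($headers['set-cookie'])) {
        return false;                       // responses that set cookies are passed
    }
    $pragma = $headers['pragma'] ?? '';
    $cacheControl = $headers['cache-control'] ?? '';
    if (stripos($pragma, 'no-cache') !== false
        || stripos($cacheControl, 'no-cache') !== false
        || stripos($cacheControl, 'private') !== false) {
        return false;                       // the origin asked not to be cached
    }
    return true;                            // safe to add to the cache
}

var_dump(would_insert(['cache-control' => 'public,max-age=10'])); // bool(true)
var_dump(would_insert(['cache-control' => 'private']));           // bool(false)
```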
To continue, save the contents to a file — say,
/usr/local/etc/varnish/php.vcl — and start varnishd with the command:
$ sudo varnishd -a localhost:8080 \
-f /usr/local/etc/varnish/php.vcl
After a few moments, you should see output that resembles the following:
file ./varnish.einpln (unlinked)
size 1069547520 bytes (1020 fs-blocks, 261120 pages)
Creating new SHMFILE
The varnishd daemon is now ready for connections. From a terminal window, run
varnishstat. As shown in Figure 3, varnishstat reveals that the daemon is running (the
runtime is shown at top left), although no activity has been recorded yet. The number of
free bytes in the cache can be found at bottom.
Figure 3. varnishstat shows caching and connection activity
Next, generate some activity. Connect to port 8080 and browse your Web site. Watch the
varnishstat display closely. Do any pages or assets appear in the cache?
Make your PHP cache-friendly
Before you continue, download and install the latest version of Firefox. When
installed, launch Firefox and visit the add-ons page to install the Live HTTP Headers
plug-in (see Resources). Among its many tricks, Live HTTP
Headers shows you the HTTP headers of every incoming response. (You can filter out
requests for image and CSS files, if you like.) Restart Firefox when prompted and open the Live HTTP Headers window.
Save the code in Listing 3 so your Web server can find it and point Firefox to the
address of the new PHP page through the reverse proxy. For instance, if the URL was
http://www.example.com/misc/cache.php, you'd likely point to
http://localhost:8080/misc/cache.php. Visit the same URL from many browsers and through
wget. You should see active use of the cache. Pause 20
seconds between requests; you should see an obvious cache miss because this content has
expired per its outgoing headers.
Listing 3. Sample PHP code to manipulate the cache
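<?php
//
// Emit headers before any other content
//
cache_control( "public,max-age=10" );
expires( to_gmt( time() + 10 ) );
?>
<html>
<head>
</head>
<body>
<?php
// Print the generation time; a cached copy keeps showing the same timestamp
print to_gmt();
?>
</body>
</html>
<?php
// Format a UNIX timestamp (default: now) as an HTTP date,
// such as "Wed, 26 Dec 2007 19:50:49 GMT"
function to_gmt( $now = null ) {
    return gmdate( 'D, d M Y H:i:s', ( $now === null ) ? time() : $now ) . ' GMT';
}
// Send a Last-Modified header (not called above, but useful for validation)
function last( $gmt ) {
    header("Last-Modified: $gmt");
}
// Send an Expires header
function expires( $gmt ) {
    header("Expires: $gmt");
}
// Send a Cache-Control header with the given directives
function cache_control( $options ) {
    header("Cache-Control: $options");
}
?>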
Admittedly, this tiny application is only slightly more useful than Hello World, but it
demonstrates all that's required to cooperate with Varnish and other reverse proxies.
Even if you generate the bulk of your pages dynamically, those pages can likely be
cached for a few seconds at least, sparing your server in the meantime. In contrast,
you may want to preclude caching for other pages. Now you know the rules and have the
software — Varnish — to realize a very valuable optimization.
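One way to handle both cases, caching some pages briefly and precluding caching for others, is to build the headers in a single helper. The sketch below is only one possible arrangement; the function name cache_header_lines and its parameters are hypothetical, and it returns the header lines rather than sending them so the result is easy to inspect.

```php
<?php
// Hypothetical helper: build cache headers for a page.
// A positive $seconds allows public caching for that long;
// zero or less asks caches not to reuse the response without revalidating.
function cache_header_lines(int $seconds, ?int $now = null): array
{
    $now = $now ?? time();
    if ($seconds <= 0) {
        return ['Cache-Control: no-cache'];
    }
    return [
        "Cache-Control: public,max-age=$seconds",
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $seconds) . ' GMT',
    ];
}

// In a real page, send the lines before any other output:
// foreach (cache_header_lines(10) as $line) { header($line); }
print_r(cache_header_lines(10));
```

With the VCL in Listing 2, a response carrying Cache-Control: no-cache is passed through rather than inserted into the cache.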
Prepare for success
On the modern World Wide Web, most pages are no longer hand-coded. Instead, they are
generated by an application and delivered on demand, making each page custom and
personal.
But "There ain't no such thing as a free lunch." The time and effort
required to craft HTML merely shifted from man to machine. And although a machine may
be faster by orders of magnitude, it is nonetheless a finite resource. The smart PHP
developer respects this certitude. The smart PHP developer plans to scale. Database
queries are efficient, servers are made redundant, memory is used efficiently. And now,
you can cache your dynamic pages. Bring on the traffic!
Resources
Learn
- Read other articles in the series "Make PHP apps fast, faster, fastest."
- Learn more about Varnish's construction at the Varnish wiki.
- Learn more about Section 13.1.3 and Section 14.9 of the HTTP V1.1 protocol specification.
- Read the HTTP V1.1 protocol specification for more information about HTTP headers and the behavior of public intermediary and private caches.
- Discover how Varnish was constructed.
- Visit Jean-François Bustarret's blog for more information about VCL and PHP.
- PHP.net is the central resource for PHP developers.
- Check out the "Recommended PHP reading list."
- Browse all the PHP content on developerWorks.
- Expand your PHP skills by checking out IBM developerWorks' PHP project resources.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Using a database with PHP? Check out the Zend Core for IBM, a seamless, out-of-the-box, easy-to-install PHP development and production environment that supports IBM DB2 V9.
- Stay current with developerWorks' Technical events and webcasts.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- Download the Varnish source code.
- Explore the PECL repository, your first stop for all known extensions and hosting facilities for downloading and developing PHP extensions.
- Download the add-ons for Firefox at the Live HTTP Headers Firefox plug-in.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
About the author
Martin Streicher is chief technology officer for McClatchy Interactive, editor-in-chief of Linux Magazine, a Web developer, and a regular contributor to developerWorks. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1986.