Scaling PHP applications with Varnish

Deploy a reverse proxy to stretch your server capacity

Stretch the capacity of your Web server farm with PHP and a reverse proxy, such as Varnish.

Martin Streicher (martin.streicher@gmail.com), Editor-in-Chief, McClatchy Interactive

Martin Streicher is chief technology officer for McClatchy Interactive, editor-in-chief of Linux Magazine, a Web developer, and a regular contributor to developerWorks. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1986.


04 March 2008

The history of the World Wide Web may be short, but its virtual landscape is already littered with scads of digital dross. The tarnished logos of so many failed dot-coms are strewn to and fro, discarded (or repossessed) servers sit idle, collecting dust, and almost everyone from Silicon Valley to Silicon Alley has a tall tale to tell. "Why, when I was just a lad, we didn't have fancy WYSIWYG editors. We hand-coded our HTML, and we liked it! Ah, those were the days of baud, boy!"

Frequently used acronyms


CSS: Cascading Style Sheets
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
RAM: Random Access Memory

Thankfully, much has changed from those heady times of the mid-1990s. Designers have fancy tools to create Web sites, as do developers. Scripting languages, including PHP, are conveniences, and frameworks like CakePHP (see Resources) accelerate all stages of coding. Sites have also learned how to scale to keep pace with demand. Need more bandwidth? Lease a bigger pipe. Need to run faster? Crank up the clock cycles. Need to push more pages? Deploy more Web servers.

Yet more servers? Perhaps. If you have cash to burn.

In fact, you can scale a site many ways, and multiplying servers is but one (albeit often practical and necessary) approach. Another technique reallocates existing servers to defuse overwhelming incoming traffic. The kernel of the idea: Why generate a page anew again and again? There are many cases in which a generated page can live for seconds, even longer. The trick is to keep the page handy when the second, third, and 10,000th visitor visits its URL.

Here, I combine PHP with smart software called a reverse proxy to cache pages and save servers.

Why a reverse proxy?

Like your computer's memory cache, or like a PHP opcode cache, a reverse proxy eliminates rework and hastens the delivery of oft-requested data.

Specifically, a reverse proxy intercedes between Web clients and your Web server to capture each incoming HTTP request and its respective HTTP response. Then, given a repertoire of requests and matching responses, the reverse proxy can act as if it were the genuine Web server. In some instances, the reverse proxy may simply pass an incoming request through to the Web server. But in other cases, a reverse proxy can choose to process the request itself.

Think of a cached HTTP response as a form letter: The reverse proxy simply sends a form letter in response to a particular request. The second and third (etc.) request for an asset (page or image, for example) receives the same response as the original request. (An example exchange is shown in Figure 1.)

Figure 1. Hypothetical reverse proxy shares same response with many clients

Figure 2 depicts the relationships between client, server, and the reverse proxy. The Web client — Firefox or Safari, for example — connects to the public-facing Web "server" on port 80. In fact, the "server" is actually a reverse proxy. Only the proxy can connect to the actual Web server through port 2001. If the proxy can't fulfill a request or isn't permitted to fulfill a request because of caching rules (discussed momentarily), the proxy defers to the Web server. Again, depending on mandates and the type of request, the proxy may cache the response and forward it to the client.
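
To make the hit-or-miss flow concrete, here is a minimal sketch of the idea in PHP (assuming PHP 7 or later). It is a toy model, not how Varnish is implemented: the class TinyProxyCache, its get() method, and the callable that stands in for the Web server are all hypothetical names invented for this illustration.

```php
<?php
// Toy model of a reverse proxy's cache. Every name here is hypothetical.
final class TinyProxyCache
{
    private $origin;        // callable(string $url): string, stands in for the Web server
    private $ttl;           // seconds a stored response stays fresh
    private $store = [];    // url => ['body' => string, 'expires' => int]
    public $originHits = 0; // how often the "Web server" had to do the work

    public function __construct(callable $origin, int $ttl)
    {
        $this->origin = $origin;
        $this->ttl = $ttl;
    }

    // Serve from the cache while the entry is fresh; otherwise defer to
    // the origin, store its answer, and hand that answer back.
    public function get(string $url, int $now): string
    {
        if (isset($this->store[$url]) && $this->store[$url]['expires'] > $now) {
            return $this->store[$url]['body'];
        }
        $this->originHits++;
        $body = ($this->origin)($url);
        $this->store[$url] = ['body' => $body, 'expires' => $now + $this->ttl];
        return $body;
    }
}
```

With a lifetime of 10 seconds, any number of requests for the same URL inside that window costs the origin exactly one page generation. A real proxy also weighs the HTTP headers described next before it stores or reuses anything.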

Figure 2. Relationship between Web client, reverse proxy, and Web server

In addition to expediting delivery, the reverse-proxy caching scheme provides many other benefits, including:

  • The cache that the reverse proxy maintains and serves can unburden your Web application. The computation required to generate a response is performed more infrequently because common requests are fulfilled in the interim by the cache.
  • The reverse proxy can unencumber your database server. Most Web applications depend on a database, so each request requires one or more queries. Fewer requests translate to fewer queries and a snappier database server. In fact, if the paucity of database connections or the laggardly response of your database server is systemic, embedding one or more reverse proxies could beget widespread gain.
  • The cache can countervail server load. Running less code and fewer queries improves throughput.

Memory is the best store for a cache because access time is (practically) instantaneous and RAM is typically plentiful (or cheap to amass). However, the file system can also act as a cache store. It's vastly more abundant and affordable than memory, albeit far slower to access.

Of course, assets can and do change rapidly on the Web, and cached assets eventually become stale or out of date. Each request and response can specify its own "freshness date," and the cache expires each datum on schedule, usually a few seconds after its capture.

The contents of the cache are manipulated through HTTP headers, the preamble of each HTTP request and response. Headers can set the expiry for an asset and can subvert caching. (The complex, subtle, and sometimes contradictory caching strategies of the client, server, proxy, and other agents are beyond the scope of this article.) Indeed, HTTP cache control is part of the HTTP V1.1 protocol specification (see Resources).


The quote below provides a snippet of Section 13.1.3 of the HTTP 1.1 protocol specification (see Resources). Note that the emphasis on the word MUST isn't editorial commentary: The all caps is part of the specification.

"In some cases, a server or client might need to provide explicit directives to the HTTP caches. We use the Cache-Control header for this purpose. The Cache-Control header allows a client or server to transmit a variety of directives in either requests or responses. The Cache-Control directives MUST be obeyed by all caching mechanisms along the request/response chain, and MUST pass through [all] proxy [and] gateway applications. As a general rule, if there is any apparent conflict between header values, the most restrictive interpretation is applied (that is, the one that is most likely to preserve semantic transparency)."

Hence, if a response contains Cache-Control: no-cache, a reverse proxy or other form of intermediary cannot provide the cached response to another request without re-validating the response with the originating server. As another example, if a request contains Cache-Control: max-age=60, the client is declaring its unwillingness to accept a response older than 60 seconds.

Here are some other helpful directives:

  • The directive public allows the response to be cached everywhere possible. In contrast, private dictates that the response can only be stored in the client's cache (typically, a browser's cache).
  • The directive must-revalidate forces caches to validate a stale response with the originating server before reusing it. The cached response remains viable if the server returns 304 Not Modified, indicating that the asset has not changed. Otherwise, the server returns a new complete response, which must replace what was persisted. A variant, the proxy-revalidate directive, imposes the same requirement on shared caches only.

Note: All the cache-control directives can be found in Section 14.9 of the HTTP 1.1 specification (see Resources).

Directives can also be combined, as in Cache-Control: public,max-age=30.
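
As a rough illustration of how a cache might read such a combined header, the following sketch splits a Cache-Control value into its directives and applies a deliberately conservative storage rule. The function names parse_cache_control() and shared_cache_may_store() are hypothetical, and the parser assumes PHP 7 or later and makes no attempt to handle quoted values that contain commas.

```php
<?php
// Split a Cache-Control value into directive => value pairs.
// Directives without a value (public, no-cache) map to true.
// Hypothetical helper for illustration only.
function parse_cache_control(string $value): array
{
    $directives = [];
    foreach (explode(',', $value) as $part) {
        $part = trim($part);
        if ($part === '') {
            continue;
        }
        $pieces = explode('=', $part, 2);
        $name = strtolower(trim($pieces[0]));
        $directives[$name] = isset($pieces[1]) ? trim($pieces[1], " \t\"") : true;
    }
    return $directives;
}

// May a shared cache, such as a reverse proxy, keep this response?
// Assumption: no-cache is treated as "do not store" for simplicity,
// although the specification only demands revalidation before reuse.
function shared_cache_may_store(array $directives): bool
{
    return !isset($directives['private'])
        && !isset($directives['no-store'])
        && !isset($directives['no-cache']);
}
```

Given public,max-age=30, the parser yields the two directives public and max-age, and the storage check answers yes; given private,max-age=60, it answers no.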

Two other headers are used in tandem with Cache-Control to control retention:

  • The Expires header specifies a Greenwich Mean Time (GMT) expiry. If the accompanying Cache-Control header permits caching, Expires controls how long the asset can be cached. However, if max-age (or its counterpart for shared caches, s-maxage) is set, its value overrules Expires.
  • The Last-Modified header indicates when the asset last changed — again, in GMT. As mentioned, this header is used very often to validate the contents of a cache.
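
The precedence among these headers can be sketched as a single function. This is a simplified reading of the rules from a shared cache's point of view, assuming PHP 7.1 or later; the name freshness_lifetime() and its parameters are hypothetical, and the directives are expected as an already-parsed map such as ['max-age' => '30'].

```php
<?php
// How many seconds may a shared cache treat a response as fresh?
// $cacheControl: directive => value map, for example ['max-age' => '30']
// $expires:      value of the Expires header, or null if absent
// $date:         Unix timestamp at which the response was generated
// Hypothetical helper for illustration only.
function freshness_lifetime(array $cacheControl, ?string $expires, int $date): int
{
    // s-maxage applies to shared caches and wins over everything else.
    if (isset($cacheControl['s-maxage'])) {
        return max(0, (int) $cacheControl['s-maxage']);
    }
    // max-age overrules Expires.
    if (isset($cacheControl['max-age'])) {
        return max(0, (int) $cacheControl['max-age']);
    }
    // Otherwise fall back to Expires, measured from the response date.
    if ($expires !== null) {
        $when = strtotime($expires);
        // An unparsable date counts as already expired.
        return $when === false ? 0 : max(0, $when - $date);
    }
    return 0;
}
```

A response generated at 12:00:00 GMT with an Expires one minute later is fresh for 60 seconds; add max-age=30 and the lifetime drops to 30 seconds, regardless of Expires.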

Hence, for purposes of caching assets that PHP generates, you must set one or more of the headers Cache-Control, Expires, or Last-Modified. (To learn why you cannot control a cache sufficiently from HTML, see "Why the meta-tag isn't enough.")
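
Validation with Last-Modified works roughly as follows: the cache sends the date it holds in an If-Modified-Since request header, and the application answers 304 Not Modified if nothing has changed since then. Here is a minimal sketch, assuming PHP 7.1 or later; the helpers is_not_modified() and http_date() are hypothetical names, and how you determine an asset's last change is up to your application.

```php
<?php
// Format a Unix timestamp as an HTTP date; the trailing GMT is required.
function http_date(int $timestamp): string
{
    return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}

// Can a conditional request be answered with 304 Not Modified?
// $ifModifiedSince: value of the If-Modified-Since request header, or null
// $lastModified:    Unix timestamp of the asset's last change
// Hypothetical helper for illustration only.
function is_not_modified(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null) {
        return false;
    }
    $since = strtotime($ifModifiedSince);
    return $since !== false && $lastModified <= $since;
}

// Possible use at the top of a page, before any output:
//
//   $since = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null;
//   header('Last-Modified: ' . http_date($lastChange));
//   if (is_not_modified($since, $lastChange)) {
//       header('HTTP/1.1 304 Not Modified');
//       exit;
//   }
```

In the commented usage, $lastChange stands for whatever timestamp your application tracks for the page. Answering 304 spares the server from regenerating and resending the full body.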


Building and installing Varnish

To watch the HTTP headers in action, let's build, install, and run an HTTP reverse proxy to cache the output of a small PHP application. Varnish is a relatively new, but very capable, high-performance HTTP reverse proxy. (To learn more, read about Varnish's construction at the Varnish wiki; see Resources.) Varnish also provides monitors and a complete scripting language, Varnish Configuration Language (VCL), to fine-tune behaviors. For instance, the code snippet below directs Varnish to cache certain file types that typically represent static content:

sub vcl_recv {
  if (req.request == "GET" && req.url ~ "\.(gif|jpg|swf|css|js)$") {
    lookup;
  }
}

Like most open source packages, Varnish builds readily on several platforms, including FreeBSD, Linux®, and Mac OS X. Varnish is also available in binary form for several systems, if you prefer using a package manager, such as apt or port. This article is based on Varnish V1.1.2, the latest release as of December 2007.

To begin, download the source code from the Varnish Web site (see Resources), unpack the compressed TAR file, and change to the newly created varnish-1.1.2 directory. Next, run the scripts ./autogen.sh and ./configure, in that order. (The assumptions of the ./configure script are usually reasonable. However, to customize the build to suit the specifics of your system, type ./configure --help to see which options are tunable. For example, if your system keeps locally built binaries in /opt/local rather than /usr/local, you'd run ./configure --prefix=/opt/local.) Finally, type make && sudo make install.

Listing 1. Building and installing
$ ./autogen.sh
+ aclocal
+ glibtoolize --copy --force
+ autoheader
+ automake --add-missing --copy --foreign
+ autoconf

$ ./configure
checking build system type... powerpc-apple-darwin9.1.0
checking host system type... powerpc-apple-darwin9.1.0
checking target system type... powerpc-apple-darwin9.1.0
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
...
config.status: config.h is unchanged
config.status: executing depfiles commands

$ make && sudo make install 
...
/usr/bin/install -c .libs/varnishd /usr/local/sbin/varnishd
...

After two or three minutes of compiling C code, the build process copies a handful of binaries to your local utilities directory. (Notice that varnishd is copied to /usr/local/sbin by convention because it's a system utility.)

$ ls /usr/local/*bin/v*
/usr/local/bin/varnishadm /usr/local/bin/varnishreplay
/usr/local/bin/varnishhist  /usr/local/bin/varnishstat
/usr/local/bin/varnishlog /usr/local/bin/varnishtop
/usr/local/bin/varnishncsa  /usr/local/sbin/varnishd

The file varnishd, as its name implies, is the Varnish daemon — a perennial service that caches and serves content from memory. The other utilities listed above control and monitor the operation of varnishd. For instance, varnishstat continuously feeds you Varnish statistics. The file varnishadm lets you send administrative commands to varnishd while it's running.

Out of the box, varnishd does not cache a response with a cookie, nor does it honor the Cache-Control directives private and no-cache. Luckily, a little VCL fixes the problem, as shown in Listing 2. The code was provided by Jean-François Bustarret (see Resources).

Listing 2. A fragment of VCL to conform with PHP
backend default {
  set backend.host = "127.0.0.1"; 
  set backend.port = "80"; 
}

sub vcl_recv {
  if (req.request != "GET" && req.request != "HEAD") {
    pipe;
  }
  if (req.http.Expect) {
    pipe;
  }
  if (req.http.Authorization) {
    pass;
  }
  if (req.http.Cache-Control ~ "no-cache") {
     pass;
  }

  lookup;
}

sub vcl_fetch {
  if (!obj.valid) {
    error;
  }

  if (!obj.cacheable) {
    pass;
  }

  if (obj.http.Set-Cookie) {
    pass;
  }

  if (obj.http.Pragma ~ "no-cache" 
    || obj.http.Cache-Control ~ "no-cache" 
    || obj.http.Cache-Control ~ "private") {
    pass;
  }

  insert;
}

Let's march through the code:

  • The section backend default specifies which server to connect to if no command-line option (-b hostname:port) is given.
  • The function vcl_recv() is called when the daemon receives a client request. Conversely, vcl_fetch() is called when the requested object has been retrieved from the actual Web server or the request to the Web server has failed. As written, vcl_fetch() declines to cache a response that sets a cookie, that carries no-cache in its Cache-Control or Pragma header, or that is marked private in Cache-Control.
  • In the code, the operation pass implies "pass through," or do nothing for this individual request/response exchange. pipe also passes data through from client to server untouched, but does so for every subsequent transaction between the client and server. (pipe is a continuous uninterrupted pass until either end closes the connection.) lookup tries to find the response in the cache, while insert adds a response to the cache.
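
If VCL is unfamiliar, it may help to read the same decisions as ordinary PHP. The sketch below is only a model of the logic in Listing 2, not anything Varnish runs. The functions recv_action() and fetch_action() are hypothetical, header names are assumed to arrive as lowercase array keys, the credentials check looks at the standard Authorization request header, and a plain substring test stands in for the VCL ~ match.

```php
<?php
// Model of vcl_recv: what to do with an incoming client request.
// Hypothetical helper; $headers maps lowercase header names to values.
function recv_action(string $method, array $headers): string
{
    if ($method !== 'GET' && $method !== 'HEAD') {
        return 'pipe';              // not cacheable; hand the connection over
    }
    if (isset($headers['expect'])) {
        return 'pipe';
    }
    if (isset($headers['authorization'])) {
        return 'pass';              // request carries credentials
    }
    if (strpos($headers['cache-control'] ?? '', 'no-cache') !== false) {
        return 'pass';              // client insists on a fresh copy
    }
    return 'lookup';                // try the cache
}

// Model of vcl_fetch: what to do with the Web server's response.
function fetch_action(bool $valid, bool $cacheable, array $headers): string
{
    if (!$valid) {
        return 'error';
    }
    if (!$cacheable || isset($headers['set-cookie'])) {
        return 'pass';
    }
    $pragma = $headers['pragma'] ?? '';
    $control = $headers['cache-control'] ?? '';
    if (strpos($pragma, 'no-cache') !== false
        || strpos($control, 'no-cache') !== false
        || strpos($control, 'private') !== false) {
        return 'pass';
    }
    return 'insert';                // store the response in the cache
}
```

Reading it this way makes the order of the checks plain: anything that is not a GET or HEAD never reaches the cache, and a response is inserted only after every reason to pass has been ruled out.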

To continue, save Listing 2 to a file (say, /usr/local/etc/varnish/php.vcl) and start varnishd with the command:

$ sudo varnishd -a localhost:8080 \
  -f /usr/local/etc/varnish/php.vcl

After a few moments, you should see output that resembles the following:

file ./varnish.einpln (unlinked) 
size 1069547520 bytes (1020 fs-blocks, 261120 pages)
Creating new SHMFILE

The varnishd daemon is now ready for connections. From a terminal window, run varnishstat. As shown in Figure 3, varnishstat reveals that the daemon is running (the runtime is shown at top left), although no activity has been recorded yet. The number of free bytes in the cache can be found at bottom.

Figure 3. varnishstat shows caching and connection activity

Next, generate some activity. Connect to port 8080 and browse your Web site. Watch the varnishstat display closely. Do any pages or assets appear in the cache?


Make your PHP cache-friendly

Before you continue, download and install the latest version of Firefox. When installed, launch Firefox and visit the add-ons page to install the Live HTTP Headers plug-in (see Resources). Among its many tricks, Live HTTP Headers shows you the HTTP headers of every incoming response. (You can filter out requests for image and CSS files, if you like.) Restart Firefox when prompted and open the Live HTTP Headers window.

Save the code in Listing 3 so your Web server can find it and point Firefox to the address of the new PHP page through the reverse proxy. For instance, if the URL was http://www.example.com/misc/cache.php, you'd likely point to http://localhost:8080/misc/cache.php. Visit the same URL from many browsers and through wget. You should see active use of the cache. Pause 20 seconds between requests; you should see an obvious cache miss because this content has expired per its outgoing headers.

Listing 3. Sample PHP code to manipulate the cache
<?php
  //
  // Emit headers before any other content
  //
  cache_control( "public,max-age=10" );
  expires( to_gmt( time() + 10 ) );
?>
<html>
  <head>
  </head>
  <body>
    <?php
      // Time of generation; a cached copy shows an older time
      print to_gmt();
    ?>
  </body>
</html>
<?php
  // Format a Unix timestamp (default: now) as an HTTP date.
  // HTTP dates must end in GMT.
  function to_gmt( $now = null ) {
    return gmdate( 'D, d M Y H:i:s', ( $now === null ) ? time() : $now ) . ' GMT';
  }

  // Not called above; available for pages that track their last change
  function last( $gmt ) {
    header("Last-Modified: $gmt");
  }

  function expires( $gmt ) {
    header("Expires: $gmt");
  }

  function cache_control( $options ) {
    header("Cache-Control: $options");
  }
?>

Admittedly, this tiny application is only slightly more useful than Hello World, but it demonstrates all that's required to cooperate with Varnish and other reverse proxies. Even if you generate the bulk of your pages dynamically, those pages can likely be cached for a few seconds at least, sparing your server in the meantime. In contrast, you may want to preclude caching for other pages. Now you know the rules and have the software — Varnish — to realize a very valuable optimization.
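
One way to handle both cases, pages that may be cached briefly and pages that must not be, is to route every page through a single helper. The following is a sketch under stated assumptions rather than a recommended API: cache_headers() is a hypothetical name, it assumes PHP 7 or later, and it returns the header lines instead of emitting them so that the logic is easy to inspect.

```php
<?php
// Build the header lines for a page.
// $ttl > 0:  the page may be cached by anyone for $ttl seconds.
// $ttl <= 0: the page is marked private and no-cache, which the VCL
//            in Listing 2 passes through without caching.
// Hypothetical helper for illustration only.
function cache_headers(int $ttl, int $now): array
{
    if ($ttl <= 0) {
        return ['Cache-Control: private, no-cache'];
    }
    return [
        "Cache-Control: public,max-age=$ttl",
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $ttl) . ' GMT',
    ];
}

// Possible use at the top of a page, before any output:
//
//   foreach (cache_headers(10, time()) as $line) {
//       header($line);
//   }
```

A product listing might call cache_headers(10, time()), while a shopping cart or account page would pass 0 to keep its content out of shared caches.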


Prepare for success

On the modern World Wide Web, most pages are no longer hand-coded. Instead, they are generated by an application and delivered on demand, making each page custom and personal.

But "There ain't no such thing as a free lunch." The time and effort required to craft HTML merely shifted from man to machine. And although a machine may be faster by orders of magnitude, it is nonetheless a finite resource. The smart PHP developer respects this certitude. The smart PHP developer plans to scale. Database queries are efficient, servers are made redundant, memory is used efficiently. And now, you can cache your dynamic pages. Bring on the traffic!

Resources

Learn

Get products and technologies

  • Download the Varnish source code.
  • Explore the PECL repository, your first stop for all known extensions and hosting facilities for downloading and developing PHP extensions.
  • Download the add-ons for Firefox at the Live HTTP Headers Firefox plug-in.
  • Innovate your next open source development project with IBM trial software, available for download or on DVD.
  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
