Using PHP and cURL for server-side caching of dynamic web pages

Ways to significantly reduce processing and load time

Leon Kiriliuk, Igor Todorovski, and Nancy Wang explain how they save a substantial amount of processing and load time by using PHP and cURL to cache dynamic web pages on the server. Their method caches session variables, as well as the HTML.

Igor Todorovski (itodorov@ca.ibm.com), Software Developer, IBM

Igor Todorovski is a software developer in the IBM XL Compilers group. He has been with IBM since 2008 and specializes in IBM z/OS C/C++ compilers.



Nancy Wang (wangn@ca.ibm.com), Software Developer, IBM

Nancy Wang is a software developer in the IBM XL Compilers group. She has been with IBM since 2007 and has worked on the UPC, Fortran, and C/C++ compilers.



Leon Kiriliuk (leonk@ca.ibm.com), Test Manager, IBM

Leon Kiriliuk is a manager of the IBM COBOL and PLX test teams. Leon is also an experienced PHP developer, and he created the compiler technologies website.



18 June 2013

Introduction

This article documents a method to cache dynamic web pages on the server side using PHP and cURL. The method is based on the assumption that the web page is purely dynamic and receives input from the user. It uses frequently submitted user forms as a guide to direct the caching.

This method caches the relevant session variables, as well as the HTML output. When the web page is reloaded with the cached data, the page is served straight from the cache file without reprocessing, and the relevant session data is restored.

Dynamic web pages allow the web content to change and are geared toward the user's requirements. However, it is this dynamic nature that can cause a significant increase in load time.

If your website is dynamic, it probably requires significant data processing. This is generally the case for statistics-based websites. Typically, the user submits a query, and the back-end receives the query and processes the data. After a while, the back-end sends the information to the user in HTML format.


Problem

Consider a website that displays statistics based on a user-specified time period. When the user submits the web form, the server script (PHP, for example) processes the data in the background and displays the information to the user up to 30 minutes later. During those 30 minutes, the PHP process is gathering the relevant data, analyzing it, comparing it, and displaying the information in HTML output.

In addition, the web page uses session variables to save statistical data based on the user's form request. The website is dependent on the session data to display statistics across multiple pages.

Figure 1. Request and response processing (diagram: the user sends a request and receives a response)

The website provides yearly and monthly statistics as default pages. These pages account for most of the traffic on the entire website, yet they still take up to 30 minutes to load. How can such pages be optimized? How can we maintain the session data while also optimizing the performance of the website? Finally, how do we go from 30-minute load times to instantaneous load times while maintaining the session data? The answer is caching.


Solution

The solution is to cache your web pages through a program that retrieves content from the web, such as cURL or Wget. For this article, we use the cURL program to demonstrate this solution. The solution provided uses the PHP language, but it can be applied to any language, such as Java, Python, Ruby, or shell script. Additionally, this solution is especially useful for users who prefer speed over up-to-date data.

Caching dynamic web pages

The curl command is preinstalled on most UNIX systems; if it is missing from yours, you can download it (see Resources for a link).

Assume that your web page has the address of http://localhost/. When a user submits a query with a specified time period, such as starttime=2012.01.01, the request URL carries it as a GET variable:
http://localhost/?starttime=2012.01.01

Using cURL, you can dynamically cache frequently specified form inputs. For example, you can use the curl command as follows (the URL is quoted so that the shell does not interpret characters such as ? and &):
/usr/bin/curl -s 'http://localhost/?starttime=2012.01.01' > /cache/localhost.2012.01.01.html

This command will fetch the output from this URL:
http://localhost/?starttime=2012.01.01

And redirect to a cache file:
/cache/localhost.2012.01.01.html

You can add this script as a cron job that will periodically cache frequently requested web pages. For example, the cron job can run every day and save an updated HTML cache file daily.
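
One caveat with redirecting curl output straight into the cache file: the shell truncates the target file before curl starts, so a visitor could be served an empty or partial page while the slow request is still running. The following is a minimal sketch of one way around this, assuming the cron job runs a PHP command-line script rather than the plain curl command. It writes the page to a temporary file and renames it into place. The function name write_cache_atomically() and the calling script are hypothetical and not part of the original method.

```php
<?php
// Hypothetical helper: replace a cache file in one step so that readers
// never see a half-written page. Returns true on success.
function write_cache_atomically($cache_file, $html)
{
    // Create the temporary file in the cache directory so that rename()
    // stays on one filesystem, where it replaces the target atomically.
    $tmp = tempnam(dirname($cache_file), 'cache');
    if ($tmp === false) {
        return false;
    }
    if (file_put_contents($tmp, $html) === false) {
        unlink($tmp);
        return false;
    }
    // tempnam() creates the file with mode 0600; make it readable by the
    // web server user in case cron runs under a different account.
    chmod($tmp, 0644);
    if (!rename($tmp, $cache_file)) {
        unlink($tmp);
        return false;
    }
    return true;
}
```

The calling script would fetch the page first, for example with the PHP cURL extension (curl_init(), curl_setopt() with CURLOPT_RETURNTRANSFER, and curl_exec()), and pass the resulting HTML to this helper only if the request succeeded.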

Loading cached web pages

The PHP script that processes the dynamic requests can now load the cached HTML whenever the data for the selected starttime is already cached, rather than processing the data again.

Listing 1. Basic caching example
<?php
$starttime = $_GET['starttime'];
$cache_file = "/cache/localhost.$starttime.html";

if (file_exists($cache_file)) {
	// Cache exists: send the cached HTML to the browser as is
	readfile($cache_file);
} else {
	// Cache does not exist
	// Dynamically process the data and display it
}
?>

The script in Listing 1 checks whether the cache file exists for the user-specified starttime. If it does, the script prints the cache file contents. If the cache file does not exist, the data is dynamically loaded and displayed as before.
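
Because starttime comes directly from the request and becomes part of a file name, it is worth validating before it is used. The following is a minimal sketch, assuming that starttime always has the YYYY.MM.DD form shown in the examples. The helper name cache_path_for() is hypothetical.

```php
<?php
// Hypothetical helper: map a starttime value such as "2012.01.01" to its
// cache file, or return null when the value is not a plain date.
function cache_path_for($starttime, $cache_dir = '/cache')
{
    // Accept only YYYY.MM.DD; this rejects "../" and other path tricks.
    // The D modifier stops "$" from matching before a trailing newline.
    if (!is_string($starttime)
        || !preg_match('/^\d{4}\.\d{2}\.\d{2}$/D', $starttime)) {
        return null;
    }
    return $cache_dir . '/localhost.' . $starttime . '.html';
}
```

A script could call cache_path_for() with the value of $_GET['starttime'] and treat a null result as an invalid request instead of building a path from it.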

Improving the solution

That solution is a start, but it leaves several things to address:

  • The caching request itself must always go through the dynamic path, so that a regenerated cache file reflects any changes to the PHP code. A new GET variable handles this.
  • The relevant session data is not stored, so it cannot be restored when a cache file is loaded.
  • The cached HTML can be made smaller with an HTML compression tool.
  • The cache should be regenerated only when the data has been updated.
  • The back-end script should minimize the PHP queries needed when caching data.

The script in Listing 2 introduces a dynamicpath GET variable to solve the first problem. When the variable is present, the script skips the cached file and takes the dynamic path, so the caching request always generates fresh HTML. When caching, pass the dynamicpath GET variable, and quote the URL so that the shell does not treat the & character as a request to run the command in the background:

/usr/bin/curl -s 'http://localhost/?starttime=2012.01.01&dynamicpath' > /cache/localhost.2012.01.01.html

Listing 2. Improved caching example
<?php
$starttime = $_GET['starttime'];
$cache_file = "/cache/localhost.$starttime.html";
// True when the caching request asks for the dynamic path
$dynamicpath = isset($_GET['dynamicpath']);

if (file_exists($cache_file) && !$dynamicpath) {
	// Display cache contents
	readfile($cache_file);
} else {
	// Cache does not exist or force a cache
	// Dynamically load the data and display it
}
?>

Next, add support for SESSION variables. SESSION variables are used to store information or change settings for a user session.

If your dynamic web page provides multiple views of the same data (a high-level overview and a detailed view), then sessions are an effective way to transmit the data from one web page to another. Unfortunately, when caching the web page into a static HTML file, the session data is lost. If your website relies on session variables, caching sessions is important to maintain the same user experience.

Although you can access the session files directly in the location given by the session.save_path PHP configuration setting, we use the PHP functions session_encode() and session_decode() to store and restore sessions.

session_encode()
returns a serialized string of the contents of the current session data stored in the $_SESSION superglobal

session_decode()
decodes the serialized session data provided and populates the $_SESSION superglobal with the result

Listing 3. Improved caching example with storing and restoring sessions
<?php
session_start();
$starttime = $_GET['starttime'];
$cache_file = "/cache/localhost.$starttime.html";
$dynamicpath = isset($_GET['dynamicpath']);
$session_cache = "/cache/localhost.$starttime.sessions";

if (file_exists($cache_file) && !$dynamicpath) {
	// Display cache contents
	readfile($cache_file);
	// Load session variables
	if (file_exists($session_cache)) {
		// Read the whole file; a fixed-size fread() could truncate
		// session data larger than the buffer
		$sessiondata = file_get_contents($session_cache);
		session_decode($sessiondata); // Decode the session data
	}
} else {
	// Cache does not exist or force a cache
	// Dynamically load the data and display it
	// ...
	// Save session data
	if (isset($_GET['savesessions'])) {
		$session_data = session_encode(); // Get the session data
		// Write the session data to file
		file_put_contents($session_cache, $session_data);
	}
}

?>

In the script in Listing 3, the sessions are saved every time that the savesessions GET variable is specified.

Therefore, the cron job script should be altered as follows:

/usr/bin/curl -s 'http://localhost/?starttime=2012.01.01&dynamicpath&savesessions' > /cache/localhost.2012.01.01.html

Upon loading the cached data, the $session_cache file is read, and the $_SESSION variable is subsequently filled in with the session data by calling the session_decode() PHP function.

Minimizing the cached HTML file

To improve the performance of the cached HTML file, you can consider reducing the size of the file. This will reduce the bandwidth used when transmitting the HTML back to the web browser.

As an example, you can reduce the size of the HTML file by removing unnecessary whitespace characters. Several tools exist, such as the open source tool htmlcompressor (see Resources for a link), which can reduce HTML files by up to 30%. It is a Java program that takes the original HTML file as input and outputs the minimized HTML file to stdout.

By using htmlcompressor, the caching mechanism is split into two steps:

/usr/bin/curl -s 'http://localhost/?starttime=2012.01.01&dynamicpath&savesessions' > /cache/localhost.2012.01.01.html.large
java -jar htmlcompressor.jar --compress-js /cache/localhost.2012.01.01.html.large > /cache/localhost.2012.01.01.html

A compressed version of the HTML file is now cached.

Updating the cache file intelligently

The previous solution does not specify when the cache should be regenerated. The quick approach is to run the caching script every hour or so as a cron job. However, this imposes an unnecessary load on the machine whenever the data has not changed.

An intelligent way to cache the HTML is to continue to run the caching script in a cron job, but also query the statistics database for when it was last updated. If it was updated since the last cache, rerun the caching script; otherwise, do not rerun the caching script. With this approach, the caching script runs only when the database is updated.
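
The check described above can be sketched as a comparison between the modification time of the cache file and the time of the last database update. This is only an illustration: the function name cache_is_stale() is hypothetical, and the caller is assumed to obtain the last-update time from the statistics database as a Unix timestamp, which depends on the schema.

```php
<?php
// Hypothetical helper: decide whether the cache must be regenerated.
// $db_last_updated is a Unix timestamp supplied by the caller.
function cache_is_stale($cache_file, $db_last_updated)
{
    clearstatcache(true, $cache_file); // do not trust a cached stat result
    if (!file_exists($cache_file)) {
        return true; // nothing cached yet
    }
    // "<=" regenerates when both happened within the same second,
    // which errs on the side of fresh data.
    return filemtime($cache_file) <= $db_last_updated;
}
```

The cron script would rerun the caching request only when cache_is_stale() returns true.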

An alternative approach is to allow users who are accessing the website to create cache files on the fly. Because the cron script is essentially an HTTP request, a user accessing the web page can request a cache. The initial load will take a long time, but subsequent loads will be instantaneous. This approach is better if the frequently used form inputs are not predictable or if certain statistics are popular one day and not the other.
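
One way to create the cache file on the fly is to capture the output of the dynamic path with PHP output buffering. The following is a minimal sketch under that assumption. The function name render_and_cache() is hypothetical, and $render stands for whatever callable produces the page in your script.

```php
<?php
// Hypothetical helper: run the slow dynamic path once, send its output to
// the visitor, and keep a copy as the cache file for later requests.
function render_and_cache($render, $cache_file)
{
    ob_start();                  // capture everything the dynamic path prints
    call_user_func($render);
    $html = ob_get_clean();      // stop capturing and fetch the page

    // LOCK_EX keeps two simultaneous writers from interleaving their output
    file_put_contents($cache_file, $html, LOCK_EX);

    echo $html;                  // the first visitor still gets the page
    return $html;
}
```

This sketch does nothing to limit concurrency, so several visitors who arrive before the first cache file exists would each trigger the slow path.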

Minimizing PHP queries for caching data in the back-end script

When the cache file does not exist, the PHP script must still go through the dynamic path. To improve the performance of the dynamic path, consider caching the dynamic database queries in one XML file. You can do this by writing a back-end script that periodically performs the same queries as the PHP script but stores them in an XML file. The dynamic path of the PHP script can then read the XML file rather than performing the slower SQL queries during run time.
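
As an illustration of the XML approach, the following sketch stores query result rows in one XML file and reads them back with the PHP DOM extension. The function names, the rows/row element layout, and the shape of the rows (associative arrays whose keys are valid XML element names) are all assumptions made for this example.

```php
<?php
// Hypothetical helper: write query result rows to a single XML file.
// Each row is an associative array of column name => value.
function save_rows_to_xml(array $rows, $xml_file)
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->appendChild($doc->createElement('rows'));
    foreach ($rows as $row) {
        $rowNode = $root->appendChild($doc->createElement('row'));
        foreach ($row as $column => $value) {
            $cell = $rowNode->appendChild($doc->createElement($column));
            // createTextNode() escapes characters such as & and <
            $cell->appendChild($doc->createTextNode((string) $value));
        }
    }
    return $doc->save($xml_file) !== false;
}

// Hypothetical helper: read the rows back as arrays of strings.
// Returns an empty array when the file is missing or not valid XML,
// so the caller can fall back to the SQL queries.
function load_rows_from_xml($xml_file)
{
    $rows = array();
    $doc = new DOMDocument();
    if (!is_readable($xml_file) || !@$doc->load($xml_file)) {
        return $rows;
    }
    foreach ($doc->getElementsByTagName('row') as $rowNode) {
        $row = array();
        foreach ($rowNode->childNodes as $cell) {
            if ($cell->nodeType === XML_ELEMENT_NODE) {
                $row[$cell->nodeName] = $cell->textContent;
            }
        }
        $rows[] = $row;
    }
    return $rows;
}
```

The back-end script would call save_rows_to_xml() after running the queries, and the dynamic path of the PHP script would call load_rows_from_xml() before deciding whether it needs the database at all.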


Summary

This method is very helpful for websites that are dynamic but also rely on frequently used user inputs. For statistics-based websites that offer statistics on a month-to-month basis by default, this can be a good approach to caching the default page. Most users do not deviate from the default, so the caching approach can help speed up performance significantly for the majority of users.


Acknowledgements

The authors thank Kobi Vinayagamoorthy and Rosanne Jolin, who also helped make this article possible.

Resources

Learn

Get products and technologies

  • Download a free trial version of Rational software.
  • Evaluate other IBM software in the way that suits you best: Download it for a trial, try it online, use it in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement service-oriented architecture efficiently.
