Batch processing in PHP

How to create long-running jobs

What do you do when you have a feature in your Web application that takes longer than a second or two to finish? You need some type of offline processing solution. Check out several methods for offline servicing of long-running jobs in your PHP application.

Jack D. Herrington (jherr@pobox.com), Senior Software Engineer, Leverage Software

Jack D. Herrington is a senior software engineer with more than 20 years of experience. He's the author of three books: Code Generation in Action, Podcasting Hacks, and PHP Hacks. He has also written more than 30 articles.



05 December 2006


Big grocery chains have a big problem. Each day, thousands of transactions take place at each store when people buy groceries. Corporate executives want to mine that data. Which products are selling well? Which aren't? Where are organic products selling well? What about ice cream?

To capture this data, the organization must boil all that transactional data down into a data model better suited to generating the types of reports the folks at corporate need. However, that takes time, and as the chain grows, it may take longer than a day to process a day's worth of data. Thus, the big problem.

Now, your Web application may not be processing that much data, but any site can have processes that take longer than a customer is willing to wait. The generally accepted time a customer can wait before a process appears "slow" is 200 milliseconds. That number is based on desktop applications, and I think the Web has trained us to be a bit more forgiving. Nevertheless, you don't want to keep customers waiting more than a few seconds. So, here are a few strategies for batch-processing jobs in PHP.

Prose and cron

The central player in batch processing on UNIX® machines is the cron daemon. This daemon reads a configuration file that tells it which command lines to run and how often. The daemon then executes them, well, like clockwork. It even sends any output the jobs produce, including error messages, to a specified e-mail address so you can debug problems when they occur.

Now, I know that engineers who are strong advocates of a competing Web technology with a name associated with coffee are shaking their heads. "Threads! Threads are the true way to do background processing. The cron daemon is a dinosaur."

I politely disagree.

I've done both, and I think cron has the advantage of the "Keep It Simple, Stupid" (KISS) principle. It keeps the background processing simple. Instead of having a multithreaded job-processing application that runs forever and, thus, must never leak memory, you have a simple batch script that cron starts. The script determines whether there's anything to do, does it, then exits. No need to worry about memory leaks. No need to worry about a thread stalling or getting caught in an infinite loop.

So, how does cron work? Well, it depends on your hosting solution. I'll just stick to the old simple UNIX command-line version of cron, and you can check with your system administrator to see how you can implement it in your Web application.

Here is a simple cron configuration to run a PHP script once a day at 11 p.m.:

0 23 * * * jack /usr/bin/php /users/home/jack/myscript.php

The first five fields define the times when the script should be launched: minute, hour, day of the month, month, and day of the week. After that comes the name of the user that should be used to run the script. This user field appears in the system-wide crontab (typically /etc/crontab); a per-user crontab edited with crontab -e omits it. The rest of the line is the command line to execute. Here are a few more examples.

The command:

15 * * * * jack /usr/bin/php /users/home/jack/myscript.php

runs the script at the 15-minute mark of every hour.

The command:

15,45 * * * * jack /usr/bin/php /users/home/jack/myscript.php

runs the script at the 15- and 45-minute mark of every hour.

The command:

*/1 3-23 * * * jack /usr/bin/php /users/home/jack/myscript.php

runs the script every minute from 3 a.m. through 11:59 p.m. (hours 3 through 23).

The command:

30 23 * * 6 jack /usr/bin/php /users/home/jack/myscript.php

runs the script at 11:30 p.m. on Saturdays (the day of the week specified as 6).

As you can see, the combinations are limitless. You have as much control as you need over when the script runs. You can also specify more than one script to run, so that some run every minute, while others -- perhaps a backup script -- would only run once a day.
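
To make the field order concrete, here is a minimal sketch that splits one of the lines above into named parts. The parse_crontab_line() helper is hypothetical (it is not part of cron or PHP), and it assumes the system-crontab layout used in this article, where a user name follows the five time fields.

```php
<?php
// Hypothetical helper: split a system-crontab line into its named fields.
// Assumes five time fields, then a user name, then the command.
function parse_crontab_line( $line )
{
  // Limit of 7 keeps the whole command, spaces included, in the last part.
  $parts = preg_split( '/\s+/', trim( $line ), 7 );
  if ( count( $parts ) < 7 ) { return null; }
  return array(
    'minute'       => $parts[0],
    'hour'         => $parts[1],
    'day_of_month' => $parts[2],
    'month'        => $parts[3],
    'day_of_week'  => $parts[4],
    'user'         => $parts[5],
    'command'      => $parts[6],
  );
}

$entry = parse_crontab_line(
  '30 23 * * 6 jack /usr/bin/php /users/home/jack/myscript.php' );
// Prints: 23:30 on day 6 as jack
echo $entry['hour'].':'.$entry['minute'].' on day '.
  $entry['day_of_week'].' as '.$entry['user']."\n";
```

The sketch only names the fields; it does not interpret lists, ranges, or step values such as 15,45 or 3-23.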

To specify where any output or errors reported should be sent as e-mail, use the MAILTO directive, as shown:

MAILTO=jherr@pobox.com

Note: For Microsoft® Windows® users, there is an equivalent Scheduled Tasks system for launching command-line processes (like a PHP script) periodically.


The basics of batch-processing architecture

Batch processing is relatively straightforward. In most cases, it comes down to one of two workflows. The first is used in reporting; the script runs once a day to generate reports and send them out to a set group of people. The second is a batch job created in response to some request. For example, I log in to the Web application and ask it to send all the people registered in the system a message telling them about a great new feature. This action has to be done as a batch because there are 10,000 people on the system. Such a task will take PHP a while to complete, so it must be done by a job outside the browser.

In this second workflow, the Web application simply drops information in a location shared with the batch-processing application. That information specifies the nature of the job (for example, "Send this e-mail to all the people on the system"). The batch processor runs the job, then deletes the job. Alternatively, the processor marks the job as completed. Either way, the job should be identified as completed so it doesn't run again.
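
The second workflow can be sketched in a few lines. This is an illustration only: the SketchQueue class is hypothetical and keeps jobs in memory, where a real application would use a shared database table or directory. It shows the "mark as completed" variant, so a second pass of the processor finds nothing left to run.

```php
<?php
// Hypothetical in-memory stand-in for the location shared by the Web
// application and the batch processor.
class SketchQueue
{
  private $jobs = array();
  private $next_id = 1;

  // Called by the Web front end: record the job and return immediately.
  public function add( $description )
  {
    $id = $this->next_id++;
    $this->jobs[$id] = array( 'description' => $description, 'done' => false );
    return $id;
  }

  // Called by the batch processor: only jobs not yet completed.
  public function pending()
  {
    $out = array();
    foreach( $this->jobs as $id => $job ) {
      if ( !$job['done'] ) { $out[$id] = $job['description']; }
    }
    return $out;
  }

  public function mark_completed( $id )
  {
    $this->jobs[$id]['done'] = true;
  }
}

$queue = new SketchQueue();
$queue->add( 'Send this e-mail to all the people on the system' );

// Two passes stand in for two consecutive cron runs.
$ran = array();
for( $pass = 1; $pass <= 2; $pass++ ) {
  foreach( $queue->pending() as $id => $description ) {
    $ran []= $description;
    $queue->mark_completed( $id );
  }
}
// Prints: 1 job(s) run
echo count( $ran )." job(s) run\n";
```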

The rest of this article shows various methods for sharing data between the Web application front end and the batch-processing back end.


The mail queue

The first version is a dedicated mail queuing system. In this model, there's a table in the database with a list of e-mail messages that should be sent out to various people. The Web interface uses the mailouts class to add an e-mail to the queue. The e-mail processor uses the mailouts class to retrieve the pending e-mail, then uses it again to delete the pending messages from the queue.

The model starts with the MySQL schema.

Listing 1. mailout.sql
DROP TABLE IF EXISTS mailouts;
CREATE TABLE mailouts (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  from_address TEXT NOT NULL,
  to_address TEXT NOT NULL,
  subject TEXT NOT NULL,
  content TEXT NOT NULL,
  PRIMARY KEY ( id )
);

This schema is pretty simple. Each row has a from and a to address, along with a subject and the content of the e-mail.

Wrapped around the mailouts table in the database is the PHP mailouts class.

Listing 2. mailout.php
<?php
require_once('DB.php');

class Mailouts
{
  public static function get_db()
  {
    $dsn = 'mysql://root:@localhost/mailout';
    $db =& DB::Connect( $dsn, array() );
    if (PEAR::isError($db)) { die($db->getMessage()); }
    return $db;
  }
  public static function delete( $id )
  {
    $db = Mailouts::get_db();
    $sth = $db->prepare( 'DELETE FROM mailouts WHERE id=?' );
    $db->execute( $sth, $id );
    return true;
  }
  public static function add( $from, $to, $subject, $content )
  {
    $db = Mailouts::get_db();
    $sth = $db->prepare( 'INSERT INTO mailouts VALUES (null,?,?,?,?)' );
    $db->execute( $sth, array( $from, $to, $subject, $content ) );
    return true;
  }
  public static function get_all()
  {
    $db = Mailouts::get_db();
    $res = $db->query( "SELECT * FROM mailouts" );
    $rows = array();
    while( $res->fetchInto( $row ) ) { $rows []= $row; }
    return $rows;
  }
}
?>

The script includes the Pear::DB database access class. Then it defines a Mailouts class with three central static functions: add(), delete(), and get_all(). The add() method adds an e-mail to the queue and is meant to be used by the front end. The get_all() method returns all the data from the table. The delete() method deletes an individual message.

You might ask why I don't just have a delete_all() method that would be called at the end of the script. Such a method doesn't exist for two reasons: If I delete each message right after I send it, a script that is rerun after a problem will not resend the messages that already went out; and new messages could have been added between the start of the batch job and its completion.
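
The second reason is easy to see in a small simulation. The sketch below uses a plain PHP array as a hypothetical stand-in for the mailouts table; the ids and message texts are made up for illustration.

```php
<?php
// Hypothetical in-memory queue keyed by message id.
$queue = array( 1 => 'first message', 2 => 'second message' );

// The batch job takes its snapshot, as get_all() does.
$snapshot = $queue;

// While the job is running, the front end queues another message.
$queue[3] = 'added during the run';

// Delete only what was actually processed, one id at a time.
$sent = array();
foreach( $snapshot as $id => $text ) {
  $sent []= $text;
  unset( $queue[$id] );
}

// Message 3 is still queued for the next run. Emptying the whole queue at
// this point would have discarded it without ever sending it.
// Prints: 3
echo implode( ',', array_keys( $queue ) )."\n";
```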

The next step is to write a simple test script that adds an entry to the queue.

Listing 3. mailout_test_add.php
<?php
require 'mailout.php';

Mailouts::add( 'donotreply@mydomain.com',
  'molly@nocompany.com.org',
  'Test Subject',
  'This is a test of the batch mail sendout' );
?>

In this case, I'm adding a mailout to Molly at some company, with a test subject and e-mail body. I can run this script on the command line: php mailout_test_add.php.

To send the e-mail, I need another script that acts as my job processor.

Listing 4. mailout_send.php
<?php
require_once 'mailout.php';

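function process( $from, $to, $subject, $email ) {
  // mail() returns false when the message is not accepted for delivery
  return mail( $to, $subject, $email, "From: $from" );
}

$messages = Mailouts::get_all();
foreach( $messages as $msg ) {
  // Remove a message from the queue only after it has been handed off
  if ( process( $msg[1], $msg[2], $msg[3], $msg[4] ) ) {
    Mailouts::delete( $msg[0] );
  }
}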
?>

This script uses the get_all() method to retrieve all the e-mail messages, then uses PHP's mail() function to send them out one by one. After each message is sent, the delete() method removes that individual record from the queue.

This script would run at periodic intervals using the cron daemon. How often the script runs is up to you and the needs of your application.

Note: The PHP Extension and Application Repository (PEAR) contains an excellent real-world implementation of a mail queue system, the Mail_Queue package, available for download at no cost.


Something more generic

Having a solution dedicated simply to e-mail send-outs is fine, but what about something more generic, something that lets me send e-mail, generate reports, or do other time-expensive processing without having to wait in my browser?

To do that, I can take advantage of the fact that PHP is an interpreted language by storing PHP code in a queue in the database, then executing it later. Doing so requires two tables, as shown in Listing 5.

Listing 5. generic.sql
DROP TABLE IF EXISTS processing_items;
CREATE TABLE processing_items (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  `function` TEXT NOT NULL,
  PRIMARY KEY ( id )
);

DROP TABLE IF EXISTS processing_args;
CREATE TABLE processing_args (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  item_id MEDIUMINT NOT NULL,
  key_name TEXT NOT NULL,
  value TEXT NOT NULL,
  PRIMARY KEY ( id )
);

The first table, processing_items, holds the functions to be called by the job processor. The second table, processing_args, contains the arguments to send to the function as a hash table using key/value pairs.
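
To picture how one job spreads across the two tables, the following sketch flattens an argument hash into rows shaped like processing_args and then rebuilds it. The args_to_rows() and rows_to_args() helpers are hypothetical and work on plain arrays rather than the database; the item id 7 is an arbitrary example value.

```php
<?php
// Hypothetical helpers, shown only to illustrate the key/value layout.

// Flatten a job's argument hash into rows shaped like processing_args.
function args_to_rows( $item_id, $args )
{
  $rows = array();
  foreach( $args as $key => $value ) {
    $rows []= array( 'item_id' => $item_id,
                     'key_name' => $key,
                     'value' => $value );
  }
  return $rows;
}

// Rebuild the hash for one item from those rows.
function rows_to_args( $item_id, $rows )
{
  $args = array();
  foreach( $rows as $row ) {
    if ( $row['item_id'] == $item_id ) {
      $args[ $row['key_name'] ] = $row['value'];
    }
  }
  return $args;
}

$rows = args_to_rows( 7, array( 'to' => 'molly@nocompany.com.org',
                                'subject' => 'Hello' ) );
$args = rows_to_args( 7, $rows );
// Prints: 2 rows, subject=Hello
echo count( $rows ).' rows, subject='.$args['subject']."\n";
```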

These two tables are, like the mailouts table, wrapped by a PHP class, this time called ProcessingItems.

Listing 6. generic.php
<?php
require_once('DB.php');

class ProcessingItems
{
  public static function get_db() { ... }
  public static function delete( $id )
  {
    $db = ProcessingItems::get_db();
    $sth = $db->prepare( 'DELETE FROM processing_args WHERE item_id=?' );
    $db->execute( $sth, $id );
    $sth = $db->prepare( 'DELETE FROM processing_items WHERE id=?' );
    $db->execute( $sth, $id );
    return true;
  }
  public static function add( $function, $args )
  {
    $db = ProcessingItems::get_db();

    $sth = $db->prepare( 'INSERT INTO processing_items VALUES (null,?)' );
    $db->execute( $sth, array( $function ) );

    $res = $db->query( "SELECT last_insert_id()" );
    $id = null;
    while( $res->fetchInto( $row ) ) { $id = $row[0]; }

    // Prepare the statement once, then execute it for each argument
    $sth = $db->prepare( 'INSERT INTO processing_args VALUES (null,?,?,?)' );
    foreach( $args as $key => $value )
    {
        $db->execute( $sth, array( $id, $key, $value ) );
    }

    return true;
  }
  public static function get_all()
  {
    $db = ProcessingItems::get_db();

    $res = $db->query( "SELECT * FROM processing_items" );
    $rows = array();
    while( $res->fetchInto( $row ) )
    {
        $item = array();
        $item['id'] = $row[0];
        $item['function'] = $row[1];
        $item['args'] = array();

        $ares = $db->query( "SELECT key_name, value FROM
   processing_args WHERE item_id=?", $item['id'] );
        while( $ares->fetchInto( $arow ) )
            $item['args'][ $arow[0] ] = $arow[1];

        $rows []= $item;
    }
    return $rows;
  }
}
?>

This class contains three important methods: add(), get_all(), and delete(). Just like the mailouts system, the front end uses add(), and the processing engine uses get_all() and delete().

A test script to add an element to the processing item queue is shown in Listing 7.

Listing 7. generic_test_add.php
<?php
require_once 'generic.php';
ProcessingItems::add( 'printvalue', array( 'value' => 'foo' ) );
?>

In this case, I'm adding a call to the printvalue function with the value argument set to foo. I use the PHP command-line interpreter to run that script and put the method call into the queue. Then I use the following processing script to run the method.

Listing 8. generic_process.php
<?php
require_once 'generic.php';

function printvalue( $args ) {
  echo 'Printing: '.$args['value']."\n";
}

foreach( ProcessingItems::get_all() as $item ) {
  call_user_func_array( $item['function'],
    array( $item['args'] ) );
  ProcessingItems::delete( $item['id'] );
}
?>

This script is amazingly simple. It takes the processing items that get_all() returns, then uses call_user_func_array() -- a built-in PHP function -- to invoke each function dynamically with the given arguments. In this case, the local printvalue function is called.

To demonstrate this functionality, I show what happens on the command line:

% php generic_test_add.php 
% php generic_process.php 
Printing: foo
%

It's not much to look at, but you get the point. Through this mechanism, you can delay the processing of any PHP function to a later time.

Now, if the idea of putting PHP function names and arguments into the database is abhorrent to you, an alternative is to have a mapping in the PHP code that translates a "processing job type" name stored in the database into a real PHP handler function. In this way, if, at a later date, you decide to change the PHP back end for something else, it will still work as long as the "processing job type" strings match.
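
A minimal sketch of such a mapping follows. The $handlers array, the run_job() function, and the 'print' job type are all hypothetical names chosen for illustration, and this printvalue() returns its text instead of echoing it so the result is easy to inspect. Only job types listed in the map can run, so a stray function name in the queue is rejected rather than called.

```php
<?php
function printvalue( $args ) {
  return 'Printing: '.$args['value'];
}

// Hypothetical map from job type names (as stored in the queue)
// to the PHP functions that handle them.
$handlers = array(
  'print' => 'printvalue',
);

// Run one job; unknown job types are rejected instead of being called.
function run_job( $handlers, $type, $args )
{
  if ( !isset( $handlers[$type] ) ) {
    return false;
  }
  return call_user_func_array( $handlers[$type], array( $args ) );
}

$ok  = run_job( $handlers, 'print', array( 'value' => 'foo' ) );
$bad = run_job( $handlers, 'unlink', array( 'value' => '/tmp/x' ) );

// Prints: Printing: foo
echo $ok."\n";
// Prints: bool(false)
var_dump( $bad );
```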


Dumping the database

To finish the code examples, I show a slightly different angle, which is to use files in a directory to store the batch jobs instead of using the database. I offer this idea here not so much in the mindset of "Do it this way, rather than in a database," but as a design alternative you can use -- or not use, as you choose.

Obviously, there is no schema because we aren't using a database. So I start with a class that wraps the same type of add(), get_all(), and delete() methods used in the previous examples.

Listing 9. batch_by_file.php
<?php
define( 'BATCH_DIRECTORY', 'batch_items/' );
class BatchFiles
{
  public static function delete( $id )
  {
    unlink( $id );
    return true;
  }
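  public static function add( $function, $args )
  {
    // Pick a file name that is not already taken. uniqid() adds a
    // microsecond-based suffix, so the loop rarely has to retry.
    do
    {
        $path = BATCH_DIRECTORY.uniqid( time().'_' );
    } while ( file_exists( $path ) );

    $fh = fopen( $path, "w" );
    // fwrite() writes the text as-is; fprintf() would treat any %
    // in the data as a format specifier.
    fwrite( $fh, $function."\n" );
    foreach( $args as $k => $v )
    {
        fwrite( $fh, $k.":".$v."\n" );
    }
    fclose( $fh );

    return true;
  }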
  public static function get_all()
  {
    $rows = array();
    if (is_dir(BATCH_DIRECTORY)) {
        if ($dh = opendir(BATCH_DIRECTORY)) {
            while (($file = readdir($dh)) !== false) {
                $path = BATCH_DIRECTORY.$file;
                if ( is_dir( $path ) == false )
                {
                    $item = array();
                    $item['id'] = $path;
                    $fh = fopen( $path, 'r' );
                    if ( $fh )
                    {
                        $item['function'] = trim(fgets( $fh ));
                        $item['args'] = array();
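                        while( ( $line = fgets( $fh ) ) !== false )
                        {
                            // split() was removed in PHP 7; the limit of 2
                            // keeps colons that appear inside the value
                            $args = explode( ':', trim($line), 2 );
                            if ( count( $args ) == 2 )
                                $item['args'][$args[0]] = $args[1];
                        }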
                        $rows []= $item;
                        fclose( $fh );
                    }
                }
            }
            closedir($dh);
        }
    }
    return $rows;
  }
}
?>

The BatchFiles class has the three primary methods: add(), get_all(), and delete(). Instead of going to the database, the class reads and writes files from a directory named batch_items.
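
The file format itself is simple: the function name on the first line, then one key:value pair per line. The sketch below reproduces that format on strings so it can be tried without touching the disk. The encode_job() and decode_job() helpers are hypothetical and are not part of BatchFiles.

```php
<?php
// Hypothetical helpers mirroring the on-disk format of a batch item.

function encode_job( $function, $args )
{
  $text = $function."\n";
  foreach( $args as $k => $v ) { $text .= $k.':'.$v."\n"; }
  return $text;
}

function decode_job( $text )
{
  $lines = explode( "\n", trim( $text ) );
  $item = array( 'function' => array_shift( $lines ), 'args' => array() );
  foreach( $lines as $line ) {
    // A limit of 2 keeps any further colons inside the value.
    $pair = explode( ':', $line, 2 );
    if ( count( $pair ) == 2 ) { $item['args'][ $pair[0] ] = $pair[1]; }
  }
  return $item;
}

$text = encode_job( 'printvalue',
  array( 'value' => 'foo', 'url' => 'http://example.com/' ) );
$item = decode_job( $text );
// Prints: printvalue http://example.com/
echo $item['function'].' '.$item['args']['url']."\n";
```

The format has limits worth knowing: a key that contains a colon, or a value that contains a newline, will not survive the round trip.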

To add a new batch item, I use the following test code.

Listing 10. batch_by_file_test_add.php
<?php
require_once 'batch_by_file.php';

BatchFiles::add( "printvalue", array( 'value' => 'foo' ) );
?>

It's worth noting that, with the exception of the name of the class -- BatchFiles -- there is really no way of knowing how the job is being stored. So it would be easy enough to change this over to database-style storage later without changing the interface.
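
One way to make that interchangeability explicit is a shared interface. The sketch below is a variation, not the article's code: JobStore, MemoryJobStore, process_all(), and describe() are hypothetical names, and the methods are instance methods rather than the static ones used in the listings. The in-memory store exists only so the example runs on its own.

```php
<?php
// Hypothetical interface: the three operations every store in this
// article offers.
interface JobStore
{
  public function add( $function, $args );
  public function get_all();
  public function delete( $id );
}

// In-memory implementation, used here only to keep the sketch runnable.
class MemoryJobStore implements JobStore
{
  private $items = array();
  private $next_id = 1;

  public function add( $function, $args )
  {
    $id = $this->next_id++;
    $this->items[$id] = array( 'id' => $id,
                               'function' => $function,
                               'args' => $args );
    return true;
  }
  public function get_all() { return array_values( $this->items ); }
  public function delete( $id ) { unset( $this->items[$id] ); return true; }
}

// The processor depends only on the interface, not on how jobs are stored.
function process_all( JobStore $store )
{
  $results = array();
  foreach( $store->get_all() as $item ) {
    $results []= call_user_func_array( $item['function'],
                                       array( $item['args'] ) );
    $store->delete( $item['id'] );
  }
  return $results;
}

function describe( $args ) { return 'value is '.$args['value']; }

$store = new MemoryJobStore();
$store->add( 'describe', array( 'value' => 'foo' ) );
$results = process_all( $store );
// Prints: value is foo
echo $results[0]."\n";
```

A database-backed or file-backed class that implements the same three methods could be passed to process_all() without changing the processor.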

Finally, there is the processor code.

Listing 11. batch_by_file_processor.php
<?php
require_once 'batch_by_file.php';

function printvalue( $args ) {
  echo 'Printing: '.$args['value']."\n";
}

foreach( BatchFiles::get_all() as $item ) {
  call_user_func_array( $item['function'], array( $item['args'] ) );
  BatchFiles::delete( $item['id'] );
}
?>

This code is almost identical to the database version, with some changed file and class names.


Conclusion

As mentioned, there's a lot of support for threading on servers to handle background batch processing. Certainly, in some small cases, it's a bit easier to fire off a helper thread to handle small jobs. But with conventional tools -- cron, MySQL, standard object-oriented PHP, and Pear::DB -- creating batch jobs in PHP applications is easy to do, easy to deploy, and easy to maintain.

Resources

Learn

  • PHP.net is an excellent resource for PHP coders.
  • The PEAR Mail_Queue package is a solid implementation of a mail queue with a database back end.
  • The crontab manual has a difficult-to-understand version of the details of cron configuration.
  • The PHP manual has a whole section on Using PHP from the command line that will help you understand how the script is run from cron.
  • Visit IBM developerWorks' PHP project resources to learn more about PHP.
  • Stay current with developerWorks technical events and webcasts.
  • Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
  • Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
  • To listen to interesting interviews and discussions for software developers, be sure to check out developerWorks podcasts.
