Deploying Gearman across multiple environments

Improve application performance and reduce server load

One of the greatest challenges in the modern computing environment is the distribution and effective use of computing resources. It is now cheap and easy to install a powerful machine to perform a comparatively simple task, but you may not be getting the best overall performance or usage out of that machine. Conversely, many applications need to perform relatively small operations hundreds or thousands of times, without requiring (or being able to use) the full power of a single machine.

As the size of computing farms increases, sometimes there is a need to perform an action or operation across all of these machines simultaneously or selectively to perform administration or installation tasks.

Several solutions exist. Virtualization, for example, lets you slice a computer into smaller units to get the best performance from it, although it has its own issues and limitations. The memcached tool, which makes use of spare memory, and the grid-like solutions of IBM cloud also support flexible provisioning of computing power so it can be used wherever it is needed.

Gearman takes a different approach by providing a flexible mechanism for distributing discrete tasks around a group of machines. This method allows you to scale up and make more efficient use of your systems, while removing much of the complexity of producing and supporting such a system. The approach extends many existing solutions, including grid environments and even basic web services, with a practical distribution model, which makes Gearman useful in grid or web environments where you want to share and distribute work and requests. It can also be used to offload tasks and process information from your web, database, and Amazon environments, which makes it well suited to supporting IBM WebSphere® or Amazon deployments.

Gearman basics

There are many elements to the Gearman service that make it more than just a way of submitting and sharing work, but the main system comprises three components: the gearmand daemon (or job server); clients, which submit requests to the Gearman service; and workers, which perform the actual work (see Figure 1).

Figure 1. Summary of the Gearman service
Diagram shows the Gearman Job Server at the center of a number of clients and worker nodes

The gearmand server performs a simple function: it collects requests for work from clients and acts as a registry where workers submit information about the types of work they support. In a traditional grid service, the grid nodes are designated for a particular task and given work by a management server. In Gearman, workers instead ask for work to do.

This important distinction means that no central decision about which worker does what work is required. There is no configuration, prioritization, or other organization to set up when distributing the work. The system is self-running and self-managing, and many installations need little more than the basic structure described here.
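To make the pull model concrete, here is a minimal sketch in plain Python. It does not use Gearman at all: a shared in-process queue stands in for the job server, and the function name run_pull_workers and the worker names are assumptions made for illustration. Each worker thread asks for the next job when it is free, so no scheduler has to decide who does what.

```python
import queue
import threading

def run_pull_workers(jobs, worker_count):
    """Let worker threads pull jobs from one shared queue until it is empty."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}      # job text -> word count
    handled_by = {}   # job text -> name of the worker that took it
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                text = pending.get_nowait()   # the worker asks for work
            except queue.Empty:
                return                        # nothing left to do
            count = len(text.split())
            with lock:
                results[text] = count
                handled_by[text] = name

    threads = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, handled_by

if __name__ == "__main__":
    jobs = ["the quick brown fox", "jumps over", "the lazy dog"]
    results, handled_by = run_pull_workers(jobs, worker_count=2)
    for text in jobs:
        print(handled_by[text], "counted", results[text], "words")
```

Which worker handles which job varies from run to run, and that is the point: the work gets done without anyone assigning it.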

Gearman sample

The easiest way to see how Gearman works is to try a simple application that makes use of the Gearman service. You will be using Perl (see Related topics to download) for the worker and client elements in this example, although it is worth pointing out that you can mix and match languages to get the best out of a specific environment.

To start, you must have the gearmand server running. To do this, you can download the source, run the configure script, and make the server; you need the libevent library for this. Many Linux® distributions also have Gearman components available for installation without the manual process (see Related topics for libevent and Gearman).

Once installed, run the server with the -vv command-line option so you can see what it is doing once you start using it.

For the client and worker components, you need to install the Gearman CPAN module, which includes the Gearman::Client and Gearman::Worker modules you need to create clients and workers.

For a simple worker, let's provide a basic word count.

Listing 1. Simple Gearman Worker
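use Gearman::Worker;
my $worker = Gearman::Worker->new;

# Tell the worker which job server to ask for work
$worker->job_servers('192.168.0.2:4730');

# Advertise the wordcount task and the local function that handles it
$worker->register_function('wordcount' => \&wordcount);

# Process jobs until the program is terminated
$worker->work while 1;


sub wordcount
{
    my ($input) = @_;

    # split ' ' splits on whitespace and ignores any leading whitespace
    my @words = split ' ', $input->arg;

    return (scalar @words);
}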

The first line loads the module, and the second creates a new Worker object. The job_servers method then registers one or more job servers that the worker will contact to request available work. Here you specify the IP address (a hostname also works) and the port number. Stating the port explicitly matters because the default port number used for Gearman has changed, and a mismatch can stop the system from working. You can specify multiple job servers, either for resilience (a spare is available if the first one fails) or to distribute different work, using the same task names, across different server groups.
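If you keep the server list in a configuration file, it can help to normalize every entry to an explicit host and port before handing it to a client or worker. The following is a small sketch under stated assumptions: parse_job_server is a hypothetical helper, not part of any Gearman library, the second address is made up, and 4730 is simply the port gearmand reports in Listing 3. It handles only host or host:port strings, not IPv6 literals.

```python
DEFAULT_PORT = 4730  # the port gearmand listens on in Listing 3

def parse_job_server(address, default_port=DEFAULT_PORT):
    """Split 'host' or 'host:port' into an explicit (host, port) tuple."""
    host, sep, port = address.partition(":")
    if not host:
        raise ValueError(f"missing host in {address!r}")
    return host, (int(port) if sep else default_port)

# The second address is hypothetical; it relies on the default port.
servers = [parse_job_server(a) for a in ("192.168.0.2:4730", "192.168.0.3")]
print(servers)
```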

The register_function() method associates a local function in the script with the name of a task. Clients use the task name to request a particular operation. A worker can perform any number of tasks and, perhaps more importantly, each worker can support a different range of tasks. This lets you run a computationally expensive task only on the machines that have registered their ability to perform it. Because the worker requests work only for the tasks it is capable of performing, a task is never delegated to a worker that cannot perform it.
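The effect of registering abilities can be illustrated with a toy model. This sketch is not the Gearman protocol: ToyJobServer, ToyWorker, and the resize_image task are invented names, and everything runs in one process. It only shows why a worker that asks for work by task name never receives a job it cannot handle.

```python
class ToyJobServer:
    """In-memory stand-in for a job server; not the real Gearman protocol."""

    def __init__(self):
        self.queues = {}  # task name -> list of pending arguments

    def submit(self, task, arg):
        self.queues.setdefault(task, []).append(arg)

    def grab(self, abilities):
        """Return (task, arg) for the first pending job the caller can do."""
        for task in abilities:
            if self.queues.get(task):
                return task, self.queues[task].pop(0)
        return None


class ToyWorker:
    def __init__(self, server):
        self.server = server
        self.functions = {}  # task name -> local callable

    def register_function(self, task, func):
        self.functions[task] = func

    def work_once(self):
        """Ask for one job; return its result, or None if nothing suitable is queued."""
        job = self.server.grab(self.functions)
        if job is None:
            return None
        task, arg = job
        return self.functions[task](arg)


server = ToyJobServer()
server.submit("wordcount", "the quick brown fox")
server.submit("resize_image", "photo.png")   # hypothetical expensive task

light = ToyWorker(server)
light.register_function("wordcount", lambda text: len(text.split()))

print(light.work_once())   # handles the wordcount job
print(light.work_once())   # None: resize_image waits for a capable worker
```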

The final part, the work() method, switches the worker so it can start to process requests. In the example, the loop condition means the worker will continue processing requests until you terminate the program.

The rest of the script is the function itself. In this case, you take the input supplied by the client as the first argument (using the arg() method to get its value), split that value by whitespace, then return the integer count of items in the array. It's a little more verbose than needed to more easily demonstrate the steps.

On the client side, some of the same basic steps occur. You create a Gearman Client object, specify the servers, and call the task that you want to perform. You can see that more clearly in Listing 2.

Listing 2. Creating a Gearman Client object
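use Gearman::Client;
my $client = Gearman::Client->new;
$client->job_servers('192.168.0.2:4730');

# do_task waits for a worker to finish and returns a scalar reference
my $result = $client->do_task('wordcount','the quick brown fox jumps over the lazy dog');

# Stop with a clear message if the task failed and no result came back
die "wordcount task failed\n" unless defined $result;

print "Words $$result\n";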

The do_task() method does the main work. It submits the specified task (wordcount) along with the argument data and waits for the result. The result is returned as a reference, so you have to de-reference it to get the word count you requested.

Now you can test it by first starting gearmand. If you use three -v options, you get information about the registration and connections, as in Listing 3, which shows the gearmand started and the worker script registered.

Listing 3. Starting gearmand
$ gearmand -vvv
 INFO Starting up
 INFO Listening on 0.0.0.0:4730 (6)
 INFO Creating wakeup pipe
 INFO Creating IO thread wakeup pipe
 INFO Adding event for listening socket (6)
 INFO Adding event for wakeup pipe
 INFO Entering main event loop
 INFO Accepted connection from 192.168.0.2:47158
 INFO [   0]     192.168.0.2:47158 Connected
 INFO [   0]     192.168.0.2:47158 Disconnected
 INFO Accepted connection from 192.168.0.2:47159
 INFO [   0]     192.168.0.2:47159 Connected
 INFO [   0]     192.168.0.2:47159 Disconnected
 INFO Accepted connection from 192.168.0.2:47160
 INFO [   0]     192.168.0.2:47160 Connected

The worker was started by running the script with Perl, but the worker doesn't produce any output. The client, on the other hand, returns the number of words in the sentence, as in Listing 4.

Listing 4. Running the Perl script for returning number of words in a sentence
$ perl client.pl 
Words 9

This all seems like a lot of work to get a simple word count. But remember that the machine handling the worker and the client sending the request could be on different sides of the world, or, just as easily, different parts of the same cloud.

Listing 5 shows the gearmand log after the same request is made again, this time from a different machine acting as the client (192.168.0.111). To run the script on another machine, all you need are the Gearman libraries, script modules, and the script itself.

Listing 5. gearmand log with a different machine as the client
$ gearmand -vvv
 INFO Starting up
 INFO Listening on 0.0.0.0:4730 (6)
 INFO Creating wakeup pipe
 INFO Creating IO thread wakeup pipe
 INFO Adding event for listening socket (6)
 INFO Adding event for wakeup pipe
 INFO Entering main event loop
 INFO Accepted connection from 192.168.0.2:47158
 INFO [   0]     192.168.0.2:47158 Connected
 INFO [   0]     192.168.0.2:47158 Disconnected
 INFO Accepted connection from 192.168.0.2:47159
 INFO [   0]     192.168.0.2:47159 Connected
 INFO [   0]     192.168.0.2:47159 Disconnected
 INFO Accepted connection from 192.168.0.2:47160
 INFO [   0]     192.168.0.2:47160 Connected
 INFO Accepted connection from 192.168.0.2:56545
 INFO [   0]     192.168.0.2:56545 Connected
 INFO [   0]     192.168.0.2:56545 Disconnected
 INFO Accepted connection from 192.168.0.111:54307
 INFO [   0]   192.168.0.111:54307 Connected
 INFO [   0]   192.168.0.111:54307 Disconnected

To understand how useful that is, consider the flexibility behind the process. It may seem like another solution for web services-like functionality. However, you could have 20 workers registered and a client that just asks one of them to complete the work. There is no need to work out which one or use a complicated load-balancing system to decide for you.

Using Gearman across environments

The gearmand server is written in C, making it very portable across a range of UNIX® and Linux environments, and it is built with common open source tools (like the GNU configure script and autotools build system), which should make it usable and deployable in many environments.

The client and worker interfaces are available in a host of languages, including Perl, PHP, Python, and the Java™ programming language — and even through the UNIX/Linux shell. In addition, there are also UDFs for Drizzle, MySQL, and PostgreSQL to allow you to interact directly within those databases with Gearman.

However, you are not limited to communicating between the same worker and client interfaces. You can, for example, call a task supported by a Python worker and request that information from the earlier Perl script. You can see the same word-count script in Python in Listing 6.

Listing 6. Word-count script in Python
from gearman import GearmanWorker

def wordcount(input):
    words = input.arg.split(None)
    print words
    return len(words)

worker = GearmanWorker(["127.0.0.1:4730"])
worker.register_function('wordcount', wordcount)
worker.work()

Running the worker and the client gets the same word count back.

Listing 7. Running the worker and client again
$ perl client.pl
Words 9

The advantage of this is that you can use Gearman from a variety of tools and integrate it into your existing application regardless of the environments you are using. If your web application is based on PHP, but you want to use application functionality from your existing WebSphere environment, you can do so through Gearman by writing a Java worker that exposes that functionality and calling the right task from the PHP front end.

One thing to be careful of is the sharing of data. Gearman doesn't do any form of translation or manipulation of the data exchanged. For the simple strings and integers you have used here, this isn't a problem, but you couldn't share an array of values in PHP and expect it to be understood within the Java language. For this type of interaction, you can use one of the many structured data standards, such as JavaScript Object Notation (JSON) or XML. Alternatively, if you are working with information from a database, just share the ID or information required to find the data that needs to be processed, or use a transient method like memcached (although you may still need JSON or an equivalent).
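As a sketch of the JSON approach, the client below encodes a small structure into bytes, and the worker function decodes it, does its work, and encodes the reply. The function names and the longest-word task are assumptions chosen for illustration; no Gearman calls are made, so the bytes are simply passed from one function to the other where a real job argument would travel.

```python
import json

def encode_payload(doc_id, words):
    """Client side: turn structured data into bytes for the job argument."""
    return json.dumps({"id": doc_id, "words": words}).encode("utf-8")

def longest_word(raw):
    """Worker side: decode the argument, do the work, encode the reply."""
    payload = json.loads(raw.decode("utf-8"))
    longest = max(payload["words"], key=len)
    return json.dumps({"id": payload["id"], "longest": longest}).encode("utf-8")

request = encode_payload(9334, ["the", "quick", "brown", "fox"])
reply = json.loads(longest_word(request).decode("utf-8"))
print(reply)
```

Because both sides agree only on JSON, the worker could equally be written in PHP, Perl, or the Java language without changing the client.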

Let's have a look at some more examples of Gearman deployments.

Deploying Gearman in dynamic environments

A flexible system like this is best suited to an equally flexible pool of servers, such as Amazon's EC2, or to an existing large-scale infrastructure, such as a web server farm with machines that are underused or able to handle small, discrete tasks.

When used in a cloud environment, you use the flexible nature of Gearman to ramp up the available workers when you need them. All that's needed to add workers to the Gearman system is for the worker script to be executed during boot time.

So consider the layout in Figure 2. You have a standard Gearman environment using your standard group of workers serving the needs of your clients. When the load on your environment suddenly increases, you can boot your EC2 instances, run the worker script to register their availability to do work, and pick up and process the information. The EC2 instances are therefore temporary — you can register and de-register the instances as you need them.

Figure 2. Standard Gearman environment using a standard group of workers
Diagram shows a group of clients accessing standard workers and cloud nodes through the Gearman Job Server

This can be really useful if there is processing and preparation involved in formulating your information. In that case, you can bring up the EC2 instances, do the work through Gearman, collate the responses, and shut the EC2 instances down. This saves your credit for when you really need tens or hundreds of machines to do the processing.
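One way to decide when to boot extra instances is to look at the queue depth the job server reports. The sketch below assumes the tab-separated layout returned by gearmand's text "status" administrative command (function name, jobs queued, jobs running, available workers, with a line holding a single period at the end); check the documentation for your gearmand version before relying on it. The scaling rule, the threshold, and the sample numbers are all hypothetical.

```python
def parse_status(text):
    """Parse tab-separated status lines into {function: (queued, running, workers)}."""
    stats = {}
    for line in text.splitlines():
        if line == ".":      # a lone period ends the listing
            break
        if not line:
            continue
        name, queued, running, workers = line.split("\t")
        stats[name] = (int(queued), int(running), int(workers))
    return stats

def needs_more_workers(stats, function, jobs_per_worker=10):
    """Hypothetical rule: scale up when the backlog per worker exceeds a threshold."""
    queued, _running, workers = stats.get(function, (0, 0, 0))
    if workers == 0:
        return queued > 0
    return queued / workers > jobs_per_worker

sample = "wordcount\t45\t3\t3\n.\n"   # made-up numbers for illustration
stats = parse_status(sample)
print(stats, needs_more_workers(stats, "wordcount"))
```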

Using Gearman at the edges of data

There are often cases where you want to perform an operation on some information, but the processing is not time-critical, or there is some distance (in network terms, not physical terms) between the source of the data and its destination.

So far, you have looked at ways of using Gearman where you want or need an immediate response. Gearman also includes the ability to initiate a background process. This occurs when the foreground client asks for the work to be done and doesn't worry about when (or even if) the response comes back.

Consider a web application that sends an e-mail during the registration process. There are many potential problems with issuing such an e-mail live at the moment the user clicks the Submit button on the web form. Problems reaching the mail server, or just submitting the e-mail at busy times, could delay the web app. With Gearman, you could submit the task to the queue and let one of your workers handle the formatting and sending of the e-mail, allowing your web interface to respond instantly. This is a good example of dispatching a background process where the front end doesn't need to wait for a response.

The same principle applies to other non-time-sensitive elements. For example, if you are providing or using Twitter integration, you can use Gearman to handle the posting to the Twitter account. In this instance, there is no need for the content to appear instantly, or for problems with the posting to slow down the rest of your application. The Gearman service allows a worker to return a failure state so that the task is requeued and retried.
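The requeue-and-retry idea can be sketched without Gearman. In the simulation below, a failed job goes to the back of an in-process queue until it succeeds or runs out of attempts. The names process_with_retries and flaky_post, and the limit of three attempts, are assumptions for illustration and do not describe how gearmand itself schedules retries.

```python
from collections import deque

def process_with_retries(jobs, handler, max_attempts=3):
    """Run handler on each job, requeueing failures up to max_attempts times."""
    pending = deque((job, 1) for job in jobs)   # (job, attempt number)
    done, failed = [], []
    while pending:
        job, attempt = pending.popleft()
        try:
            handler(job)
        except Exception:
            if attempt < max_attempts:
                pending.append((job, attempt + 1))   # back of the queue
            else:
                failed.append(job)
        else:
            done.append(job)
    return done, failed

# Hypothetical posting function that fails the first time it sees each message.
seen = set()

def flaky_post(message):
    if message not in seen:
        seen.add(message)
        raise ConnectionError("service unavailable")

print(process_with_retries(["hello", "world"], flaky_post))
```

Other jobs keep flowing while a failed one waits for its next attempt, so one unreachable service does not block the rest of the queue.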

You can also use Gearman to handle database and other updates where the instantaneous nature of the update is not vital, so the information doesn't need to be written to the database live. In this situation, you can take advantage of other components in the modern web app arsenal, such as memcached.

A good example is an application that processes content, such as a document archiver where you want to build indexes and other information from the content. Although traditional queues for this type of operation have been available, Gearman makes it easy to spread and distribute that information around a group of machines and increase the performance of the indexing process.

One element that can help in these situations is combining Gearman's processing with memcached to allow you to submit the data, parse and process it, then update the cached version of that information automatically.

Within a blog or other content-management system, you could use this to allow the update and information to be published instantly by updating your memcached version of the content and updating the database in the background, or by updating the database and updating the memcached client-side version for display once the write has completed. Both solutions help to reduce the write contention on your database by reducing the simultaneous writes, while improving the responsiveness of your front-end application.

An example of using memcached can be seen in the modified versions of the client and worker scripts in Listing 8 and Listing 9. The client writes the string to be counted into memcached, and the worker uses the ID supplied by the client to read the string, count the words, and write the result back to memcached. A hard-coded ID has been used in this case, but you could use an ID sourced from your database or a UUID.

Listing 8. memcached-based client
use Gearman::Client;
use Cache::Memcached;

# Set up memcached

my $cache = new Cache::Memcached {
    'servers' => [
                   '192.168.0.2:11211',
                   ],
};

# Set up Gearman

my $client = Gearman::Client->new;
$client->job_servers('192.168.0.2:4730');

#
# Obtain the information you want to process
# and generate a unique key

my $id = 9334;

# Write some metadata

$cache->set(sprintf('doc-%d-srctxt',$id),
            'The quick brown fox jumps over the lazy dog');

my $result = $client->dispatch_background('wordcount',$id);

Listing 9 shows the modified worker script.

Listing 9. memcached-based worker
use Cache::Memcached;
use Gearman::Worker;

my $cache = new Cache::Memcached {
    'servers' => [
                   '192.168.0.2:11211',
                   ],
    };

my $worker = Gearman::Worker->new;
$worker->job_servers('192.168.0.2:4730');
$worker->register_function('wordcount' => \&wordcount);
$worker->work while 1;


sub wordcount
{
    my ($arg) = @_;

    my $id = $arg->arg;

    print STDERR "Providing word count for ",$id,"\n";

    my $string = $cache->get(sprintf('doc-%d-srctxt',$id));

    if (!defined($string))
    {
        $cache->set(sprintf('doc-%d-status',$id),
                    'Error: Source text not found');
        return;
    }

    my @words = split /\s+/,$string;
    $cache->set(sprintf('doc-%d-status',$id),
                'Complete');
    $cache->set(sprintf('doc-%d-result',$id),
                scalar @words);
}

Note that the worker uses separate memcached entries, each tagged with the document ID, to hold the status and the result; failures are written into the status entry. This allows a client to resubmit the work request in the event of a transient failure.
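The key layout is easier to see in isolation. This sketch mirrors the worker logic of Listing 9 in Python, with a plain dict standing in for memcached so it runs without a cache server. The helper names key and wordcount_job are assumptions, and the second document ID is made up to show the error path.

```python
def key(doc_id, field):
    """Build keys in the same 'doc-<id>-<field>' layout as Listings 8 and 9."""
    return "doc-%d-%s" % (doc_id, field)

def wordcount_job(cache, doc_id):
    """Read the source text, then store a result and a status for the client."""
    text = cache.get(key(doc_id, "srctxt"))
    if text is None:
        cache[key(doc_id, "status")] = "Error: Source text not found"
        return
    # Store the result first so 'Complete' is never visible without it.
    cache[key(doc_id, "result")] = len(text.split())
    cache[key(doc_id, "status")] = "Complete"

cache = {}   # a plain dict standing in for memcached
cache[key(9334, "srctxt")] = "The quick brown fox jumps over the lazy dog"
wordcount_job(cache, 9334)
wordcount_job(cache, 1)      # no source text was stored for this ID
print(cache[key(9334, "status")], cache[key(9334, "result")])
print(cache[key(1, "status")])
```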

Because the client doesn't expect to get the response back, a separate script is required to get the information from memcached when it is ready. Listing 10 shows a simple client script for this, which either reports the result or, if the result is not ready or an error occurred, reports the status.

Listing 10. Getting a result out of memcached
use Cache::Memcached;

my $cache = new Cache::Memcached {
    'servers' => [
                   '192.168.0.2:11211',
                   ],
};

my $id = 9334;

if ((my $result = $cache->get(sprintf('doc-%d-status',$id))) =~ m/Complete/)
{
    print "Count is ",$cache->get(sprintf('doc-%d-result',$id),),"\n";
}
else
{
    print "Result not ready: $result\n";
}

To use this, run the worker script: $ perl workermemc.pl. Then run the client script to submit the request into the queue: $ perl clientmemc.pl.

You can see whether the request has been completed by using the retrieval script.

Listing 11. Retrieval script
$ perl getresult.pl
Result not ready:

Once the worker has finished, running the retrieval script again reports the result.

Listing 12. A final result
$ perl getresult.pl
Count is 9

You could repeat this with all sorts of data and environments, and include database writing and recovery into that process.
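Instead of re-running the retrieval script by hand, a client could poll the status entry until the job finishes. The following is a minimal sketch, assuming the same doc-<id>-status and doc-<id>-result keys used in the listings; wait_for_result, the timeout, and the polling interval are hypothetical, and a dict again stands in for memcached.

```python
import time

def wait_for_result(cache, doc_id, timeout=5.0, interval=0.1):
    """Poll the status key until the job completes, fails, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = cache.get("doc-%d-status" % doc_id)
        if status == "Complete":
            return cache.get("doc-%d-result" % doc_id)
        if status is not None:
            raise RuntimeError(status)          # the worker recorded an error
        if time.monotonic() >= deadline:
            raise TimeoutError("no status for document %d" % doc_id)
        time.sleep(interval)

cache = {"doc-9334-status": "Complete", "doc-9334-result": 9}  # dict stand-in
print("Count is", wait_for_result(cache, 9334))
```

A timeout matters here because a background job gives the client no other signal that the work was lost or never picked up.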

Conclusion

Gearman is a simple idea that is easy to use yet provides a wealth of functionality, from distributing and sharing your workload to allowing interoperability among different languages and deployment environments. Its queuing system lets you improve application performance and reduce the load on your database and other server components by queuing work, which reduces the concurrency that normally causes performance problems at the front end.


Related topics


Published: August 17, 2010