Accelerating web applications with OpenCL

High-performance computing in the browser

The Web Computing Language (WebCL) enables web applications to execute functions on the host's graphics processor, dramatically accelerating many types of data-processing routines, such as data sorting, text searching, and solving systems of matrices. Learn how WebCL's JavaScript classes enable high-speed number-crunching by delivering computing tasks to graphics processors.

Matthew Scarpino (mattscar@gmail.com), Software developer, Eclipse Engineering LLC

Matthew Scarpino is a software developer in Mountain View, California, who specializes in high-performance graphical processing. He is the author of OpenCL in Action and openclblog.com. He is currently coding an open source solid modeling tool that combines OpenGL and OpenCL.



23 April 2013


Because of their highly parallelized architectures, graphics processing units (GPUs) can execute certain types of applications much faster than traditional central processing units (CPUs). The Open Computing Language (OpenCL) is one of the most popular languages available for harnessing a GPU's power. A prominent example is Adobe® Premiere® Pro CS6, which accelerates image and video processing by executing OpenCL routines on the user's GPU.

Several companies decided that GPU acceleration would be a great help in browser-based applications and formed a working group to advance the technology. In May 2012, the group released a preliminary draft of the Web Computing Language (WebCL). As stated on its main site (see Resources), the WebCL working group intends to "enable web applications to harness GPU and multicore CPU parallel processing from within a web browser."

Two companies have released extensions making it possible to call WebCL functions from inside a browser. Samsung has released a WebCL extension for WebKit, the engine that powers the Apple Safari browser on Mac OS X. Nokia has released a WebCL plug-in for Mozilla Firefox that runs on Windows® and 32-bit Linux® operating systems. Because of the wider developer base, this article focuses on Nokia's implementation. This article explains how to install WebCL and the basics of WebCL coding. Afterward, it demonstrates how to use WebCL to search through text at high speed.

Installing WebCL

WebCL applications require three components: an OpenCL software development kit (SDK), the Firefox browser, and Nokia's plug-in for Firefox.

Obtain an OpenCL SDK

WebCL applications call OpenCL functions on the host computer. Therefore, OpenCL must be installed before any WebCL application can run. OpenCL development kits are device- and operating-system-specific; for example, to run routines on a Windows computer with an Nvidia GPU, Nvidia's OpenCL SDK for Windows is required. Although this article cannot provide installation directions for every OpenCL SDK, the following list points you in the right direction:

  • To execute routines on an AMD CPU or GPU, download the AMD Accelerated Parallel Processing (APP) SDK (see Resources for a link).
  • To execute routines on an Nvidia GPU, download the Nvidia GPU Computing SDK (see Resources for a link).
  • To execute routines on an Intel® CPU, download the Intel SDK for OpenCL (see Resources for a link).

Multiple OpenCL SDKs can be installed on a computer without creating conflicts.

CPUs and GPUs

Modern CPUs contain several processing elements, called cores. Each core has its own processing pipeline and data storage, and cores communicate with each other using methods like direct memory access. Programming a multicore CPU is like leading a small team of highly trained spies: Each member can execute a complex mission on his or her own, but the process works better as a team.

In contrast, GPU processing is performed by work items, which are limited with regard to memory and processing resources. Work items aren't good at making decisions, and they're even worse at coordinating tasks among themselves. An advantage of a GPU, however, is that it can execute many thousands of work items at the same time. So programming a GPU is like leading a vast army of zombies. As long as the commands are simple, a lot of work can be accomplished.

Install WebCL in Firefox

After installing an OpenCL SDK, installing WebCL is easy, as shown in the following steps:

  1. Download Firefox from the Mozilla site (see Resources for a link).
  2. In Firefox, go to Nokia's WebCL site (see Resources for a link) and click on Nokia WebCL extension.
  3. If necessary, allow Firefox to download the Firefox extension (an *.xpi file). In the Software Installation dialog box, click on Install Now, then restart the browser.
  4. To test the installation, go to Nokia's WebCL site and select Click here to check that you have WebCL enabled.

    If a dialog box appears with the words Excellent! Your system does support WebCL, WebCL has been successfully installed. If the dialog box says, Unfortunately your system does not support WebCL, then you may need to reinstall the extension or access WebCL from a different computer.
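
A page can perform a similar check itself before calling any WebCL function. The following is a minimal sketch, assuming the plug-in exposes a global object named WebCL (as the listings later in this article do); the hasWebCL helper is a hypothetical name introduced only for illustration.

```javascript
// Hypothetical helper: reports whether a global-like object exposes WebCL.
// Assumption: the plug-in installs a global named WebCL, as used in this article's listings.
function hasWebCL(globalObject) {
  return typeof globalObject.WebCL !== "undefined" && globalObject.WebCL !== null;
}

// In a browser, check the page's global object; elsewhere this reports "not available".
const globalObject = typeof window !== "undefined" ? window : globalThis;
console.log(hasWebCL(globalObject) ? "WebCL is available" : "WebCL is not available");
```

Running a check like this first lets the page show a friendly message instead of failing on the first WebCL call.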


Writing WebCL applications

WebCL terminology can be mystifying to newcomers, so it is important to remember the overall goal:

  • Delivering a function to a device.
  • Executing the function on the device.
  • Transferring output from the device to the host.

Figure 1 shows this process.

Figure 1. Operation of a WebCL application
The host processor (CPU) deploying a kernel to a target device (GPU)

WebCL classes and functions

To build a WebCL application with Nokia's toolset, familiarity with seven JavaScript classes is required. Table 1 lists each class along with its important functions.

Table 1. Important WebCL classes and functions
| Class | Purpose | Function | Function description |
|---|---|---|---|
| IWebCL | Contains static functions that access platforms and devices | getPlatformIDs() | Returns an array of IWebCLPlatform objects |
| | | createContext(properties, devices) | Returns an IWebCLContext object containing the given devices |
| | | createContextFromType(properties, device_type) | Returns an IWebCLContext object containing devices of a given type |
| IWebCLPlatform | Represents an installed OpenCL SDK on the host | getDeviceIDs(device_type) | Returns an array of IWebCLDevice objects of a given type (optional) |
| IWebCLDevice | Represents a physical device compliant with the OpenCL standard | getDeviceInfo(name) | Returns information about the device |
| IWebCLContext | Manages programs, memory objects, and command queues | createProgramWithSource(source_text) | Creates an IWebCLProgram object from source code |
| | | createBuffer(flags, size) | Creates an IWebCLMemoryObject object to hold data of the given size |
| | | createCommandQueue(device, properties) | Creates an IWebCLCommandQueue object to deliver commands to the given device |
| IWebCLProgram | Source code containing one or more kernel functions | buildProgram(devices, options) | Compiles the program's source code for the specified devices |
| | | createKernel(kernel_name) | Creates an IWebCLKernel from the function with the given name |
| IWebCLKernel | Specially coded function capable of being executed by an OpenCL-compliant device | setKernelArg(index, value, type) | Creates a kernel function argument from a memory object |
| IWebCLCommandQueue | Enables communication between the host and device | enqueueReadBuffer(mem_object, blocking, offset, size, data_object, wait_list) | Reads a memory object from the device to the host |
| | | enqueueWriteBuffer(mem_object, blocking, offset, size, data_object, wait_list) | Writes data from the host to a memory object on the device |
| | | enqueueTask(kernel, wait_list) | Enqueues an IWebCLKernel object for execution with a single work item |
| | | enqueueNDRangeKernel(kernel, dim, offset, global_size, local_size, wait_list) | Enqueues an IWebCLKernel object for execution with multiple work items |

It takes time to understand how these classes work and work together, but once coded, a WebCL application can be copied and pasted into other applications with only minor changes.

A simple example

The example code, which is available from the Download section, contains a file called simple.html that provides a complete WebCL application. The JavaScript code, like the code in most WebCL applications, can be divided into five steps:

  1. A WebCL application starts by creating an IWebCLContext object to manage kernel deployment. The WebCL.createContextFromType function accepts two arguments: an array of context properties (here, the CL_CONTEXT_PLATFORM property name followed by the platform to use), and a value that identifies the type of devices that should be placed in the context. The following code shows how this works.
    Listing 1. Create a context and access its devices
    var platforms = WebCL.getPlatformIDs();
    var ctx_props = [WebCL.CL_CONTEXT_PLATFORM, platforms[0]];
    var ctx = WebCL.createContextFromType(ctx_props, WebCL.CL_DEVICE_TYPE_GPU);
    var devices = ctx.getContextInfo(WebCL.CL_CONTEXT_DEVICES);

    The CL_DEVICE_TYPE_GPU parameter indicates that the context should only contain GPU devices. The last line places the context's devices in an array that will usually contain only one element: the IWebCLDevice, representing the host's GPU.

  2. The context creates an IWebCLProgram from the source code. The application then compiles the program and creates an IWebCLKernel from one of its kernel functions. This can be accomplished as follows:
    Listing 2. Compile the program and create a kernel
    var program_src = "__kernel void basic(__global float4* in_vec,     \
                                           __global float4* out_vec) {  \
                           out_vec[0] = in_vec[0];                      \
                       }";
    var program = ctx.createProgramWithSource(program_src);
    program.buildProgram([devices[0]], "");
    var kernel = program.createKernel("basic");

    It's important to understand the difference between programs and kernels. A program is a body of code that contains one or more functions. A kernel represents a single function in the program's code.

  3. The kernel is created from a function called basic, which accepts two arguments: in_vec and out_vec. Both are stored in the device's global address space, so the application will create two IWebCLMemoryObjects (in_buff and out_buff) and make them kernel arguments. The following code shows how this is done.
    Listing 3. Create memory objects and make them kernel arguments
    var in_buff = ctx.createBuffer(WebCL.CL_MEM_READ_ONLY, 16);
    var out_buff = ctx.createBuffer(WebCL.CL_MEM_WRITE_ONLY, 16);
    kernel.setKernelArg(0, in_buff);
    kernel.setKernelArg(1, out_buff);

    The parameters of createBuffer set properties for the memory object. The first identifies whether the kernel argument provides input (CL_MEM_READ_ONLY) or receives output (CL_MEM_WRITE_ONLY). The second parameter of createBuffer identifies the size of the data in bytes. Each memory object contains four floats of 4 bytes each, so their sizes are set to 16.

  4. Now that the kernel is fully configured, the application creates an IWebCLCommandQueue to transfer commands from the host to the device. The following code creates the command queue, writes data to the kernel's input argument (in_vec), and then launches the kernel by calling enqueueTask.
    Listing 4. Create the command queue, write input data, and launch the kernel
    var queue = ctx.createCommandQueue(devices[0], 0);
    var in_data = new Float32Array([1.5, 2.5, 3.5, 4.5]);
    queue.enqueueWriteBuffer(in_buff, false, 0, 16, in_data, []);
    queue.enqueueTask(kernel, []);

    The input data is given as an array of 32-bit floating-point values. Kernel data doesn't have to be given in floating point, but the values must be placed in a similar typed array, such as an Int32Array or a Uint16Array.

  5. To verify that the kernel has been executed correctly, the application reads the data from the kernel's second argument, which is given by the out_buff memory object. The following code reads the data from out_buff into an array called out_data and displays the array's content in the web page.
    Listing 5. Read output data from the device and display it in the page
    var out_data = new Float32Array(4);
    queue.enqueueReadBuffer(out_buff, true, 0, 16, out_data, []);
    var output = document.getElementById("output");
    output.innerHTML = "Output: " + out_data[0] + ", " + out_data[1] + ", ";
    output.innerHTML += out_data[2] + ", " + out_data[3];

    This code displays the output data by accessing an element in the HTML document, called output. To see the page's HTML content, download this article's example code and open simple.html in Firefox.
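
The steps above pass the literal 16 as the buffer size. As a minimal sketch in plain JavaScript, the same value can be derived from the typed array that holds the data, which avoids mismatches when the element type or count changes. The variable names here are illustrative and are not part of simple.html.

```javascript
// Derive buffer sizes from typed arrays instead of hard-coding 16.
const inData = new Float32Array([1.5, 2.5, 3.5, 4.5]);

// Bytes per element times element count gives the size in bytes.
const sizeFromElements = Float32Array.BYTES_PER_ELEMENT * inData.length;

// byteLength reports the same value directly.
const sizeFromArray = inData.byteLength;

// Other element types change the size: four 16-bit integers need 8 bytes.
const shortData = new Uint16Array(4);

console.log(sizeFromElements, sizeFromArray, shortData.byteLength); // 16 16 8
```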

Work items and enqueueNDRangeKernel

The preceding application executed the kernel by calling enqueueTask. This is fine while learning about WebCL, but production applications rarely use this function because enqueueTask executes the kernel with a single thread, which defeats the purpose of using massively parallel devices like GPUs.

In WebCL, processing threads are called work items. To understand how they're used, consider the nested loop in Listing 6.

Listing 6. Processing threads
for(i=a; i<A; i++) {
   for(j=b; j<B; j++) {
      for(k=c; k<C; k++) {
         process_data(i, j, k);
      }
   }
}

The process_data function accepts three indices: i, j, and k. Each has a different offset (a, b, and c), and a different maximum limit (A, B, and C). Therefore, the index ranges are A - a, B - b, and C - c. OpenCL terminology refers to the index ranges as sizes and the number of indices as dimensionality. Therefore, this loop has three dimensions, and the dimensions have sizes A - a, B - b, and C - c, respectively.
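
To make the terminology concrete, here is a minimal sketch in plain JavaScript that derives the dimensionality and sizes from a loop nest's bounds. The indexSpace helper and its return shape are hypothetical and are not part of WebCL.

```javascript
// Hypothetical helper: derive dimensionality and sizes from loop bounds.
// offsets are the starting indices (a, b, c); limits are the exclusive upper bounds (A, B, C).
function indexSpace(offsets, limits) {
  if (offsets.length !== limits.length) {
    throw new Error("offsets and limits must have the same length");
  }
  // Each size is the upper bound minus the offset in that dimension.
  const sizes = offsets.map((start, d) => limits[d] - start);
  return { dim: offsets.length, offsets: offsets.slice(), sizes: sizes };
}

// Example: for (i = 2; i < 10; i++) for (j = 0; j < 4; j++) ...
const space = indexSpace([2, 0], [10, 4]);
console.log("dimensions:", space.dim, "sizes:", space.sizes.join(" x "));
```

For the two-level loop in the example, the helper reports two dimensions with sizes 8 and 4.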

WebCL's great advantage is its usage of work items to execute iterations of functions like process_data, instead of time-intensive loops. Work items have dimensionality (1, 2, or 3), and each has a unique identifier that corresponds to a specific iteration of the loop. For example, if work items are executed with two dimensions, one work item might execute a kernel with a unique ID of (4, 5), while another executes a kernel with an ID of (5, 4). And remember, these work items execute in parallel.

To configure work items in WebCL, enqueueNDRangeKernel must be executed, instead of enqueueTask. This is the most important function in the WebCL application programming interface, and it accepts six arguments:

  • kernel - The IWebCLKernel corresponding to the function to be executed.
  • dim - The number of dimensions of the work items used to execute the function.
  • offsets - An array containing the initial values of the work items' identifiers.
  • global_sizes - An array containing the total number of work items in each dimension.
  • local_sizes - An array containing the number of work items in a work group in each dimension.
  • wait_list - An array containing the IWebCLEvent structures that make up the command's wait list.

As an example, consider the loop in Listing 7.

Listing 7. Configuring an enqueueNDRangeKernel
for(i=5; i<50; i++) {
   for(j=6; j<60; j++) {
      for(k=7; k<70; k++) {
         process_data(i, j, k);
      }
   }
}

The dimensionality is 3, the offsets are (5, 6, 7), and the work item sizes are (45, 54, 63), because each size is the loop's upper limit minus its offset. To execute process_data using WebCL, you would create an IWebCLKernel for the function and execute the kernel by invoking enqueueNDRangeKernel, as follows:

queue.enqueueNDRangeKernel(kernel, 3, [5, 6, 7], [45, 54, 63], [], []);
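
Conceptually, the device visits the same index space as the loop nest, with one work item per iteration. The following sketch emulates that serially in plain JavaScript, using sizes computed as upper limit minus offset (50 - 5, 60 - 6, 70 - 7). The forEachWorkItem helper is hypothetical and only illustrates which IDs are generated; a real device runs them in parallel and in no guaranteed order.

```javascript
// Serial emulation of a 3-D index space (hypothetical helper, not a WebCL call).
// Calls fn once for every work item ID that the loop nest would visit.
function forEachWorkItem(offsets, sizes, fn) {
  for (let i = offsets[0]; i < offsets[0] + sizes[0]; i++) {
    for (let j = offsets[1]; j < offsets[1] + sizes[1]; j++) {
      for (let k = offsets[2]; k < offsets[2] + sizes[2]; k++) {
        fn(i, j, k);
      }
    }
  }
}

let count = 0;
let last = null;
forEachWorkItem([5, 6, 7], [45, 54, 63], (i, j, k) => {
  count++;            // one call per work item
  last = [i, j, k];   // remember the most recent ID
});
console.log(count);          // 153090 work items (45 * 54 * 63)
console.log(last.join(",")); // 49,59,69
```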

The last two arguments are set to empty arrays. Work groups and events are fascinating capabilities of WebCL, but they lie beyond the scope of this article. The next section explains how to code kernel functions.


Kernel coding

A WebCL kernel represents a function intended to be executed on a GPU or other compliant device. Its overall structure is as follows:

__kernel void func_name(parameter_list) {
   ...lines of code...
}

Each kernel declaration starts with __kernel, and its return value is always void. This means that the function's input/output data must be provided through its parameter list. Kernel functions are coded in OpenCL C, a language based on C99, and developers can access a number of OpenCL-specific data types and functions.

Kernel data types

Kernels can access C's basic data types and process data with greater speed using vector data types. These types are given as typen, where type is a basic data type and n is the number of elements of that type in the vector (usually 2, 4, 8, or 16).

For example, a float4 is a vector that contains four floats. If the device supports float4 operations, each operation on the vector will affect all of the vector's elements at once. If not, the kernel compiler will break the operation down into operations the device can support.

Initializing a vector is similar to initializing an array, but parentheses (()) are used instead of curly brackets ({}). The code in Listing 8 shows how two vectors, an int4 and a char16, are declared and initialized.

Listing 8. Declaring and initializing two vectors
int4 vector_a = (int4)(1, 2, 3, 4);
char16 vector_b = (char16)('H', 'e', 'l', 'l', 'o',
    'P', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'e', 'r', '!');

Operators and math functions

Kernels can use nearly all of the operators available in C, and on vectors they work element by element. For example, if vec_a and vec_b have the same vector type, vec_a * vec_b returns a vector containing the element-wise products. Relational and logical operators are also available, and (5 > 3) returns the same value in a kernel as it does in a regular C function. If vectors are involved in a relational operation, the elements of the result vector are set to -1 when the result is true and 0 when the result is false.
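
The behavior of vector operators can be emulated in plain JavaScript. The sketch below is an illustration only: mul4 and isGreater4 are hypothetical helpers that mimic what an int4 multiplication and an int4 comparison produce, including the -1/0 convention for relational results.

```javascript
// Emulates element-wise vector arithmetic and comparison on int4-like data.
// These helpers are hypothetical stand-ins for what the device computes.
function mul4(a, b) {
  // Multiply corresponding elements.
  return Int32Array.from(a, (x, i) => x * b[i]);
}

function isGreater4(a, b) {
  // Relational result: -1 where the comparison is true, 0 where it is false.
  return Int32Array.from(a, (x, i) => (x > b[i] ? -1 : 0));
}

const vecA = new Int32Array([1, 2, 3, 4]);
const vecB = new Int32Array([4, 3, 2, 1]);

console.log(Array.from(mul4(vecA, vecB)).join(","));       // 4,6,6,4
console.log(Array.from(isGreater4(vecA, vecB)).join(",")); // 0,0,-1,-1
```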

If a C function doesn't involve math, logic, or comparison, it probably can't be accessed inside a kernel. But kernels can invoke nearly every function in the standard math library, from acos to sqrt. These functions accept both vectors and scalars. So to compute the absolute value of a float4 called vec, you would call fabs(vec) as if vec were just another number.

Work item functions

When a work item executes a kernel, its first task usually involves determining its ID. It may also access the kernel's dimensionality and the total number of work items executing the kernel. This is accomplished with the following functions:

  • get_global_id(int i) - Returns the work item's ID in dimension i.
  • get_work_dim() - Returns the number of dimensions of the work items executing the kernel.
  • get_global_size(int i) - Returns the number of work items executing the kernel in dimension i.

The kernel function in Listing 9 shows how work item functions are used in practice.

Listing 9. Work item functions in practice
__kernel void example(__global float* in_array, __global float* out_array) {
   int id = get_global_id(0);
   int N = get_global_size(0);
   out_array[id] = in_array[id]/N; 
}

The work items executing this kernel are one-dimensional, and there is one work item for each element in in_array and out_array. When the kernel executes, each work item reads an element of in_array, divides it by the total number of work items, and stores the result in out_array. This, in a nutshell, is how WebCL kernels work. The next section uses a similar process to search for text.
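
To see what each work item computes, the kernel in Listing 9 can be emulated serially in plain JavaScript. This is a sketch under the assumption of one work item per array element; the loop variable stands in for get_global_id(0) and the array length for get_global_size(0).

```javascript
// Serial emulation of the "example" kernel (assumption: one work item per element).
function exampleKernel(id, N, inArray, outArray) {
  outArray[id] = inArray[id] / N;
}

const inArray = new Float32Array([2, 4, 6, 8]);
const outArray = new Float32Array(inArray.length);
const N = inArray.length; // plays the role of get_global_size(0)

// On a device these iterations would run in parallel; here they run in order.
for (let id = 0; id < N; id++) { // id plays the role of get_global_id(0)
  exampleKernel(id, N, inArray, outArray);
}
console.log(Array.from(outArray).join(", ")); // 0.5, 1, 1.5, 2
```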


Searching for text with WebCL

Many web applications need to detect patterns within a large body of text as quickly as possible. This type of application is perfect for GPUs, which specialize in brute-force processing of large amounts of data. The text_search project searches for four words (that, very, time, and more) within Amy Lowell's narrative poem "The Great Adventure of Max Breuck." Figure 2 shows how this works.

Figure 2. Text search with OpenCL vectors
Comparing incoming text to a pattern

The algorithm is implemented using three files, all contained in this article's downloadable archive:

  • text_search.html - Contains the HTML for the web page and the JavaScript that deploys the kernel.
  • text_search.cl - Contains the OpenCL kernel function that performs the search (shown in Listing 10).
  • poem.txt - Contains the raw text of the poem (with extra characters inserted at the start and end).

Each work item reads 16 characters of the poem and checks for the presence of each of the four words. If a match is detected, the work item atomically increments one of four counters, which will be returned to the host. The function is shown in Listing 10.

Listing 10. Function to increment text search counters
__kernel void text_search(__global char* text,
                          __global int* match_count) {

   char16 pattern = (char16)('t', 'h', 'a', 't', 'v', 'e', 'r', 'y', 
                             't', 'i', 'm', 'e', 'm', 'o', 'r', 'e');
   char16 test_vector, check_vector;

   /* load global text into private buffer */
   test_vector = vload16(0, text + get_global_id(0));

   /* compare test vector and pattern */
   check_vector = test_vector == pattern;

   /* Check for 'that,' 'very,' 'time,' and 'more' */
   if(all(check_vector.s0123))
      atomic_inc(match_count);
   if(all(check_vector.s4567))
      atomic_inc(match_count + 1);
   if(all(check_vector.s89AB))
      atomic_inc(match_count + 2);
   if(all(check_vector.sCDEF))
      atomic_inc(match_count + 3);
}
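
The kernel's logic can be checked on the host with a serial reference version. The following plain JavaScript sketch mirrors Listing 10: each emulated work item reads a 16-character window starting at its ID and compares each 4-character slice with the matching slice of the pattern. The countMatches name and the 12-space padding are assumptions made for this illustration; the article only says that poem.txt has extra characters at the start and end. Like the kernel, it counts substrings, so time inside sometimes would also match.

```javascript
// Serial reference version of the text_search kernel (hypothetical helper).
// Returns four counters in the order: that, very, time, more.
function countMatches(body) {
  const pattern = "thatverytimemore";
  // Assumption: 12 padding characters on each side, so a word at the very start
  // or end of the body still falls inside some work item's 16-character window.
  const pad = " ".repeat(12);
  const text = pad + body + pad;
  const matchCount = [0, 0, 0, 0];
  const numWorkItems = text.length - 15; // one work item per window start

  for (let id = 0; id < numWorkItems; id++) {
    for (let w = 0; w < 4; w++) {
      const start = 4 * w; // slice 0-3, 4-7, 8-11, or 12-15 of the window
      if (text.slice(id + start, id + start + 4) === pattern.slice(start, start + 4)) {
        matchCount[w]++; // the kernel uses atomic_inc here
      }
    }
  }
  return matchCount;
}

console.log(countMatches("that time is very much more than that").join(",")); // 2,1,1,1
```

Because the loop runs serially, no atomic operation is needed; on the device, many work items may update the same counter at once, which is why the kernel calls atomic_inc.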

Figure 3 depicts the kernel's output in Firefox.

Figure 3. Text search results
Word counts in the Amy Lowell poem

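Results can be verified with the grep utility. For example, the command grep -o time poem.txt prints each occurrence of the word time in the file poem.txt on its own line, so piping the output to wc -l (grep -o time poem.txt | wc -l) returns the number of occurrences.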

The text_search host application generates 30,865 work items to execute the kernel. Depending on the GPU, not all of them will execute at the same time. This is not an important concern. As soon as one group of work items completes its execution, another group takes its place. Therefore, the answer to the popular question, "How many work items can I execute?" is "As many as you need."


Conclusion: The Future of WebCL

When you consider how much power WebCL provides, it may seem odd that there are so few browser extensions available. The problem is security. As a kernel executes, it can produce errors such as infinite loops or deadlocks. This isn't a big deal for CPUs, but as I've learned from experience, GPU errors can freeze a computer. Ctrl-C and Ctrl-Alt-Delete won't help. The system must be restarted.

To remedy this issue, GPU manufacturers have promised to add greater security to their offerings. This means users will be able to interrupt or halt GPU execution without crashing their system. If this comes into common usage, WebCL will transform from an oddity to a necessity. With WebCL's high-speed execution, browser applications will be able to process data and render graphics with the same performance as regular desktop applications.


Download

| Description | Name | Size |
|---|---|---|
| Example code for WebCL development | webcl_src.zip | 10KB |

Resources

Learn

Get products and technologies

Discuss

  • The developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
