
Fun with ALF, Part 1: Adding large matrices together

Examples show you how to use ALF to add two large matrices together on the host and accelerator partitions

Kane Scarlett, Editor, Multicore acceleration, IBM Japan
Kane Scarlett
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.

Summary:  In this Cell Broadband Engine™ (Cell/B.E.) series, learn how to use the Accelerated Library Framework (ALF) in the IBM SDK for Multicore Acceleration 3.0 to add two large matrices together. There is one example for host data partitioning and one for accelerator data partitioning. The "ALF for Cell/B.E. Programmer's Guide and API Reference, Version 3.0" (see Resources) is the source for the content.


Date:  18 Mar 2008
Level:  Introductory

Introduction

For this article, two large matrices are added together using ALF. The problem can be expressed simply as:

A[m,n] + B[m,n] = C[m,n]

where m and n are the dimensions of the matrices.

This simple example, crafted for both host and accelerator data partitioning, demonstrates how to:

  • Start the ALF runtime environment.
  • Use a task descriptor.
  • Start a task on the accelerators.
  • Create and add a work block to a task.
  • Exit the ALF runtime environment correctly.

You can use this sample as a template to build more complicated applications.

More fun with ALF

Look for more in the Fun with ALF series:

  • In "Table lookup," learn how the task context buffer is used as a large lookup table to convert the 16-bit input data to 8-bit output data.
  • In "Min-max finder," discover how you can use the task context to keep the partial computing results for each task instance and then combine these partial results into the final result.
  • In "Multiple vector dot products," uncover how to use the bundled work block distribution with the task context to handle situations where the work block cannot hold the partitioned data because of a local memory size limit.
  • In "Overlapped I/O buffer," take a look at using overlapped I/O buffers to do matrix addition.
  • In "Task dependency," check out a simple simulation in which you can use task dependency in a two-stage pipeline application.

And watch for similar "Fun with" series addressing DaCS, BLAS, and other technologies to make your Cell/B.E. programming easier.

Exploring the host data partitioning example

In this example, the host application:

  • Initializes the ALF runtime environment.
  • Creates a task descriptor.
  • Creates a task based on that task descriptor.
  • Creates work blocks with the appropriate data transfer lists that start invocations of the computational kernel on the accelerator.
  • Waits for the computational kernel to finish and exits.

The accelerator application includes a simple computational kernel that computes the addition of the two matrices. The scalar code to add two matrices for a uni-processor computer is:

float mat_a[NUM_ROW][NUM_COL]; 
float mat_b[NUM_ROW][NUM_COL]; 
float mat_c[NUM_ROW][NUM_COL]; 
int main(void) 
{
  int i,j;
  for (i=0; i<NUM_ROW; i++)
     for (j=0; j<NUM_COL; j++)
        mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
  return 0; 
}


Exploring the host data partitioning source code

The following code listings show only the relevant sections of the code. For a complete listing, refer to the ALF samples directory matrix_add/STEP1a_partition_scheme_A/common/host_partition.

An ALF host program can be logically divided into these sections:

  • Initialization
  • Task setup
  • Work block setup
  • Task wait and exit

Initialization

The following code segment shows how ALF is initialized and how accelerators are allocated for a specific ALF runtime.

alf_handle_t alf_handle;
unsigned int nodes;
int rc;

/* initialize the ALF runtime environment */
alf_init(&config_parms, &alf_handle);

/* get the number of SPE accelerators available from the Opteron */
rc = alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes);

/* set the total number of accelerator instances (in this case, SPEs) */
/* that the ALF runtime will have during its lifetime */
rc = alf_num_instances_set(alf_handle, nodes);

Task setup

This section of an ALF host program describes a task and creates the task runtime. The alf_task_desc_create function creates a task descriptor, which can be reused to create different executable tasks. The alf_task_create function creates a task that runs an SPE program named spe_add_program.

/* variable declarations */
alf_task_desc_handle_t task_desc_handle;
alf_task_handle_t task_handle;
const char* spe_image_name;
const char* library_path_name;
const char* comp_kernel_name;

/* describe a task that is executable on the SPE */
alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &task_desc_handle);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE,
 sizeof(add_parms_t));
/* the input buffer holds one H * V block from each of the two input matrices */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE,
 H * V * 2 * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE,
 H * V * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 8);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 4096);

/* provide the SPE executable, library, and kernel names */
alf_task_desc_set_int64(task_desc_handle,
 ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spe_image_name);
alf_task_desc_set_int64(task_desc_handle,
 ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_path_name);
alf_task_desc_set_int64(task_desc_handle,
 ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) comp_kernel_name);

/* task_handle is then created from task_desc_handle with alf_task_create (call not shown) */
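
The buffer sizes passed to the task descriptor have to fit in accelerator memory along with the stack and the program itself. The following is a minimal sketch of that arithmetic in plain C, not part of the ALF sample. It assumes the 256 KB SPE local store of the Cell/B.E.; the helper names, the number of buffer copies, and the reserve for code are hypothetical values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one SPE local store on the Cell/B.E.: 256 KB. */
#define LOCAL_STORE_BYTES ((size_t)256 * 1024)

/* Hypothetical helper (not an ALF call): bytes for the work block input
 * buffer, which holds one H x V block from each of the two input matrices. */
static size_t wb_in_buf_bytes(size_t h, size_t v)
{
    return h * v * 2 * sizeof(float);
}

/* Hypothetical helper: bytes for the output buffer (one H x V block). */
static size_t wb_out_buf_bytes(size_t h, size_t v)
{
    return h * v * sizeof(float);
}

/* Rough budget check. `copies` is how many sets of input and output buffers
 * are resident at once (2 when double-buffered); `reserved_bytes` is an
 * assumed allowance for code and runtime, not a documented ALF figure.
 * Returns 1 if everything fits in local store, 0 otherwise. */
static int wb_buffers_fit(size_t h, size_t v, size_t copies,
                          size_t stack_bytes, size_t reserved_bytes)
{
    assert(copies >= 1);
    size_t per_copy = wb_in_buf_bytes(h, v) + wb_out_buf_bytes(h, v);
    size_t total = copies * per_copy + stack_bytes + reserved_bytes;
    return total <= LOCAL_STORE_BYTES;
}
```

With 4-byte floats, a 512 x 4 block needs 16 KB of input and 8 KB of output, so two copies plus the 4096-byte stack from the listing and an assumed 64 KB reserve fit comfortably. A 1024 x 16 block needs 192 KB per copy and does not fit once it is double-buffered.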

Work block setup

This section shows how work blocks are created. After creating a work block, the program describes the input and output associated with it. Each work block describes a block of H * V elements from each input matrix, starting at location matrix[row][0], where H and V are the horizontal and vertical dimensions of the block.

In this example, assume that the accelerator memory can contain the two input buffers of size H * V elements, and assume that the output buffer is of size H * V. The program calls alf_wb_enqueue() to add the work block to the queue to be processed. ALF employs an immediate runtime mode. As soon as the first work block is added to the queue, the task starts processing the work block. The function alf_task_finalize closes the work block queue.

int i;
alf_wb_handle_t wb_handle;
add_parms_t parm __attribute__((aligned(128)));
parm.h = H; /* horizontal size of the block */
parm.v = V; /* vertical size of the block */

/* create the work blocks and add the parameter and I/O buffers */
for (i = 0; i < NUM_ROW; i += H) {
     alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);

     /* begin a new data transfer list for INPUT */
     alf_wb_dtl_set_begin(wb_handle, ALF_BUF_IN, 0);

     /* add H*V elements of matrix_a as input */
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_a[i][0], H * V, ALF_DATA_FLOAT);

     /* add H*V elements of matrix_b as input */
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_b[i][0], H * V, ALF_DATA_FLOAT);
     alf_wb_dtl_set_end(wb_handle);

     /* begin a new data transfer list for OUTPUT */
     alf_wb_dtl_set_begin(wb_handle, ALF_BUF_OUT, 0);

     /* add H*V elements of matrix_c as output */
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_c[i][0], H * V, ALF_DATA_FLOAT);
     alf_wb_dtl_set_end(wb_handle);

     /* pass parameters H and V to the SPU */
     alf_wb_parm_add(wb_handle, (void *) (&parm), sizeof(parm), ALF_DATA_BYTE, 0);

     /* enqueue the work block */
     alf_wb_enqueue(wb_handle);
}

/* no more work blocks will be added to this task */
alf_task_finalize(task_handle);
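
To see what the work block loop accomplishes without an ALF installation, the following plain C model may help. It is only a sketch: memcpy stands in for the data transfers, a scalar function stands in for the accelerator kernel, and the names add_block and add_by_blocks are hypothetical. It also assumes that each block spans whole rows, so that the elements of a block are contiguous in a row-major matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the accelerator kernel: the input buffer holds n floats of A
 * followed by n floats of B; the sum goes to the output buffer. */
static void add_block(const float *in, float *out, size_t n)
{
    const float *a = in;
    const float *b = in + n;
    for (size_t k = 0; k < n; k++)
        out[k] = a[k] + b[k];
}

/* Host-side model: one "work block" per group of rows_per_block rows.
 * in_buf must hold 2 * rows_per_block * num_col floats and out_buf
 * rows_per_block * num_col floats. Returns the number of blocks processed. */
static size_t add_by_blocks(const float *mat_a, const float *mat_b, float *mat_c,
                            size_t num_row, size_t num_col, size_t rows_per_block,
                            float *in_buf, float *out_buf)
{
    assert(rows_per_block > 0 && num_row % rows_per_block == 0);
    size_t n = rows_per_block * num_col;   /* elements per block */
    size_t blocks = 0;

    for (size_t i = 0; i < num_row; i += rows_per_block) {
        size_t off = i * num_col;          /* index of matrix[i][0] */
        memcpy(in_buf, mat_a + off, n * sizeof(float));      /* first input entry */
        memcpy(in_buf + n, mat_b + off, n * sizeof(float));  /* second input entry */
        add_block(in_buf, out_buf, n);                       /* "kernel" runs */
        memcpy(mat_c + off, out_buf, n * sizeof(float));     /* output entry */
        blocks++;
    }
    return blocks;
}
```

For an 8 x 4 matrix split into blocks of two rows, add_by_blocks processes four blocks and fills the whole result matrix. In the real sample, the ALF runtime performs the transfers and schedules the kernel invocations across the available accelerators rather than running them one after another.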

Task wait and exit

After all the work blocks are on the processing queue, the program waits for the accelerator to finish processing the work blocks. Then alf_exit() is called to cleanly exit the ALF runtime environment.

/* waiting for all work blocks to be done*/ 
alf_task_wait(task_handle, -1); 
/* exit ALF runtime */ 
alf_exit(alf_handle, ALF_EXIT_WAIT, -1);


Preparing the accelerator side

On the accelerator side, you need to provide the actual computational kernel that computes the addition of the two blocks of matrices. The ALF runtime on the accelerator is responsible for getting the input buffer to the accelerator memory before it runs the user-provided alf_accel_comp_kernel function. After alf_accel_comp_kernel returns, the ALF runtime is responsible for getting the output data back to host memory space. Double-buffering or triple-buffering is employed as appropriate to ensure that the latency for the input buffer to get into accelerator memory and the output buffer to get to host memory space is well covered with computation.

int alf_accel_comp_kernel(void *p_task_context,
                          void *p_parm_context,
                          void *p_input_buffer,
                          void *p_output_buffer,
                          void *p_inout_buffer,
                          unsigned int current_count,
                          unsigned int total_count)
{
  unsigned int i, cnt;
  vector float *sa, *sb, *sc;
  add_parms_t *p_parm = (add_parms_t *)p_parm_context;

  /* number of 4-float vectors in one H * V block */
  cnt = p_parm->h * p_parm->v / 4;

  /* the input buffer holds the block of A followed by the block of B */
  sa = (vector float *) p_input_buffer;
  sb = sa + cnt;
  sc = (vector float *) p_output_buffer;

  /* unrolled by four vectors, so cnt must be a multiple of 4 */
  for (i = 0; i < cnt; i += 4) {
      sc[i] = spu_add(sa[i], sb[i]);
      sc[i + 1] = spu_add(sa[i + 1], sb[i + 1]);
      sc[i + 2] = spu_add(sa[i + 2], sb[i + 2]);
      sc[i + 3] = spu_add(sa[i + 3], sb[i + 3]);
  }
  return 0;
}
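
The kernel's loop handles four vectors, that is 16 floats, per iteration, which constrains the block sizes it can process safely. The following portable sketch reproduces the same control flow without SPU intrinsics so that the constraint can be checked on any machine. The vec4f type, vec4f_add, block_size_ok, and add_kernel_model are hypothetical stand-ins, not part of ALF or the SPU language extensions.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for the SPU `vector float` type: four packed floats. */
typedef struct { float f[4]; } vec4f;

/* Stand-in for spu_add on two vector floats: element-wise addition. */
static vec4f vec4f_add(vec4f x, vec4f y)
{
    vec4f r;
    for (int k = 0; k < 4; k++)
        r.f[k] = x.f[k] + y.f[k];
    return r;
}

/* The unrolled loop consumes 4 vectors (16 floats) per iteration, so a
 * block is processed exactly only when h * v is a multiple of 16. */
static int block_size_ok(unsigned int h, unsigned int v)
{
    return (h * v) % 16 == 0;
}

/* Same control flow as the kernel: the input buffer holds cnt vectors of A
 * followed by cnt vectors of B, and the output buffer receives cnt vectors. */
static int add_kernel_model(const void *p_input_buffer, void *p_output_buffer,
                            unsigned int h, unsigned int v)
{
    assert(block_size_ok(h, v));
    unsigned int cnt = h * v / 4;
    const vec4f *sa = (const vec4f *)p_input_buffer;
    const vec4f *sb = sa + cnt;
    vec4f *sc = (vec4f *)p_output_buffer;

    for (unsigned int i = 0; i < cnt; i += 4) {
        sc[i]     = vec4f_add(sa[i],     sb[i]);
        sc[i + 1] = vec4f_add(sa[i + 1], sb[i + 1]);
        sc[i + 2] = vec4f_add(sa[i + 2], sb[i + 2]);
        sc[i + 3] = vec4f_add(sa[i + 3], sb[i + 3]);
    }
    return 0;
}
```

An 8 x 4 block (32 floats, 8 vectors) passes the check and takes two loop iterations. A 3 x 4 block (12 floats) fails it: the first iteration would touch four vectors where only three exist, which is why the model asserts on the block size before running the loop.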


Exploring the accelerator data partitioning example

It's a trick. You're nearly finished already. The host code stays the same except for the work block creation, and the task descriptor must specify that the task uses accelerator data partitioning. On the accelerator side, you must also implement the alf_accel_input_dtl_prepare and alf_accel_output_dtl_prepare functions.

For a complete listing of this sample, refer to the ALF samples directory matrix_add/common/accel_partitioning.
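
The idea behind accelerator data partitioning can be modeled in plain C. In this sketch, the host hands over only a small parameter block and the accelerator side works out which parts of the matrices to transfer. Everything here is an assumption made for illustration: block_parms_t, xfer_entry_t, input_list_prepare, and output_list_prepare are hypothetical, and the real alf_accel_input_dtl_prepare and alf_accel_output_dtl_prepare functions have their own signatures, described in the ALF programmer's guide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-work-block parameters set by the host: matrix base
 * addresses plus the position and size of this block. */
typedef struct {
    const float *mat_a;
    const float *mat_b;
    float *mat_c;
    size_t first_elem;   /* index of the block's first element */
    size_t num_elems;    /* number of elements in the block */
} block_parms_t;

/* Hypothetical transfer list entry: a host address and a byte count. */
typedef struct {
    const void *host_addr;
    size_t bytes;
} xfer_entry_t;

/* Model of the input preparation step: derive the two input transfers
 * (block of A, then block of B) from the parameters. Returns entry count. */
static size_t input_list_prepare(const block_parms_t *p, xfer_entry_t *list)
{
    assert(p != NULL && list != NULL);
    list[0].host_addr = p->mat_a + p->first_elem;
    list[0].bytes = p->num_elems * sizeof(float);
    list[1].host_addr = p->mat_b + p->first_elem;
    list[1].bytes = p->num_elems * sizeof(float);
    return 2;
}

/* Model of the output preparation step: one transfer back to matrix C. */
static size_t output_list_prepare(const block_parms_t *p, xfer_entry_t *list)
{
    assert(p != NULL && list != NULL);
    list[0].host_addr = p->mat_c + p->first_elem;
    list[0].bytes = p->num_elems * sizeof(float);
    return 1;
}
```

Compared with host data partitioning, the host no longer builds a transfer list for every work block. It fills in a few parameters, and the address arithmetic moves to the accelerator, where it runs in parallel across the accelerators.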


Resources

Learn

Get products and technologies

Discuss

