Fun with ALF, Part 1: Adding large matrices together

Examples show you how to use ALF to add two large matrices together on the host and accelerator partitions

Level: Introductory

Kane Scarlett (kane@us.ibm.com), developerWorks Editor, IBM

18 Mar 2008

In this Cell Broadband Engine™ (Cell/B.E.) series, learn how to use the Accelerated Library Framework (ALF) in the IBM SDK for Multicore Acceleration 3.0 to add two large matrices together. There is one example for host data partitioning and one for accelerator data partitioning. The "ALF for Cell/B.E. Programmer's Guide and API Reference, Version 3.0" (see Resources) is the source for the content.

Introduction

For this article, two large matrices are added together using ALF. The problem can be expressed simply as:

A[m,n] + B[m,n] = C[m,n]

where m and n are the dimensions of the matrices.

This simple example, crafted for both host and accelerator data partitioning, demonstrates how to:

  • Start the ALF runtime environment.
  • Use a task descriptor.
  • Start a task on the accelerators.
  • Create and add a work block to a task.
  • Exit the ALF runtime environment correctly.

You can use this sample as a template to build more complicated applications.

More fun with ALF

Look for more in the Fun with ALF series:

  • In "Table lookup," learn how the task context buffer is used as a large lookup table to convert the 16-bit input data to 8-bit output data.
  • In "Min-max finder," discover how you can use the task context to keep the partial computing results for each task instance and then combine these partial results into the final result.
  • In "Multiple vector dot products," uncover how to use the bundled work block distribution with the task context to handle situations where the work block cannot hold the partitioned data because of a local memory size limit.
  • In "Overlapped I/O buffer," take a look at using overlapped I/O buffers to do matrix addition.
  • In "Task dependency," check out a simple simulation in which you can use task dependency in a two-stage pipeline application.
And watch for similar "Fun with" series addressing DaCS, BLAS, and other technologies to make your Cell/B.E. programming easier.

Exploring the host data partitioning example

In this example, the host application:

  • Initializes the ALF runtime environment.
  • Creates a task descriptor.
  • Creates a task based on that task descriptor.
  • Creates work blocks with the appropriate data transfer lists that start invocations of the computational kernel on the accelerator.
  • Waits for the computational kernel to finish and exits.

The accelerator application includes a simple computational kernel that computes the addition of the two matrices. The scalar code to add two matrices for a uni-processor computer is:

float mat_a[NUM_ROW][NUM_COL]; 
float mat_b[NUM_ROW][NUM_COL]; 
float mat_c[NUM_ROW][NUM_COL]; 
int main(void) 
{
  int i,j;
  for (i=0; i<NUM_ROW; i++)
     for (j=0; j<NUM_COL; j++)
        mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
  return 0; 
}



Exploring the host data partitioning source code

The following code listings show only the relevant sections of the code. For a complete listing, refer to the ALF samples directory matrix_add/STEP1a_partition_scheme_A/common/host_partition.

An ALF host program can be logically divided into these sections:

  • Initialization
  • Task setup
  • Work block setup
  • Task wait and exit

Initialization

The following code segment shows how ALF is initialized and how accelerators are allocated for a specific ALF runtime.

alf_handle_t alf_handle; 
unsigned int nodes; 
int rc; 

/* initialize the ALF runtime environment */ 
alf_init(&config_parms, &alf_handle); 

/* get the number of SPE accelerators available to the host */ 
rc = alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes); 

/* set the total number of accelerator instances (in this case, SPE) */ 
/* the ALF runtime will have during its lifetime */ 
rc = alf_num_instances_set(alf_handle, nodes);

Task setup

This section of an ALF host program contains information about the description of a task and the creation of the task runtime. The alf_task_desc_create function creates a task descriptor. This descriptor can be used multiple times to create different executable tasks. The function alf_task_create creates a task to run an SPE program with the name spe_add_program.

/* variable declarations */ 
alf_task_desc_handle_t task_desc_handle; 
alf_task_handle_t task_handle; 
const char* spe_image_name; 
const char* library_path_name; 
const char* comp_kernel_name; 

/* describing a task that's executable on the SPE */ 
alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &task_desc_handle); 
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0); 
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE,
 sizeof(add_parms_t)); 
/* the input buffer holds one H * V block of each input matrix */ 
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE,
 H * V * 2 * sizeof(float)); 
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE,
 H * V * sizeof(float)); 
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 8); 
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 4096); 

/* providing the SPE executable name */ 
alf_task_desc_set_int64(task_desc_handle,
 ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spe_image_name); 
alf_task_desc_set_int64(task_desc_handle,
 ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_path_name); 
alf_task_desc_set_int64(task_desc_handle,
 ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) comp_kernel_name);
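
The buffer sizes set in the task descriptor follow directly from the block dimensions, so it can help to check the arithmetic before picking H and V. The following is a minimal sketch in plain C, not part of the ALF sample: the function names are hypothetical, and the 256 KB figure is the size of an SPE local store, which the kernel code, stack, and runtime must also share, so the usable budget is smaller in practice.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 256 KB SPE local store. The kernel code, the stack, and
   the ALF runtime also live there, so the real budget is smaller. */
#define LOCAL_STORE_BYTES (256u * 1024u)

/* Bytes for one work block's input buffer: an H*V block of A
   followed by an H*V block of B. */
size_t wb_in_bytes(size_t h, size_t v)
{
    return h * v * 2 * sizeof(float);
}

/* Bytes for one work block's output buffer: an H*V block of C. */
size_t wb_out_bytes(size_t h, size_t v)
{
    return h * v * sizeof(float);
}

/* Total buffer bytes when `copies` sets of buffers are kept in flight
   (2 models double-buffering, 3 models triple-buffering). */
size_t wb_total_bytes(size_t h, size_t v, size_t copies)
{
    assert(copies >= 1);
    return copies * (wb_in_bytes(h, v) + wb_out_bytes(h, v));
}

/* Returns 1 if the buffers fit within `budget` bytes, 0 otherwise. */
int wb_fits(size_t h, size_t v, size_t copies, size_t budget)
{
    return wb_total_bytes(h, v, copies) <= budget;
}
```

With 4-byte floats, a 32 x 32 block needs 8192 bytes of input and 4096 bytes of output, which leaves plenty of room for double-buffering; a 128 x 128 block needs 196608 bytes for a single set of buffers and no longer fits twice.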

Work block setup

This section shows how work blocks are created. After the program has created the work block, it describes the input and output associated with each work block. Each work block contains the input description for blocks in the input matrices of size H * V starting at location matrix[row][0] with H and V representing the horizontal and vertical dimensions of the block.

In this example, assume that the accelerator memory can contain the two input buffers of size H * V elements, and assume that the output buffer is of size H * V. The program calls alf_wb_enqueue() to add the work block to the queue to be processed. ALF employs an immediate runtime mode. As soon as the first work block is added to the queue, the task starts processing the work block. The function alf_task_finalize closes the work block queue.

alf_wb_handle_t wb_handle; 
add_parms_t parm __attribute__((aligned(128))); 
parm.h = H; /* horizontal size of the block */ 
parm.v = V; /* vertical size of the block */ 

/* creating work blocks and adding parameter & io buffer */ 
for (i = 0; i < NUM_ROW; i += H) { 
     alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle); 

     /* begins a new Data Transfer List for INPUT */ 
     alf_wb_dtl_set_begin(wb_handle, ALF_BUF_IN, 0); 

     /* Add H*V element of mat_a as Input */ 
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_a[i][0], H * V, ALF_DATA_FLOAT); 

     /* Add H*V element of mat_b as Input */ 
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_b[i][0], H * V, ALF_DATA_FLOAT); 
     alf_wb_dtl_set_end(wb_handle); 

     /* begins a new Data Transfer List OUTPUT */ 
     alf_wb_dtl_set_begin(wb_handle, ALF_BUF_OUT, 0); 

     /* Add H*V element of mat_c as Output */ 
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_c[i][0], H * V, ALF_DATA_FLOAT); 
     alf_wb_dtl_set_end(wb_handle); 

     /* pass parameters H and V to spu */ 
     alf_wb_parm_add(wb_handle, (void *) (&parm), sizeof(parm), ALF_DATA_BYTE, 0); 

     /* enqueuing work block */ 
     alf_wb_enqueue(wb_handle); 
} 
alf_task_finalize(task_handle);
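
To make the effect of the input data transfer list concrete, here is a small model in plain C. It is only a sketch under stated assumptions: dtl_entry_model_t, gather_model, and work_block_count are hypothetical names, not ALF types or calls, and the model ignores alignment and DMA details. It shows that entries are copied in the order they were added, so the block of matrix_a lands first in the input buffer and the block of matrix_b follows it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one data transfer list entry:
   a host address plus a count of floats. Not an ALF type. */
typedef struct {
    const float *host_addr;
    size_t       count;
} dtl_entry_model_t;

/* Copy every entry, in order, into one contiguous buffer, which is how
   the input buffer is laid out in this model when the kernel sees it.
   Returns the number of floats written. */
size_t gather_model(const dtl_entry_model_t *entries, size_t n_entries,
                    float *buf, size_t buf_capacity)
{
    size_t used = 0;
    for (size_t e = 0; e < n_entries; e++) {
        assert(used + entries[e].count <= buf_capacity);
        memcpy(buf + used, entries[e].host_addr,
               entries[e].count * sizeof(float));
        used += entries[e].count;
    }
    return used;
}

/* Number of work blocks a loop of the form
   for (i = 0; i < num_row; i += h) creates,
   assuming h divides num_row evenly. */
size_t work_block_count(size_t num_row, size_t h)
{
    assert(h > 0 && num_row % h == 0);
    return num_row / h;
}
```

This ordering is why a kernel can locate the second operand at a fixed offset from the start of the input buffer.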

Task wait and exit

After all the work blocks are on the processing queue, the program waits for the accelerator to finish processing the work blocks. Then alf_exit() is called to cleanly exit the ALF runtime environment.

/* waiting for all work blocks to be done*/ 
alf_task_wait(task_handle, -1); 
/* exit ALF runtime */ 
alf_exit(alf_handle, ALF_EXIT_WAIT, -1);



Preparing the accelerator side

On the accelerator side, you need to provide the actual computational kernel that computes the addition of the two blocks of matrices. The ALF runtime on the accelerator is responsible for getting the input buffer to the accelerator memory before it runs the user-provided alf_accel_comp_kernel function. After alf_accel_comp_kernel returns, the ALF runtime is responsible for getting the output data back to host memory space. Double-buffering or triple-buffering is employed as appropriate to ensure that the latency for the input buffer to get into accelerator memory and the output buffer to get to host memory space is well covered with computation.

int alf_accel_comp_kernel(void *p_task_context, 
            void *p_parm_context, 
            void *p_input_buffer, 
            void *p_output_buffer, 
            void *p_inout_buffer, 
            unsigned int current_count, 
            unsigned int total_count) 
{
  unsigned int i, cnt;
  vector float *sa, *sb, *sc;
  add_parms_t *p_parm = (add_parms_t *)p_parm_context;
  /* number of 4-float vectors in one H * V block */
  cnt = p_parm->h * p_parm->v / 4;
  sa = (vector float *) p_input_buffer;
  /* the block of the second matrix follows the first in the input buffer */
  sb = sa + cnt;
  sc = (vector float *) p_output_buffer;
  /* unrolled by four, so cnt is expected to be a multiple of 4 */
  for (i = 0; i < cnt; i += 4) {
      sc[i] = spu_add(sa[i], sb[i]);
      sc[i + 1] = spu_add(sa[i + 1], sb[i + 1]);
      sc[i + 2] = spu_add(sa[i + 2], sb[i + 2]);
      sc[i + 3] = spu_add(sa[i + 3], sb[i + 3]);
  }
  return 0;
}
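
For checking the kernel logic on a machine without SPU intrinsics, a scalar model can be useful. The sketch below is an assumption-laden stand-in, not the sample's code: add_parms_model_t and add_kernel_model are hypothetical names, and the struct layout is guessed from the two fields (h and v) the listing uses. It relies on the same buffer layout as the kernel above, with the block of the first matrix followed by the block of the second.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter layout; the listing only shows fields h and v. */
typedef struct {
    unsigned int h;
    unsigned int v;
} add_parms_model_t;

/* Scalar model of the kernel: the input buffer holds h*v floats of A
   followed by h*v floats of B; the output buffer receives h*v floats of C. */
int add_kernel_model(const add_parms_model_t *p_parm,
                     const float *p_input_buffer,
                     float *p_output_buffer)
{
    size_t n = (size_t)p_parm->h * p_parm->v;
    const float *sa = p_input_buffer;
    const float *sb = sa + n;   /* B starts right after A */

    /* The SPU version handles 4 floats per vector and unrolls by 4,
       so it covers the block exactly only when n is a multiple of 16. */
    assert(n % 16 == 0);

    for (size_t i = 0; i < n; i++)
        p_output_buffer[i] = sa[i] + sb[i];
    return 0;
}
```

The assertion documents a constraint that the vectorized loop leaves implicit: with four floats per vector and four vectors per iteration, H * V should be a multiple of 16.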



Exploring the accelerator data partitioning example

It's a trick. You're really finished. The code remains the same on the host, except for the work block creation. Oh, also, the code needs to specify that it uses accelerator data partitioning in the task descriptor. An implementation for the alf_accel_input_dtl_prepare and alf_accel_output_dtl_prepare functions is required.

For a complete listing of this sample, refer to the ALF samples directory matrix_add/common/accel_partitioning.
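
The idea behind accelerator data partitioning can be modeled in plain C: the host sends only a small parameter block, and the accelerator side works out which parts of host memory to transfer. The sketch below is illustrative only. Every name in it (part_parms_model_t, dtl_entry_model_t, input_dtl_prepare_model, output_dtl_prepare_model) is hypothetical, it does not use the ALF macros or the real alf_accel_input_dtl_prepare signature, and it assumes each block is a run of whole rows in a row-major matrix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter block: the host sends base addresses and the
   block position instead of building the transfer lists itself. */
typedef struct {
    float *a_base;    /* start of matrix A in host memory */
    float *b_base;    /* start of matrix B in host memory */
    float *c_base;    /* start of matrix C in host memory */
    size_t num_col;   /* floats per row (row-major layout) */
    size_t row;       /* first row of this block */
    size_t h;         /* number of rows in this block */
} part_parms_model_t;

/* Hypothetical transfer entry: host address plus a count of floats. */
typedef struct {
    float *host_addr;
    size_t count;
} dtl_entry_model_t;

/* Model of the input-side prepare step: derive the two input entries
   from the parameters. Returns the number of entries written. */
size_t input_dtl_prepare_model(const part_parms_model_t *p,
                               dtl_entry_model_t *entries, size_t capacity)
{
    size_t offset = p->row * p->num_col;
    size_t count  = p->h * p->num_col;
    assert(capacity >= 2);
    entries[0].host_addr = p->a_base + offset;
    entries[0].count     = count;
    entries[1].host_addr = p->b_base + offset;
    entries[1].count     = count;
    return 2;
}

/* Model of the output-side prepare step: one entry for the C block. */
size_t output_dtl_prepare_model(const part_parms_model_t *p,
                                dtl_entry_model_t *entries, size_t capacity)
{
    assert(capacity >= 1);
    entries[0].host_addr = p->c_base + p->row * p->num_col;
    entries[0].count     = p->h * p->num_col;
    return 1;
}
```

In this model the host loop shrinks to filling in the parameter block for each work block, which matches the article's point that only work block creation changes on the host.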


Resources

Learn

Get products and technologies

Discuss


About the author

Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.






IBM and Power Architecture are trademarks of IBM Corporation in the United States, other countries, or both. Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc. Other company, product, or service names may be trademarks or service marks of others.

