Skip to main content

skip to main content

developerWorks  >  Power Architecture technology | Linux  >

Minimize recoding impact, Part 2: Removing obstacles to speedy performance

Eliminate performance roadblocks as you integrate Cell/B.E. functionality into existing projects

developerWorks
Document options

Document options requiring JavaScript are not displayed

Discuss


Rate this page

Help us improve this content


Level: Intermediate

Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio

16 Oct 2007

The first article in the series describes how to do a basic port to the Cell Broadband Engine process. This second article goes further in hammering out the details, including removing limitations based on DMA-transfer size, partitioning the program across multiple SPEs, and improving the program's speed even more.

In the first article in the series, you learned the basic methodology for enhancing an application for the Cell/B.E. processor by offloading computation-intensive functions to the SPE. You developed a non-intrusive framework for doing this, which enables you to add SPE-specific code to multiplatform projects without adversely impacting the other platforms or the build system. However, the system that resulted from the article was suboptimal in several respects:

  • It only accepted value lists that were multiples of 4.
  • It accepted a maximum of 4,096 values.
  • It did not vectorize on the SPE.
  • It only used one SPE.

This article makes the code that you developed in the first article both faster and more robust.

Using basic macros

Before I begin, I'm going to define some basic macros that you will need to use often in the code for alignments, comparisons, dereferencing, and other operations. Some of these were introduced in the first article.


Listing 1. Additional macros for speport.h
                
/* Alignment macros */
#define SPE_ALIGNMENT 16
#define SPE_ALIGNMENT_FULL 128
#define SPE_ALIGN __attribute__((aligned(16)))
#define SPE_ALIGN_FULL __attribute__((aligned(128)))

/* Rounding macros */

/* Go _up_ to the next address based on an alignment */
#define ROUND_UP_ALIGN(value, alignment) (((value) + \
 ((alignment) - 1))&(~((alignment)-1)))

/* Go _down_ to the next address based on an alignment */
#define ROUND_DOWN_ALIGN(value, alignment) ((value)&(~((alignment) - 1)))

/* Divide and round up */
#define DIVIDE_ROUND_UP(value, divisor) (((value) + (divisor) - 1) / (divisor))

/* In case these aren't already defined */
#ifndef MIN
#define MIN(X,Y) (((X) < (Y)) ? (X) : (Y))
#endif
#ifndef MAX
#define MAX(X,Y) (((X) > (Y)) ? (X) : (Y))
#endif  

/* Simplify Dereferencing on the SPE (to deref other types, change the typecase at the */
/* beginning) */
#define SPE_DEREF_UINT32(base, offset) *((unsigned int *)(((char *)(base)) + (offset))) 

/* Give a 16-byte structure for "status" so that it can be used in an array without */
/* causing alignment errors */
typedef struct {
	int status SPE_ALIGN;
} spe_remote_function_status_t;

The importance of these functions will make sense as you proceed through the examples.



Back to top


Getting rid of the limits

The first and most obvious improvement that you could do to the code is to get rid of the limitations. Currently it is limited to value lists whose size are a multiple of 4 values (4 * 4 byte floats = 16 bytes) due to the alignment constraints of SPE DMA transfers. The size limitations of SPE DMA transfers limit the program to a maximum of 4,096 values. Neither of these is good.

The easiest limitation to fix is the size limitation. You can fix the size limitation by simply putting the SPE code into a while loop that processes a piece of the list at a time. This loop looks like this (should be in spe_std_dev.spuc toward the beginning of the function spe_calculate_standard_deviation):


Listing 2. Adding a loop to the program
                 
float spe_calculate_standard_deviation(int num_values, float *values_ea) {
	/* Previous Data Definitions Go Here */
	/* ..... */

	/* Calculate the number of iterations we are going to need */
	int num_values_remaining = num_values;
	float *current_values_ea = values_ea;

	while(num_values_remaining > 0) {
		int values_this_cycle = MIN(num_values_remaining, MAX_VALUES);

		/* Previous code goes here */
		/* Replace values_ea with current_values_ea */
		/* Replace num_values with values_this_cycle */
		/* ..... */

		num_values_remaining -= values_this_cycle;
		current_values_ea += values_this_cycle;

	}

	/* Rest of function from original */

	/* Finalize Computation */
	avg = sum / (float)num_values;
	variance = (sum_squares - (sum * avg)) / (float)num_values;
	std_dev = sqrt(variance);

	return std_dev;
}
                  

By adding that loop, you removed the maximum size limit to the number of values that you can process. Now you need to remove the condition in main to check for the number of values to process (the condition begins with the comment /* Check boundaries */; the else branch should be preserved).

However, you still have a problem with the number of values: it still must be a multiple of 4. To get around this, you have to DMA up to 3 values individually. The reason for this is that the DMA system can either transfer multiples of 16 bytes or transfer a naturally aligned value. That means that transferring a single four-byte value can be done as long as it is aligned to four bytes. The alignment is easy to guarantee. You already replaced the malloc with one that allocates all values on a sixteen-byte boundary.

For the transfer size, there are two ways to get the transfer integrated into the program. The first way is to modify the loop itself so that it slows down to one value at a time when it hits the unaligned portion of the list. The other way is to tack on an additional loop to handle the special cases. Also, it has to be able to handle alignment issues on the front side as well as on the back side (this will be more important when you partition lists).

To accomplish this, rewrite your loop to slow down on edge cases:


Listing 3. Modifying the loop to handle non-DMA-aligned values
                 
while(num_values_remaining > 0) {
		int values_this_cycle = MIN(num_values_remaining, MAX_VALUES);

		/* Check to see if we the leading values are unaligned.  If so, go one */
            /* value at a time until they are aligned. */
		if(((int)current_values_ea % DMA_TRANSFER_UNIT) != 0) {
			values_this_cycle = 1;
		} else {
			/* Check to see if this is a size that is transferable in one block */
			if((values_this_cycle % DMA_TRANSFER_UNIT_FLOAT) != 0) {
				/* Check to see if we can transfer a larger sub-block in a */
                        /* single transfer */
				int tmp_num_values = values_this_cycle - (values_this_cycle %
                         DMA_TRANSFER_UNIT_FLOAT);
				if(tmp_num_values > 0) {
					/* If so, use this value */
					values_this_cycle = tmp_num_values;
				} else {
					/* Otherwise, just go one value at a time */
					values_this_cycle = 1;
				}
			}
		}

		/* Load values in from main memory pointer */

		/* If we are doing one at a time, then we need to check for */
            /* alignment issues */
		if(values_this_cycle == 1) {
			/* Unaligned addresses require similarly unaligned destinations - */
                  /* calculate the offset */
			int natural_offset = (((int)current_values_ea / sizeof(float)) %
                   DMA_TRANSFER_UNIT_FLOAT);

			/* Perform the Transfer */
			spu_mfcdma32(&ls_values[natural_offset],
                   (unsigned int)current_values_ea, values_this_cycle * sizeof(float),
                   DEFAULT_DMA_TAG, MFC_GET_CMD);
			spu_mfcstat(MFC_TAG_UPDATE_ALL);

			/* Move it back into position for processing */
			ls_values[0] = ls_values[natural_offset];
		} else {
			/* Regular properly-aligned, properly-sized transfer */
			spu_mfcdma32(ls_values, (unsigned int)current_values_ea,
                   values_this_cycle * sizeof(float), DEFAULT_DMA_TAG, MFC_GET_CMD);
			spu_mfcstat(MFC_TAG_UPDATE_ALL);
		}


		for(i = 0; i < values_this_cycle; i++) {
			sum += ls_values[i];
			sum_squares += ls_values[i]*ls_values[i];
		}

		num_values_remaining -= values_this_cycle;
		current_values_ea += values_this_cycle;
	}
                  

This requires a few extra defines to be placed in speport.h:

#define DMA_TRANSFER_UNIT 16
#define DMA_TRANSFER_UNIT_FLOAT (DMA_TRANSFER_UNIT / sizeof(float))

As you can see, what this is doing is slowing down and speeding up DMA transfers, depending on the location and size of the block in memory. While this still requires natural alignment, even if the values are not properly DMA aligned, it is no longer a big deal. However, the Cell/B.E. processor requires that the last four bits match in the source and destination register, which is why there are some acrobatics for single-byte transfers. The DMA transfer speeds up when it is properly aligned enough to do large transfers, and it slows down when it is not. The data still must be four-byte aligned, but that should be a fairly easy requirement to meet. Also, the function will be much faster if the data is aligned on a 16-byte-boundary or even more for a 128-byte boundary.



Back to top


Using multiple SPEs

In the previous section, you removed the operating limits of the function. In this section, you are going to take advantage of the remaining computing power of the Cell/B.E. processor by dividing the task across multiple SPEs. Not all tasks are easily divisible among SPEs, but this one is especially easy to partition.

There are basically two parts to the program: summation (getting sums and sums of squares) and finalization (computing the final answer from the sums). Finalization is trivial, and therefore, it can be just as easily done on the PPE. Summation, because addition is commutative, can be divided up any way you like. Often an algorithm, which in its naive form does not have commutative components, can be rearranged so that its parts are commutative.

So you are going to remove the finalization from the SPE portion of the program and make it only add sums and sums of squares. The PPE will be responsible for spawning the processes, dividing up the work, collecting the results, and finalizing the solution. This requires changes to your SPE application (it needs to now return the sums rather than the final result) as well as changes to your driver program (it needs to spawn multiple instances and collect and process the answers).

The first thing you need to do is redefine the struct that is passed between the PPE and the SPE in my_math_spe.h.


Listing 4. Struct for passing partial results
                 
#ifndef MY_MATH_SPE_H
#define MY_MATH_SPE_H

#include "speport.h"

typedef struct {
	int num_values SPE_ALIGN;
	float *values SPE_ALIGN;
	float result_sum SPE_ALIGN;
	float result_sum_squares SPE_ALIGN;
} spe_std_dev_params_t;

#endif   

Now, for the SPE, you need to make a few simple changes to the way the processing function works. First, you need to redefine (and rename, because it will not perform the final calculations) the processing function in spe_std_dev.spuc so that it returns two floats rather than one. The new prototype should be:

 
void spe_calculate_sums(int num_values, float *values_ea, float *result_sum,
 float *result_sum_squares);
                  

Similarly the function definition should be:

 
void spe_calculate_sums(int num_values, float *values_ea, float *result_sum, float
 *result_sum_squares) { 
                  

Because you are returning more than one value, change the return value to a void and simply pass the addresses to the variables at the end. This needs to be altered both in the prototype declaration at the beginning of the file and in the function definition itself. Then, when you call the function, you need to pass it pointers to the two values in the struct:

 
/* Perform Task */
spe_calculate_sums(params.num_values, params.values, &params.result_sum,
&params.result_sum_squares);
                  

Then, at the end of the spe_calculate_sums, you need to set these return values instead of finalizing the calculation (the PPE will do the finalization):

 
*result_sum = sum;
*result_sum_squares = sum_squares;
return;
                  

The variables avg, variance, and std_dev are no longer needed, and their declarations can be removed.

Now the hard part: getting the PPE to split the job up appropriately.

Basically what you are going to do is to have an array of SPE contexts and a matching array of parameter structs:

  1. On the first run, get all of the contexts initialized.
  2. On every run, take the problem, split it up into work units, and then set each of the SPEs off and running with its own set of values to process.
  3. Wait for each of them to finish.
  4. As they finish, add the individual sums of the individual pieces up until you have all of the values summed together.

At this point, you have the information you need to perform the final calculation and return the answer.

Here is the new implementation of calculate_standard_deviation on the PPE (in my_math.c):


Listing 5. Adapting the PPE function to multiple SPEs
                 
                        /* ... header files ... */
#ifndef USE_SPE
/* Non-SPE Version */
/* ... */
#else
/* SPE-specific includes */
#include "speport.h"
#include "my_math_spe.h"

#define NUM_SPE_THREADS 4
float calculate_standard_deviation(int num_values, float *values) {
	/* Keep SPE Thread Contexts Here */
	static spe_remote_function_ptr_t std_dev_func[NUM_SPE_THREADS] =
       {NULL, NULL, NULL, NULL};

	/* Parameters to pass to contexts */
	spe_std_dev_params_t params[NUM_SPE_THREADS] SPE_ALIGN;
	spe_remote_function_status_t status[NUM_SPE_THREADS] SPE_ALIGN;

	/* Working variables */
	int thread_idx;
	int workunit_size = ROUND_UP_ALIGN(DIVIDE_ROUND_UP(num_values, NUM_SPE_THREADS),
       (SPE_ALIGNMENT_FULL/sizeof(float)));
	int current_workunit_size;
	int remaining_values = num_values;
	float *current_value_position = values;
	float sum = 0.0, sum_squares = 0.0, avg, variance, std_dev;

	/* Start up SPE processes if this is our first run */
	if(std_dev_func[0] == NULL) {
		for(thread_idx = 0; thread_idx < NUM_SPE_THREADS; thread_idx++) {
			std_dev_func[thread_idx] = spe_remote_function_start("./spe_std_dev",
                   NULL);
			if(std_dev_func[thread_idx] == NULL) {
				fprintf(stderr, "Error starting thread!\n");
				exit(1);
			}
		}
	}

	/* Split up the Data and Call the Function */
	for(thread_idx = 0; thread_idx < NUM_SPE_THREADS; thread_idx++) {
		current_workunit_size = MIN(workunit_size, remaining_values);

		/* Make parameters */
		params[thread_idx].num_values = current_workunit_size;
		params[thread_idx].values = current_value_position;

		/* Call the function */
		if(spe_remote_call(std_dev_func[thread_idx], &params[thread_idx],
             SPE_RUN_NONBLOCK, &status[thread_idx]) < 0) {
			fprintf(stderr, "Error running function\n");
			exit(1);
		}

		/* Move counter and pointer for next function call */
		remaining_values -= current_workunit_size;
		current_value_position += current_workunit_size;
	}

	/* Gather the results */
	for(thread_idx = 0; thread_idx < NUM_SPE_THREADS; thread_idx++) {
		/* Wait for thread to complete */
		spe_wait_completion(&status[thread_idx], 0);

		/* Gather data from parameters */
		sum += params[thread_idx].result_sum;
		sum_squares += params[thread_idx].result_sum_squares;
	}
 
	/* Finalize calculation */
	avg = sum / (float)num_values;
	variance = (sum_squares - (sum * avg)) / (float)num_values;
	std_dev = sqrt(variance);

	/* Return result */
	return std_dev;
}
#endif

As you can see, it is a bit of work, but it makes the function very fast. On large datasets, this version of the function is 4.25 times the speed of the PPE-only function. Not only that, if you had another processing job to do simultaneously on the PPE, you could do that at the same time, because this requires nothing of the PPE.

You might be wondering about the calculation of workunit size at the beginning of the function. Basically, you are dividing your work units among the available SPEs, while trying to keep each work unit as a multiple of 128 so that it operates best with the DMA. Without ROUND_UP_ALIGN, the present code can go as slow as half speed when working on a dataset that isn't a multiple of 16 values. It can go a little slower when the dataset isn't a multiple of 128 values.



Back to top


Performing other optimizations

There are several other optimizations that you can do to speed it up even more:

  • Double-buffer the SPE function.
  • Use SPE vector floating-point intrinsics in the SPE portion of the code.
  • Unroll the core loop and schedule instruction.

The first way to optimize is to double-buffer the SPE function. Right now, you are wasting a lot of time waiting around for DMA to finish. If you double-buffered, you could simultaneously load and process data. For more information about double-buffering, see Programming high-performance applications on the Cell BE processor, Part 6: Smart Buffer Management with DMA transfers.

A second way to optimize is to use SPE vector floating-point intrinsics in the SPE portion of the code. Sometimes the compiler is able to detect parallelizable code and write the appropriate vector operations for you. In this case, though, it is likely that because the sum and sum_squares variables are being modified with each iteration, the compiler does not recognize that the operations are relatively independent of each other and, therefore, serializes the operations. By explicitly coding the SIMD operations, the code will be much, much faster. For more about the SPE's vector intrinsics, see Programming high-performance applications on the Cell BE processor, Part 5: Programming the SPU in C/C++.

A third way to optimize is to unroll the core loop and schedule instruction. By unrolling the loop, you can separate out all of the operations that depend on instructions that have not yet completed, and you can organize them so that other instructions could be executed while waiting for the dependencies to finish calculating. This procedure is described in the article Programming high-performance applications on the Cell BE processor, Part 4: Program the SPU for performance.



Back to top


Conclusion

Now you should be able to apply these procedures to any projects and parallelize your applications for optimal performance on the Cell B./E. processor without adversely affecting the build system or the function interface of the rest of your application. You should now know the basics of how to program SPE applications that avoid the basic SPE limitations, such as DMA transfer size, and also how to use the Cell B./E. to its maximum potential by splitting data sets across multiple SPEs. Each application has slightly different needs in order to successfully partition the data, but hopefully this article gave you a good idea of how to do it generally.

Share this...

digg Digg this story
del.icio.us Post to del.icio.us
Slashdot Slashdot it!



Resources

Learn

Get products and technologies

Discuss


About the author

Jonathan Bartlett is the author of the book Programming from the Ground Up which is an introduction to programming using Linux assembly language. He is the lead developer at New Medio, developing Web, video, kiosk, and desktop applications for clients.




Rate this page


Please take a moment to complete this form to help us better serve you.



YesNoDon't know
 


 


12345
Not
useful
Extremely
useful
 


Back to top


IBM is a trademark of IBM Corporation in the United States, other countries, or both. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. Other company, product, or service names may be trademarks or service marks of others.