Minimize recoding impact, Part 1: How to make an SPE and existing code work together

Introducing a series to demonstrate how to integrate Cell/B.E. functionality into existing projects

Level: Intermediate

Jonathan Bartlett, Director of Technology, New Medio

04 Sep 2007

Traditional porting requires identifying and abstracting out the architecture-dependent code: making code endian-independent, working through minor API differences, and including the appropriate header files and libraries. While this procedure works for getting code to run on the Cell Broadband Engine™ (Cell/B.E.) processor, to actually use the extra processing elements, you have to put in extra work, including reworking the code and rethinking the build process. In this series, learn to take advantage of the Synergistic Processor Elements (SPEs) in existing code and only make a minimal impact to the existing code and build process.

Because the Cell/B.E. processor includes a Power Processing Element (the PPE), any program can be ported to it using the same procedures as for any other PowerPC®-based processor. However, this leaves all of the Synergistic Processor Elements (the SPEs) completely unused. Several issues must be dealt with when porting an application to take advantage of the SPEs:

  • The PPE and the SPEs use different instruction architectures and thus require different compilers.
  • The PPE is the only processing element that can directly talk to the operating system and manage system resources; the SPEs must all communicate with the operating system through the PPE.
  • The SPEs have direct access only to a 256KB local store. They cannot address main memory outside of this local store directly, but instead must explicitly move data in and out of the local store using DMA transfers or other mechanisms.
  • Because the PPE and SPE use different instruction architectures, the build process must separate the code for each architecture into different programs (usually one PPE program and one or more SPE programs) and then optionally embed the SPE programs into the PPE executable (this embedding can also take place at run time).
  • Because the SPE is actually running a different program, communication between functions on the PPE and the SPEs will look a lot different since it is now an out-of-process procedure call.

This article examines some ways that programmers can take advantage of the SPEs in existing code with minimal impact on the code and build process.

Identifying optimal workloads for the SPEs

The first thing to do is to identify code that you want to run on the SPEs. The three things that need to be considered for this are:

  1. The ability of the SPE to process the workload optimally
  2. The ease with which the code can be ported to the SPE
  3. The ability of the code to be parallelized in itself or with the rest of the program

Of these, the ability of the code to be parallelized is probably the least important. Under most running conditions on UNIX® operating systems, the PPE will have plenty of other work to do. Therefore, even if the program uses the same amount of processor time with or without the SPE, running it on the SPE frees the PPE to run other programs.

The biggest speed obstacle for the SPEs is code with a lot of branches. On the SPE, only one branch hint can be active at a time. The SPE has no hardware for branch prediction (it simply predicts, in the absence of a hint, that the branch will not be taken), and a mispredicted branch can cost 18 cycles. This means that code with lots of vtable lookups (like object-oriented code), lots of function calls, or lots of conditionals could in fact run slower on the SPE than on the PPE.

In addition, the SPE's power comes from its SIMD architecture: it processes 128 bits at a time. Therefore, when processing multiple values, it is best if all of the values can go through the same instructions. That is, if you are processing an array of values, you want every value to be processable by the exact same set of statements, without "if" statements changing the direction of the processing. For more on methods for high-performance SPE programming, see Resources at the end of the article.
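
As a rough illustration of replacing an "if" with straight-line code, the sketch below computes a per-element maximum using a bit mask instead of a branch. It is plain C that runs on any platform; on the SPE the same compare-then-select pattern is what intrinsics such as spu_cmpgt and spu_sel provide across a whole register. The names select_bits, max_no_branch, and max_arrays are hypothetical and are not part of this article's code.

```c
#include <assert.h>
#include <stdint.h>

/* Pick bits from b where mask is 1 and from a where mask is 0 */
uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask) {
	return (a & ~mask) | (b & mask);
}

/* Branch-free maximum of two signed 32-bit values */
int32_t max_no_branch(int32_t a, int32_t b) {
	/* (b > a) is 0 or 1; negating it gives an all-zeros or all-ones mask */
	uint32_t mask = (uint32_t)-(int32_t)(b > a);
	return (int32_t)select_bits((uint32_t)a, (uint32_t)b, mask);
}

/* Store the larger of x[i] and y[i] in out[i]. The loop body has no */
/* conditional, so every element follows the same instruction path. */
void max_arrays(const int32_t *x, const int32_t *y, int32_t *out, int n) {
	int i;
	assert(n >= 0);
	for(i = 0; i < n; i++) {
		out[i] = max_no_branch(x[i], y[i]);
	}
}
```

Both candidate results are always computed and one is discarded, which is usually cheaper on the SPE than risking a mispredicted branch.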

Another issue is the data. The SPE cannot access main memory directly; data must be moved in and out of the local store with DMA transfers. This means that pointers to main memory cannot simply be dereferenced: the data must be explicitly transferred to the local store before it is used. Essentially, every main memory pointer lookup must be handled by explicit load and store operations. This is not only difficult for the programmer, it also does not use the SPE's resources efficiently.
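
The explicit load-then-process pattern can be sketched in portable C by treating memcpy as a stand-in for a DMA transfer. This is only a simulation under that assumption: CHUNK_VALUES, fetch_chunk, and sum_in_chunks are hypothetical names, and a real SPE version would issue MFC commands and could overlap the next transfer with the current computation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical chunk size: how many floats fit in one staging buffer */
#define CHUNK_VALUES 1024

/* Stand-in for a DMA "get": copy count floats from main memory into a */
/* local buffer. On a real SPE this would be an MFC transfer. */
void fetch_chunk(float *local, const float *main_mem, int count) {
	memcpy(local, main_mem, (size_t)count * sizeof(float));
}

/* Sum an arbitrarily large array while touching only one fixed-size */
/* local buffer at a time */
float sum_in_chunks(const float *main_mem, int num_values) {
	static float local[CHUNK_VALUES];
	float sum = 0.0f;
	int done = 0;

	assert(num_values >= 0);
	while(done < num_values) {
		int i;
		int count = num_values - done;
		if(count > CHUNK_VALUES) {
			count = CHUNK_VALUES;
		}
		/* Bring the next piece into the local buffer, then process it */
		fetch_chunk(local, main_mem + done, count);
		for(i = 0; i < count; i++) {
			sum += local[i];
		}
		done += count;
	}
	return sum;
}
```

The computation never dereferences main_mem directly; it only ever reads the local buffer, which is the discipline the SPE enforces in hardware.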

Therefore, optimal data structures for SPE processing are structures which have the following properties:

  • They can be processed a piece at a time (thus allowing a double-buffering optimization for the DMA transfers).
  • The data is in one or a few contiguous blocks (so that large chunks can be imported with a few instructions).
  • The blocks of data can be processed independently (allowing the SPE to simply run through the structure and process it without having to load and unload portions of main memory continuously).
  • The data in each block can be processed in parallel (to take advantage of the SIMD instruction set).
  • A structure of arrays works much better than an array of structures (a structure of arrays makes it much easier to pack a single register with multiple values of the same type for SIMD processing).

This doesn't mean that if your data structures don't possess these properties you are out of luck. These are simply what will allow you to get the most mileage out of the SPE. If you are trying to choose which parts of your program to offload to an SPE or how to rearrange your data structures to get better performance, these guidelines will go a long way to helping you make good design choices.
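
To illustrate the structure-of-arrays guideline, the following sketch converts a small array of structures into a structure of arrays and then processes one field. The type and function names (point_aos_t, points_soa_t, aos_to_soa, sum_x_squared) and the fixed count of four points are assumptions made for the example.

```c
#include <assert.h>

#define NUM_POINTS 4

/* Array of structures: the fields of one point sit next to each other */
typedef struct {
	float x, y, z;
} point_aos_t;

/* Structure of arrays: all x values are contiguous, then all y, then all z */
typedef struct {
	float x[NUM_POINTS];
	float y[NUM_POINTS];
	float z[NUM_POINTS];
} points_soa_t;

/* Rearrange NUM_POINTS points from the AoS layout into the SoA layout */
void aos_to_soa(const point_aos_t *in, points_soa_t *out) {
	int i;
	assert(in && out);
	for(i = 0; i < NUM_POINTS; i++) {
		out->x[i] = in[i].x;
		out->y[i] = in[i].y;
		out->z[i] = in[i].z;
	}
}

/* Sum of squared x values. In the SoA layout the four x values occupy */
/* 16 contiguous bytes, the width of one 128-bit SIMD register. */
float sum_x_squared(const points_soa_t *p) {
	int i;
	float sum = 0.0f;
	for(i = 0; i < NUM_POINTS; i++) {
		sum += p->x[i] * p->x[i];
	}
	return sum;
}
```

With the array-of-structures layout, the same four x values would be spread over 48 bytes and would have to be gathered before they could share a register.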

Here's a sample program to port

The application that we will port is a little contrived, but it should serve to point out most of the issues involved in porting. The application will simply take a file of floating-point numbers (the first number is an integer telling how many numbers are in the file), compute the standard deviation of the numbers in that file, and print it out. Even though this could easily be a one-file project, I will break it out into multiple files to help show what kind of changes would be required in larger-scale projects.


Listing 1. The standard deviation header file (my_math.h)
                
#ifndef MY_MATH_H
#define MY_MATH_H

float calculate_standard_deviation(int num_values, float *values);

#endif


Listing 2. The standard deviation function (my_math.c)
                
#include <math.h>
#include <stdlib.h>
#include "my_math.h"

float calculate_standard_deviation(int num_values, float *values) {
	int i; /* counter */
	float sum = 0.0, sum_squares = 0.0;
	float avg, variance, std_dev;

	/* Loop through all the values */
	for(i = 0; i < num_values; i++) {
		sum += values[i];
		sum_squares += values[i]*values[i];
	}
	
	avg = sum / (float)num_values;
	variance = (sum_squares - (sum * avg)) / (float)num_values;
	std_dev = sqrt(variance);
	
	return std_dev;
}
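
The variance line in Listing 2 uses a single-pass shortcut rather than subtracting the mean from each value. As a quick check that the two forms agree, write n for num_values, S for sum, Q for sum_squares, and μ = S/n for avg (these symbols are introduced here only for the derivation):

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2
          = \frac{1}{n}\left(\sum_{i=1}^{n}x_i^2 \;-\; 2\mu\sum_{i=1}^{n}x_i \;+\; n\mu^2\right) \\
         &= \frac{1}{n}\left(Q - 2\mu S + \mu S\right)
          = \frac{Q - S\mu}{n},
\qquad\text{since } n\mu^2 = \mu\,(n\mu) = \mu S .
\end{aligned}
```

This is the population form (division by n rather than n - 1), matching the code. The single-pass form subtracts two potentially large numbers, so in single precision it can lose accuracy when the mean is large compared with the spread of the data.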


Listing 3. The main control program (main.c)
                
#include <stdio.h>
#include <stdlib.h>
#include "my_math.h"

int main(int argc, char **argv) {
	int i, res; /* temporaries */

	FILE *f;
	int num_values;
	float *all_values;
	float std_dev;

	if(argc != 2) {
		fprintf(stderr, "Usage: stddev input_file\n");
		exit(1);
	}

	/* Open the File */
	f = fopen(argv[1], "r");
	if(f == NULL) {
		perror("Unable to open file");
		exit(1);
	}

	/* Get the total number of values to read */
	res = fscanf(f, "%d", &num_values);
	if(res != 1) {
		fprintf(stderr, "Invalid file format.\n");
		exit(1);
	}

	/* Allocate memory for all values */
	all_values = (float *)malloc(sizeof(float) * num_values);
	if(all_values == NULL) {
		perror("Unable to allocate memory");
		exit(1);
	}

	/* Read in all of the values */
	for(i = 0; i < num_values; i++) {
		res = fscanf(f, "%f", &all_values[i]);
		if(res != 1) {
			fprintf(stderr, "Invalid file format.\n");
			exit(1);
		}
	}

	/* Perform calculation */
	std_dev = calculate_standard_deviation(num_values, all_values);

	/* Print result */
	printf("The standard deviation is %f\n", std_dev);

	return 0;
}


Listing 4. Makefile
                
OBJS = my_math.o main.o
LIBS = -lm
CFLAGS = -m32 -O3
LDFLAGS = -L.

stddev: $(OBJS)
	$(CC) $(CFLAGS) $(LDFLAGS) $(OBJS) $(LIBS) -o stddev

.c.o:
	$(CC) $(CFLAGS) -c $<

clean:
	rm -rf *.o

test: stddev
	./stddev test.dat
	


Listing 5. Test datafile (test.dat)
                
8
0.51659
0.40238
0.81590
0.14230
0.00324
0.99185
0.81089
0.00253

To build, just issue a make command; to test, just do make test.

Now when you port the code to use the SPEs, it is good to keep in mind the maintainers of the other platforms and to try not to cause problems for them. Here are a few guidelines to follow to make the impact for other platform maintainers minimal:

  • Keep the source changes as localized as possible.
  • Put code that is to run on the SPE in a file named something other than .c, like .spuc — this will make it obvious that these programs shouldn't be built by the standard compiler and will allow you to write some automatic compilation rules.
  • Embedding the SPE executable into an object file can sometimes cause problems for complicated build systems. In these cases, just keep them separate and use spe_image_open at run time. However, this can make distribution slightly more complex since it is another file that will have to be installed, but only on certain platforms.
  • Create an alternate malloc() function for creating memory blocks that will be passed to SPE functions. This way you can align and pad them appropriately.
  • It is much easier to port 32-bit applications to the SPE than 64-bit applications because this makes the pointer sizes equivalent on both platforms and it allows transmission of a pointer through a single SPE mailbox write. In any case, you must decide beforehand whether the SPE will be written for a 64-bit or a 32-bit environment since this will affect the data layouts in the program any time where pointers are involved.
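
To see why pointer width matters for the mailbox, recall that each SPE mailbox entry carries 32 bits. The sketch below shows, under the assumption of a hypothetical two-word protocol, how a 64-bit effective address would have to be split across two mailbox writes and reassembled on the other side, whereas a 32-bit address fits in one. The function names are illustrative only and are not part of libspe2.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit effective address into two 32-bit mailbox words */
void split_ea(uint64_t ea, uint32_t *hi, uint32_t *lo) {
	assert(hi && lo);
	*hi = (uint32_t)(ea >> 32);
	*lo = (uint32_t)(ea & 0xFFFFFFFFu);
}

/* Reassemble the address on the receiving side */
uint64_t join_ea(uint32_t hi, uint32_t lo) {
	return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* A 32-bit address needs only one word because the upper half is zero */
int fits_in_one_word(uint64_t ea) {
	return (ea >> 32) == 0;
}
```

In a 32-bit build every pointer passes the fits_in_one_word test, which is what lets the library in this article send an address with a single mailbox write.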

A simplistic SPE RPC library

The first part of the port is to create a simplistic SPE remote procedure call library. This library handles loading SPE programs and sending them requests. The easiest and most general way to do this is to give each SPE context/program one function to perform. The SPE program merely sits in an infinite loop waiting for data to process. On each iteration, it waits for a pointer to arrive in its mailbox, DMAs in the input parameters from that pointer, and marshalls those parameters to the real function. The real function then executes and hands the result back to the marshaller, which DMAs the result back to main memory and signals the completion of the operation.

The PPE program will have a stub function that will do the following:

  • Maintain a static pointer to the SPE program context that performs the work of the function.
  • Load/start the function when the function is first run using the wrapper function spe_remote_function_start (it is started in the function itself so as to minimize the impact on the rest of the code).
  • Create a structure to pass the parameters.
  • Call the SPE program using the thin RPC system with the function spe_remote_call and pass it the address of the parameter structure.
  • Return the result of the function which is now back in the parameter structure.
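
The division of labor between the stub and the SPE loop can be mimicked on a single processor. The following is a host-only simulation, not SPE code: a one-slot variable stands in for the mailbox, an ordinary function call stands in for the SPE picking up the request, and all names (add_params_t, fake_mailbox, worker_step, add_stub) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Parameter block shared by caller and worker; the result travels back in it */
typedef struct {
	int a;
	int b;
	int result;
} add_params_t;

/* One-slot stand-in for the SPE inbound mailbox */
static add_params_t *fake_mailbox = NULL;

/* The "real" function that does the work */
static int add_impl(int a, int b) {
	return a + b;
}

/* One iteration of the worker loop: take the pointer from the mailbox, */
/* unpack the parameters, call the real function, store the result, */
/* and report completion through *status */
void worker_step(int *status) {
	add_params_t *params = fake_mailbox;
	assert(params != NULL);
	fake_mailbox = NULL;
	params->result = add_impl(params->a, params->b);
	*status = 1;
}

/* Caller-side stub: same signature the original function would have */
int add_stub(int a, int b) {
	add_params_t params;
	int status = 0;

	params.a = a;
	params.b = b;
	params.result = 0;

	fake_mailbox = &params;   /* "send" the request */
	worker_step(&status);     /* in the real design the SPE runs this */
	assert(status == 1);      /* the stub waits for completion */

	return params.result;
}
```

Callers of add_stub see an ordinary function; everything specific to the transport stays inside the stub, which is the property the real library aims for.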

Here is the library header file:


Listing 6. Header file (speport.h)
                
#ifndef SPE_PORT_H
#define SPE_PORT_H

/* Since we rewrite malloc with the preprocessor, we need to make sure */
/* we include its header file first */
#include <stdlib.h>

/* Alignment macros */
#define SPE_ALIGNMENT 16
#define SPE_ALIGNMENT_FULL 128
#define SPE_ALIGN __attribute__((aligned(16)))
#define SPE_ALIGN_FULL __attribute__((aligned(128)))
#define ROUND_UP_ALIGN(value, alignment) \
 (((value) + ((alignment) - 1)) & (~((alignment) - 1)))

/* Redefine malloc to use our own version */
#define malloc(x) spe_aligned_malloc((x))

/* Hide the PPE header info from the SPE */
#ifndef __SPU__

/* Makes it easier to dereference integers from a */
/* pointer and a local store base address */
#define SPE_DEREF_UINT32(base, offset) *((unsigned int *)(((char *)(base)) + (offset)))

#include <pthread.h>
#include <libspe2.h>

/* Basic process information */
typedef struct {
	char *spe_filename;
	void *initialization_data;
	spe_program_handle_t *spe_image;
	spe_context_ptr_t spe_context;
	pthread_t spe_thread;
} spe_remote_function_t;
typedef spe_remote_function_t *spe_remote_function_ptr_t;

/* Functions */

/* Malloc() for PPE programs allocating data to pass to SPE programs */
void *spe_aligned_malloc(unsigned int size);

/* Initialize an SPE function - initialization_data is passed to the */
/* SPE program as argp */
spe_remote_function_ptr_t spe_remote_function_start(char *spe_program_filename, 
 void *initialization_data);

/* Terminate an SPE function (not normally used; */
/* it is automatically terminated on exit) */
void spe_remote_function_kill(spe_remote_function_ptr_t);

/* Run the SPE function; must be passed an integer pointer for status and can be */
/* called in blocking or nonblocking mode */
int spe_remote_call(spe_remote_function_ptr_t spe_func, void *arguments, int runflags, 
 int *status_ptr);

/* For non-blocking calls, use this function to wait for them to complete */
int spe_wait_completion(volatile int *status_ptr, int busy_wait);

/* Run flags */
#define SPE_RUN_NONBLOCK 1
#define SPE_RUN_BLOCKING 0
#endif /* __SPU__ */

#endif /* SPE_PORT_H */
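
To make the alignment arithmetic concrete, the following sketch repeats the ROUND_UP_ALIGN macro from Listing 6 and pairs it with a hypothetical checked allocator and alignment test (checked_aligned_malloc and is_aligned are not part of the article's library). It assumes a POSIX system that provides posix_memalign and a power-of-two alignment, which the bit-mask trick requires.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SPE_ALIGNMENT 16
/* Round value up to the next multiple of alignment (a power of two) */
#define ROUND_UP_ALIGN(value, alignment) \
 (((value) + ((alignment) - 1)) & (~((alignment) - 1)))

/* Hypothetical allocator: returns NULL on failure instead of an */
/* uninitialized pointer */
void *checked_aligned_malloc(size_t size) {
	void *data = NULL;
	size_t padded = ROUND_UP_ALIGN(size, (size_t)SPE_ALIGNMENT);
	if(posix_memalign(&data, SPE_ALIGNMENT, padded) != 0) {
		return NULL;
	}
	return data;
}

/* Nonzero when ptr sits on an alignment-byte boundary */
int is_aligned(const void *ptr, size_t alignment) {
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

For example, a 10-byte request is padded to 16 bytes and a 17-byte request to 32, so a DMA transfer of the whole block always covers a multiple of 16 bytes.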

So now look at how you would use this in the standard deviation program to make calls to the SPE processor. First of all, you would need a header file to define the parameter structure used to pass the data. The file looks like this:


Listing 7. Standard deviation parameter header for the SPE (my_math_spe.h)
                
#ifndef MY_MATH_SPE_H
#define MY_MATH_SPE_H

#include "speport.h"

typedef struct {
	int num_values SPE_ALIGN;
	float *values SPE_ALIGN;
	float result SPE_ALIGN;
} spe_std_dev_params_t;
#endif
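
The effect of SPE_ALIGN on each member can be checked directly. The sketch below redeclares the parameter structure locally (so that it compiles without speport.h) and verifies that every field starts on its own 16-byte boundary. It assumes a GCC-compatible compiler for the __attribute__ syntax; check_params_layout is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

#define SPE_ALIGN __attribute__((aligned(16)))

typedef struct {
	int num_values SPE_ALIGN;
	float *values SPE_ALIGN;
	float result SPE_ALIGN;
} spe_std_dev_params_t;

/* Abort if any member does not begin on a 16-byte boundary or if the */
/* structure size is not a multiple of 16 */
void check_params_layout(void) {
	assert(offsetof(spe_std_dev_params_t, num_values) % 16 == 0);
	assert(offsetof(spe_std_dev_params_t, values) % 16 == 0);
	assert(offsetof(spe_std_dev_params_t, result) % 16 == 0);
	assert(sizeof(spe_std_dev_params_t) % 16 == 0);
}
```

Because each member is padded out to 16 bytes, the whole structure can be moved with one DMA transfer whose size is a multiple of 16. The member offsets are the same in 32-bit and 64-bit builds; only the width of the values pointer differs, which is why the pointer size has to be settled before the SPE side is written.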

Then, to call the function from the PPE, you would do this:


Listing 8. Changes to call SPE standard deviation function (in my_math.c)
                
...include files...

/* USE_SPE is a define we can pass to tell it to use or not use SPE-specific functions */
#ifndef USE_SPE
float calculate_standard_deviation(int num_values, float *values) {
	....original function here for other platforms....
}
#else
/* SPE-specific includes (stdio.h is needed for fprintf in the stub) */
#include <stdio.h>
#include "speport.h"
#include "my_math_spe.h"

/* Stub for SPE function */
float calculate_standard_deviation(int num_values, float *values) {
	/* Initialize to NULL so we know on the first run to initialize it */
	static spe_remote_function_ptr_t std_dev_func = NULL;
	/* Parameter struct to call the SPE function */
	spe_std_dev_params_t params SPE_ALIGN;
	/* Status variable for the RPC */
	int status SPE_ALIGN;

	/* Start up the SPE process if this is our first run */
	if(std_dev_func == NULL) {
		std_dev_func = spe_remote_function_start("./spe_std_dev", NULL);
		if(std_dev_func == NULL) {
			fprintf(stderr, "Error starting thread!\n");
			exit(1);
		}
	}

	/* Make parameters */
	params.num_values = num_values;
	params.values = values;

	/* Call the function */
	if(spe_remote_call(std_dev_func, &params, SPE_RUN_BLOCKING, &status) < 0) {
		fprintf(stderr, "Error running function\n");
		exit(1);
	}

	/* Return the result */
	return params.result;
}
#endif

As you can see, this is a very thin but intuitive interface for making SPE function calls. It requires that both sides decide on how the data will be formatted, but overall it provides convenience where it is most needed.

The SPE side follows the following procedure:

  1. Perform any needed initialization (none in this case).
  2. Read the parameter address and the status address from the mailbox (spe_remote_call sends them through the mailbox, and the read blocks until data is actually available).
  3. Load the parameter structure into the SPE local store.
  4. Call the main function itself.
  5. Write back the answer into main memory.
  6. Write back the success status code into main memory.
  7. Go back to step two.

Now take a look at how the SPE program is coded:


Listing 9. SPE implementation of the standard deviation function (spe_std_dev.spuc)
                
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include <math.h>
#include "my_math.h"
#include "my_math_spe.h"

/* The maximum number of values is limited by the size of the DMA transfer */
#define MAX_VALUES (16384 / sizeof(float))
/* All of our DMA transfers will use this tag */
#define DEFAULT_DMA_TAG 0

float spe_calculate_standard_deviation(int num_values, float *values_ea);

int main(unsigned long long spe_id, unsigned long long argvp) {
	int status SPE_ALIGN;

	/* All DMA transfers in this program are using tag 0 */
	mfc_write_tag_mask(1<<DEFAULT_DMA_TAG);

	/* This just sits in a loop and waits for requests to perform this function */
	while(1) {
		/* Block until we are given mailbox parameters */
		unsigned int address = spu_read_in_mbox();
		unsigned int status_address = spu_read_in_mbox();

		/* Marshall in parameters */
		spe_std_dev_params_t params;
		spu_mfcdma32(&params, address, sizeof(spe_std_dev_params_t), 
              DEFAULT_DMA_TAG, MFC_GET_CMD);
		spu_mfcstat(MFC_TAG_UPDATE_ALL);

		/* Check boundaries */
		if(params.num_values > MAX_VALUES) {
			/* Report error */
			status = -1;
			spu_mfcdma32(&status, status_address, sizeof(int), 
                    DEFAULT_DMA_TAG, MFC_PUT_CMD);
		} else {
			/* Perform task */
			params.result = spe_calculate_standard_deviation(params.num_values, 
                    params.values);
			
			/* Write back results */
			spu_mfcdma32(&params, address, sizeof(spe_std_dev_params_t), 
                    DEFAULT_DMA_TAG, MFC_PUTB_CMD);

			/* Send status notification */
			status = 1;
			spu_mfcdma32(&status, status_address, sizeof(int), 
                    DEFAULT_DMA_TAG, MFC_PUTB_CMD);
		}
	}
}

/* Actual function implementation */
float spe_calculate_standard_deviation(int num_values, float *values_ea) {
	int i;
	float sum = 0.0, sum_squares = 0.0;
	float avg, variance, std_dev;
	/* DMA buffers must be at least 16-byte aligned; 128 bytes is preferred */
	static float ls_values[MAX_VALUES] SPE_ALIGN_FULL;

	/* Load in values from main memory pointer */
	spu_mfcdma32(ls_values, (unsigned int)values_ea, num_values * sizeof(float), 
        DEFAULT_DMA_TAG, MFC_GET_CMD);
	spu_mfcstat(MFC_TAG_UPDATE_ALL);

	/* Loop through all the values */
	for(i = 0; i < num_values; i++) {
		sum += ls_values[i];
		sum_squares += ls_values[i]*ls_values[i];
	}
	
	avg = sum / (float)num_values;
	variance = (sum_squares - (sum * avg)) / (float)num_values;
	std_dev = sqrt(variance);
	
	return std_dev;	
}

So, while the marshalling code is a little annoying to write, it is all fairly straightforward.
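
The marshalling code silently relies on the MFC's size and alignment rules: a transfer is 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16 KB, and the local store and main memory addresses must be suitably aligned. A small validity check, written here as portable C with hypothetical names (dma_args_ok, padded_count), can catch violations before a transfer is issued:

```c
#include <assert.h>
#include <stdint.h>

#define DMA_MAX_BYTES 16384u

/* Returns 1 if (ls, ea, size) satisfies the transfer rules described */
/* above, 0 otherwise. A size of zero is treated as invalid in this sketch. */
int dma_args_ok(uint32_t ls, uint64_t ea, uint32_t size) {
	if(size == 1 || size == 2 || size == 4 || size == 8) {
		/* Small transfers: naturally aligned, and both addresses */
		/* must share the same low four bits */
		return (ls % size == 0) && ((ls & 15u) == (uint32_t)(ea & 15u));
	}
	if(size == 0 || size % 16 != 0 || size > DMA_MAX_BYTES) {
		return 0;
	}
	/* Larger transfers: both addresses on 16-byte boundaries */
	return (ls & 15u) == 0 && (ea & 15u) == 0;
}

/* Round a count of floats up to a multiple of four, so that the */
/* byte count is a multiple of 16 */
int padded_count(int num_values) {
	assert(num_values >= 0);
	return (num_values + 3) & ~3;
}
```

If a count is padded this way, the buffer in main memory has to be allocated with the rounded-up size, and the extra elements must not influence the result (for example, by zero-filling them and still dividing by the original count).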

Now that you've seen what the library can do, look at how it is coded:


Listing 10. Library implementation (speport.c)
                
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include "speport.h"

static void *spe_remote_function_thread(void *);

/* Allocate aligned memory - same call interface as malloc() for */
/* easy use in existing programs */
void *spe_aligned_malloc(unsigned int size) {
	void *data = NULL;
	/* Align the memory and make sure that we round up the size to */
	/* an aligned multiple; posix_memalign returns nonzero on failure */
	if(posix_memalign(&data, SPE_ALIGNMENT, ROUND_UP_ALIGN(size, SPE_ALIGNMENT)) != 0) {
		return NULL;
	}
	return data;
}

/* Initialize the SPE program from the given file */
spe_remote_function_ptr_t spe_remote_function_start(char *spe_program_filename, 
  void *initialization_data) {
	int retval;
	spe_remote_function_ptr_t spe_func;
	
	/* Allocate the structure */
	spe_func = (spe_remote_function_ptr_t)malloc(sizeof(spe_remote_function_t));
	/* Save the filename (we don't currently use it, but we keep it anyway) */
	spe_func->spe_filename = spe_program_filename;
	/* Save the initialization data for when we create the thread */
	spe_func->initialization_data = initialization_data;
	/* Save the SPE image */
	spe_func->spe_image = spe_image_open(spe_program_filename);
	if(spe_func->spe_image == NULL) {
		return NULL;
	}
	/* Create the context */
	spe_func->spe_context = spe_context_create(SPE_EVENTS_ENABLE|SPE_MAP_PS, NULL);
	if(spe_func->spe_context == NULL) {
		return NULL;
	}
	/* Load the SPE image into the context */
	spe_program_load(spe_func->spe_context, spe_func->spe_image);

	/* Create and start the thread for the SPE to execute in */
	retval = pthread_create(&spe_func->spe_thread, NULL, 
        spe_remote_function_thread, spe_func);

	if(retval) {
		return NULL;
	} else {
		return spe_func;
	}
}

/* This is the function that pthread_create calls */
void *spe_remote_function_thread(void *data) {
	spe_remote_function_ptr_t spe_func = (spe_remote_function_ptr_t)data;
	unsigned int entry_point = SPE_DEFAULT_ENTRY;
	int retval;

	/* Switch to running on the SPE */
	retval = spe_context_run(spe_func->spe_context, &entry_point, 0, 
        spe_func->initialization_data, NULL, NULL);

	/* spe_context_run returns a negative value on error */
	if(retval < 0) {
		perror("Error running SPE thread");
	}

	pthread_exit(NULL);
}

/* Force kill an SPE function (normally not needed) */
void spe_remote_function_kill(spe_remote_function_ptr_t spe_func)  {
	pthread_cancel(spe_func->spe_thread);
	pthread_join(spe_func->spe_thread, NULL);
	spe_context_destroy(spe_func->spe_context);
}

/* Perform the thunk to call the SPE */
int spe_remote_call(spe_remote_function_ptr_t spe_func, void *argument_ptr, int runflags, 
  int *status_ptr) {
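	/* Mailbox entries are 32 bits wide, so convert each pointer to a */
	/* 32-bit word explicitly (this assumes a 32-bit build, as with -m32) */
	unsigned int argument_word = (unsigned int)(unsigned long)argument_ptr;
	unsigned int status_word = (unsigned int)(unsigned long)status_ptr;

	/* Initialize the status pointer */
	*status_ptr = 0;

	/* Send a pointer to the arguments through the mailbox */
	spe_in_mbox_write(spe_func->spe_context, &argument_word, 1, 
        SPE_MBOX_ALL_BLOCKING);

	/* Send a pointer to the status through the mailbox */
	spe_in_mbox_write(spe_func->spe_context, &status_word, 1, 
        SPE_MBOX_ALL_BLOCKING);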
	
	/* If this is a blocking call, wait until it completes */
	if(runflags == SPE_RUN_BLOCKING) {
		return spe_wait_completion(status_ptr, 0);
	} else {
		return 0;
	}
}

/* Wait until a call has finished by monitoring *status_ptr */
/* (finished when *status_ptr != 0) */
int spe_wait_completion(volatile int *status_ptr, int busy) {
	int status;
	while(1) {
		status = *status_ptr;
		if(status) {
			return status;
		} else {
			if(!busy) {
				/* If we are not in busy wait mode, yield the processor */
				sched_yield();
			}
		}
	}
}

Again, mostly straightforward stuff if you are familiar with libspe2. (If you are not, see Resources.)

Now, for main.c, you just need to conditionally include speport.h to override the malloc() function.


Listing 11. Using the new malloc() in main.c
                
#ifdef USE_SPE
#include "speport.h"
#endif

...rest of program...	

Now, the Makefile just needs a few more entries:


Listing 12. Improved makefile
                
OBJS = my_math.o main.o
LIBS = -lm
CFLAGS = -m32 -O3
LDFLAGS = -L.

ifdef USE_SPE
CFLAGS += -DUSE_SPE
LIBS += -lspeport -lspe2 -lpthread
#NOTE - the "-x c" is because we changed the extension, so we have to tell the compiler 
#to use the C frontend.
SPU_CC = spu-gcc -x c
SPU_CFLAGS = -O3
SPU_LDFLAGS = -L.
SPU_LIBS = -lm
endif

stddev: $(OBJS)
	$(CC) $(CFLAGS) $(LDFLAGS) $(OBJS) $(LIBS) -o stddev

libspeport.a: speport.c
	$(CC) $(CFLAGS) -c speport.c
	ar rc libspeport.a speport.o

spe_std_dev: spe_std_dev.spuc
	$(SPU_CC) $(SPU_CFLAGS) spe_std_dev.spuc $(SPU_LDFLAGS) $(SPU_LIBS) -o spe_std_dev

.c.o:
	$(CC) $(CFLAGS) -c $<

clean:
	rm -rf *.o *.a spe_std_dev

test: stddev
	./stddev test.dat

So now you can build the project with the following commands:

make clean
USE_SPE=1 make libspeport.a
USE_SPE=1 make spe_std_dev
USE_SPE=1 make stddev
make test

This will rebuild the program using the new SPE function. And since you have written it as a nice port, if you want to rebuild it without the SPE functionality, you can rerun the same process without the USE_SPE=1 and it will build for the PPE only (and on other processors).

In conclusion

This article covered the basics about how to port existing applications to the Cell/B.E. processor's SPEs. We implemented a small porting library (which you may feel free to use in your own projects), showed the necessary Makefile changes, and showed how to create a program which can be compiled with or without SPE support by simply modifying an environment variable.

This program still has numerous problems. We do not yet have a great speed advantage over the PPE code—they are fairly equivalent. It does allow the PPE to be available for other processing, but there are still numerous tweaks that can be done to speed up the code, not the least of which is splitting it up among multiple SPEs. In addition, the SPE code is currently limited to data sets that are a multiple of four values and max out at 4,096 values. In the next article, I will show you how to rectify each of these limitations. However, I hope this gave you a good idea of where to get started in porting an application to the Cell/B.E. processor.



Resources

Learn

Get products and technologies

Discuss


About the author

Jonathan Bartlett is the author of the book Programming from the Ground Up which is an introduction to programming using Linux assembly language. He is the lead developer at New Medio, developing Web, video, kiosk, and desktop applications for clients.






Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both and are used under license therefrom. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, or service names may be trademarks or service marks of others.