
Programming high-performance applications on the Cell/B.E. processor, Part 6: Smart buffer management with DMA transfers

Exploiting double-buffering and multibuffering to keep the SPU working



Level: Intermediate

Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio

03 Apr 2007

Explore the concepts of double-buffering and multibuffering to improve code speed by parallelizing processing and data transfer, and allowing the SPE's memory flow controller (MFC) to coordinate the best order of operations for loading and storing.

The code written in the previous article in this series followed this basic pattern:

  1. The SPU queues a DMA GET to pull a portion of the problem data set from main memory to a buffer.
  2. The SPU waits for the buffer to fill up.
  3. The SPU processes the buffer.
  4. The SPU queues a DMA PUT to transmit the buffer back to main memory.
  5. The SPU waits for the buffer to finish being transmitted.
  6. If data remains, the procedure starts again.

The problem with this procedure is that it wastes a lot of good processor time. The two transmission steps do not involve the SPU at all, but only the MFC (which is part of the larger SPE). In the code written so far, the SPU has simply waited for the MFC to finish before processing anything else. Certainly we can find something for it to do while it waits.

Double-buffering

It works a lot like the doctor's office -- I know that when I get there I'm going to spend a lot of time in the waiting room. Therefore, I always bring something else to do while I'm waiting. The same principle applies to programming. Rather than waste good processor cycles waiting around for data to transmit, your code can instead have a second buffer waiting to process. Therefore, while you wait for one set of data to transmit, you can be processing another. So the new processing algorithm looks like this:

  1. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #1.
  2. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #2.
  3. The SPU waits for buffer #1 to finish filling.
  4. The SPU processes buffer #1.
  5. The SPU (a) queues a DMA PUT to transmit the contents of buffer #1 and then (b) queues a DMA GETB to execute after the PUT to refill the buffer with the next portion of data from main memory.
  6. The SPU waits for buffer #2 to finish filling.
  7. The SPU processes buffer #2.
  8. The SPU (a) queues a DMA PUT to transmit the contents of buffer #2 and then (b) queues a DMA GETB to execute after the PUT to refill the buffer with the next portion of data from main memory.
  9. Repeat starting at step 3 until all data has been processed.
  10. Wait for all buffers to finish.

Of course, this algorithm probably raises more questions than it answers. First of all, notice that you are potentially doing a lot of unnecessary work when the data runs out, since you process two buffers in each loop iteration. You could add several if statements to exit early and to stop refilling buffers once you are out of data. However, for this program I opted against it because it adds extra work to every iteration. If the code processes large datasets, the cost of each iteration matters much more than the cost of setup or teardown. Therefore, to avoid branches, I offload as much of the conditional work as possible onto the MFC and the processing function. For the data transfers, the MFC treats a zero-size request as a no-op, so I can go ahead and issue requests even if there is no data left to read. For the buffer processing, the conversion function handles a zero-sized buffer by simply returning. So all of the cases are already handled, and any branches added to weed out extra teardown steps would only slow down the common case.

Another question is how to schedule a PUT and a GET on the same buffer without causing conflicts. After each data processing step, you set up both a PUT to transfer the data to main memory and a GET to get the next batch of data. Since by default the MFC processes requests in any order it chooses, how do you force a specific ordering? As we discussed in the last article, the answer is with barriers and fences. Putting a fence on a request forces all previously issued MFC requests in the same tag group to be processed before the current request. However, it does not specify the ordering with respect to future transfers. A barrier is similar to a fence except that it enforces an ordering both with respect to previous and subsequent requests. Therefore, by sending the second request with either a fence or a barrier, you can force the MFC to process the requests in the proper order, and, because they are in the same tag group, when it comes time to use the buffer you can just wait on the completion of the whole tag group. GETB, PUTB, GETF, and PUTF are the primary fence and barrier-related DMA commands for single buffers.

Now, let's think about how you might apply this algorithm to your current uppercase-conversion code. For reference, here is the original code in convert_driver_c.c:


Listing 1. Original single-buffer MFC transfer program
                
#include <spu_intrinsics.h>
#include <spu_mfcio.h> /* constant declarations for the MFC */
typedef unsigned long long uint64;
typedef unsigned int uint32;

void convert_buffer_to_upper(char *conversion_buffer, int current_transfer_size);

#define MAX_TRANSFER_SIZE 16384
char conversion_buffer[MAX_TRANSFER_SIZE];

typedef struct {
	uint32 length __attribute__((aligned(16)));
	uint64 data __attribute__((aligned(16)));
} conversion_structure;

int main(uint64 spe_id, uint64 conversion_info_ea) {
	conversion_structure conversion_info; /* Information about the data from the PPE */

	/* New variables to keep track of where we are in the data */
	uint32 remaining_data; /* How much data is left in the whole string */
	uint64 current_ea_pointer; /* Where we are in system memory */
	uint32 current_transfer_size; /* How big the current transfer is (may
	                               * be smaller than MAX_TRANSFER_SIZE) */

	/* We are only using one tag in this program */
	mfc_write_tag_mask(1<<0);

	/* Grab the conversion information */
	mfc_get(&conversion_info, conversion_info_ea, sizeof(conversion_info), 0, 0, 0);
	spu_mfcstat(MFC_TAG_UPDATE_ALL); /* Wait for Completion */

	/* Setup the loop */
	remaining_data = conversion_info.length;
	current_ea_pointer = conversion_info.data;

	while(remaining_data > 0) {
		/* Determine how much data is left to transfer */
		if(remaining_data < MAX_TRANSFER_SIZE)
			current_transfer_size = remaining_data;
		else
			current_transfer_size = MAX_TRANSFER_SIZE;

		/* Get the actual data */
		mfc_getb(conversion_buffer, current_ea_pointer, current_transfer_size, 0, 0, 0);
		spu_mfcstat(MFC_TAG_UPDATE_ALL);

		/* Perform the conversion */
		convert_buffer_to_upper(conversion_buffer, current_transfer_size);

		/* Put the data back into system storage */
		mfc_putb(conversion_buffer, current_ea_pointer, current_transfer_size, 0, 0, 0);

		/* Advance to the next segment of data */
		remaining_data -= current_transfer_size;
		current_ea_pointer += current_transfer_size;
	}
	spu_mfcstat(MFC_TAG_UPDATE_ALL); /* Wait for Completion */
	return 0;
}

This program requires the following additional files from previous articles: convert_buffer_c.c from Part 5 and ppu_dma_main.c from Part 3 (there is another version later on in this article as well). Compile and run just like the previous article (these build commands will work for all examples in this article):

spu-gcc convert_buffer_c.c convert_driver_c.c -o spe_convert
embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
./dma_convert

To make this program double-buffered, you need to refactor the code slightly. First of all, you should keep all of the buffer-specific data together. Each buffer needs the following tied to it:

  • the address of the buffer itself
  • the effective address the buffer was filled from
  • the size of the data being processed

Therefore, create the following struct to hold all buffer-specific information:

typedef struct {
	uint32 size __attribute__((aligned(16)));
	uint64 effective_address __attribute__((aligned(16)));
	char data[MAX_TRANSFER_SIZE] __attribute__((aligned(16)));
} buffer;

Then you only need to declare a global array of two of these buffers:

buffer buffers[2];

Now, break up the conversion process into two function calls:

  1. initiate the data buffer load
  2. wait for, process, and store back the data in the buffer

You break it up this way because these are the independent units which have to be rearranged. Initiating the data load has to be called at the beginning of the program, so it needs to be separated into its own function. So here is the code for the double-buffered version of the MFC code (again, it's convert_driver_c.c):


Listing 2. Double-buffering MFC transfers
                
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

/* Constants */
#define MAX_TRANSFER_SIZE 16384

/* Data Structures */
typedef unsigned long long uint64;
typedef unsigned int uint32;
typedef struct {
	uint32 length __attribute__((aligned(16)));
	uint64 data __attribute__((aligned(16)));
} conversion_structure;

typedef struct {
	uint32 size __attribute__((aligned(16)));
	uint64 effective_address __attribute__((aligned(16)));
	char data[MAX_TRANSFER_SIZE] __attribute__((aligned(16)));
} buffer;

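/* Global Variables */
buffer buffers[2];

/* Defined in convert_buffer_c.c */
void convert_buffer_to_upper(char *conversion_buffer, int current_transfer_size);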

/* Utility Functions */
inline uint32 MIN(uint32 a, uint32 b) {
	return a < b ? a : b;
}

inline void wait_for_completion(uint32 mask) {
	mfc_write_tag_mask(mask);
	spu_mfcstat(MFC_TAG_UPDATE_ALL);
}

inline void load_conversion_info(uint64 cinfo_ea, uint64 *data_ea, uint32 *data_size) {
	conversion_structure cinfo;
	mfc_get(&cinfo, cinfo_ea, sizeof(cinfo), 0, 0, 0);
	wait_for_completion(1<<0);
	*data_size = cinfo.length;
	*data_ea = cinfo.data;
}

/* Processing Functions */
inline void initiate_transfer(uint32 buf_idx, uint64 *current_ea_pointer,
uint32 *remaining_data) {
	/* Setup buffer information */
	buffers[buf_idx].size = MIN(*remaining_data, MAX_TRANSFER_SIZE);
	buffers[buf_idx].effective_address = *current_ea_pointer;
	/* Initiate transfer using the buffer index as the DMA tag */
	mfc_getb(buffers[buf_idx].data, buffers[buf_idx].effective_address,
		buffers[buf_idx].size, buf_idx, 0, 0);
	/* Move the data pointers */
	*remaining_data -= buffers[buf_idx].size;
	*current_ea_pointer += buffers[buf_idx].size;
}

inline void process_and_put_back(uint32 buf_idx) {
	wait_for_completion(1<<buf_idx);
	/* Perform conversion */
	convert_buffer_to_upper(buffers[buf_idx].data, buffers[buf_idx].size);
	/* Initiate the DMA transfer back using the buffer index as the DMA tag */
	mfc_putb(buffers[buf_idx].data, buffers[buf_idx].effective_address,
		buffers[buf_idx].size, buf_idx, 0, 0);
}

/* Main Code */
int main(uint64 spe_id, uint64 conversion_info_ea) {
	uint32 remaining_data;
	uint64 current_ea_pointer;

	load_conversion_info(conversion_info_ea, &current_ea_pointer, &remaining_data);

	/* Start filling buffers to prepare for loop (loop assumes both buffers have
	 * data coming in) */
	initiate_transfer(0, &current_ea_pointer, &remaining_data);
	initiate_transfer(1, &current_ea_pointer, &remaining_data);

	do {
		/* Process buffer 0 */
		process_and_put_back(0);
		initiate_transfer(0, &current_ea_pointer, &remaining_data);

		/* Process buffer 1 */
		process_and_put_back(1);
		initiate_transfer(1, &current_ea_pointer, &remaining_data);
	} while(buffers[0].size != 0);

	/* Wait for the final PUTs on both buffers */
	wait_for_completion(1<<0|1<<1);
	return 0;
}

Note that this code deals almost entirely with buffers: apart from the call to convert_buffer_to_upper, almost none of it is specific to uppercase conversion. It can be reused nearly verbatim in other contexts.





Multibuffering

The generalized idea used in the previous section is called "software pipelining." That is, you divide your processing into stages that can be overlapped during execution to maximize throughput. In this case, your pipeline really has only two stages -- load/store and process. However, when generalizing this concept to other problems, any number of "pipeline stages" can be established. The basic idea is to give each pipeline stage its own buffer, and then process each buffer a stage at a time. When a software pipeline uses more than two buffers, it is called multibuffering. For the SPU, two-stage pipelines (like this one) work best for most applications because the data movement is handled by the MFC, not the processor, so the two stages can genuinely run concurrently. That concurrency between processing and data transfer is what gives two-stage pipelining its edge in SPE programming.

Besides adding pipeline stages, there is another way to take advantage of additional buffers: initiate a lot of data transfers on the MFC, and then let the MFC decide the order in which to carry them out. For example, let's say that one area of memory is currently in swap space while another one is in memory. By having lots of transfers outstanding on the MFC, the MFC can determine what the best transfer order will be. This also helps smooth out bus contention issues -- when the bus is full, the program can process the extra buffers rather than wait for the bus to free up. When the bus is free, it can refill the extra buffers. In this particular program, handling buffers this way does not affect the execution time significantly, and on some data sets it affects it negatively. However, it is nonetheless a useful example of another buffer-handling technique and, as you will see, of how to use MFC_TAG_UPDATE_ANY.

The new process will look like this:

  1. Queue DMA GETs for all buffers. Mark each buffer as "filling" if it is transferring more than zero bytes. Each buffer gets a unique DMA tag ID.
  2. If there are no buffers marked "filling," wait for all DMA PUT operations to complete and exit.
  3. Wait for a single buffer that is marked as "filling" to become filled.
  4. Process the buffer.
  5. Queue a DMA PUT to transfer the buffer back to main memory.
  6. Queue a DMA GETB to refill the buffer with more data after the existing data is stored back.
  7. If the DMA transfer in the previous step is for at least one byte (in other words, there is actually data left to transfer), mark the buffer as "filling."
  8. Go back to step 2.

In this algorithm, the order in which buffers get processed is much less deterministic. The key difficulty with this is to keep the number of branches to a minimum. The potential sources of branching are determining whether a buffer should be marked as "filling," as well as polling the buffers to find out which ones are available. Both of these can be easily avoided by careful selection of SPU intrinsics and good data structure design.

Waiting for buffers to become available is actually rather easy. After setting a mask of the buffers you are interested in with mfc_write_tag_mask, you can call spu_mfcstat(MFC_TAG_UPDATE_ANY), which waits until at least one of those buffers has no pending operations (in other words, all of its operations are finished) and then returns a mask of all such buffers. Think of this as a specialized version of the C library function select, but for DMA transfers. Now, it returns all available buffers, but you only want one. Therefore, you have to convert the mask into a single index that identifies the buffer you are going to process, and you have to do it without any branching. The SPU instruction clz (count leading zeroes, called spu_cntlz in the C language intrinsics) is perfect for this. You can translate the resulting mask into a single index by counting the leading zeroes and then subtracting that count from 31. A possible assembly language instruction sequence to do this would be:

	#assume the mask is in $10
	#Count the leading zeroes
	clz $11, $10
	#Subtract that from 31
	sfi $12, $11, 31
	#$12 now has the index of the buffer we want to use.

In C, this can be written as:

	/* buffers_completed holds the mask */
	spu_extract(
		spu_sub(
			(int32)31,
			spu_cntlz(
				spu_promote(
					(uint32)buffers_completed, 0
				)
			)
		),
		0
	);

Of course, this retrieves only one available buffer (the highest-numbered one in the mask) -- there may be more. However, those will be returned in subsequent loop iterations as well.

Now you need to determine how to store which buffers are currently "filling," and be able to set these flags without branching. The best way to store it is as a tag mask so that it can be used directly as your mask for spu_mfcstat. However, it is a little more difficult to set these bits conditionally without branching. The assembly language version looks like this:

	#$10 holds our buffer mask
	#$11 holds the size of the last transfer
	#$12 holds the index of the current buffer

	#Convert the current buffer index to a bit for a bit mask (stored in $14)
	il $13, 1
	shl $14, $13, $12

	#Turn the bit off in the original mask
	andc $10, $10, $14

	#is the last transfer greater than zero? (answer stored in $15)
	cgti $15, $11, 0

	#Turn the bit on or off based on the previous result (answer stored in $14)
	and $14, $14, $15

	#Turn the bit on based on our existing results
	or $10, $10, $14

By properly scheduling this, you can get this down to ten cycles. However, this is the type of operation the compiler can take care of. In fact, you can just write it like this, and the compiler will optimize it appropriately:

	/* clear the bit */
	*buffers_with_data &= ~(1<<buf_idx);
	/* Set the bit conditionally */
	*buffers_with_data |= (buffers[buf_idx].size > 0 ? (1<<buf_idx) : 0);

In this program, since the problem is trivially parallelizable, you can actually have as many buffers as the SPU's local store can support. Because in this program each buffer could (in theory) have two DMA transfers active for it (a store and a load), the program can have a maximum of eight buffers, since the MFC can only handle 16 pending DMA operations. Now, if you went over this limit, it would not affect the logical operation of your program. Instead, when you added the 17th DMA operation, it would simply stall the SPU until one of the outstanding operations completed, and, at that point, it would allow the program to continue its next queuing operation.

Here is the code for the new version (again, it's convert_driver_c.c):


Listing 3. Multibuffering MFC transfers
                
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
typedef unsigned long long uint64;
typedef unsigned int uint32;
typedef int int32;

/* Constants */
#define MAX_TRANSFER_SIZE 16384
#define NUM_BUFFERS 8 /* The MFC supports only 16 queued transfers,
                       * and we have up to two active per buffer */

/* Data Structures */
typedef struct {
	uint32 length __attribute__((aligned(16)));
	uint64 data __attribute__((aligned(16)));
} conversion_structure;

typedef struct {
	uint32 size __attribute__((aligned(16)));
	uint64 effective_address __attribute__((aligned(16)));
	char data[MAX_TRANSFER_SIZE] __attribute__((aligned(16)));
} buffer;

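buffer buffers[NUM_BUFFERS];

/* Defined in convert_buffer_c.c */
void convert_buffer_to_upper(char *conversion_buffer, int current_transfer_size);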

/* Utility functions */
inline uint32 MIN(uint32 a, uint32 b) {
	return a < b ? a : b;
}

/* Processes the buffer, queues a DMA transfer to put the data back, and clears out
 * the "waiting for data" bit in buffers_with_data */
inline void process_and_put_back(uint32 buf_idx, uint32 *buffers_with_data) {
	convert_buffer_to_upper(buffers[buf_idx].data, buffers[buf_idx].size);
	mfc_putb(buffers[buf_idx].data, buffers[buf_idx].effective_address,
		buffers[buf_idx].size, buf_idx, 0, 0);
	*buffers_with_data &= ~(1<<buf_idx); /* Clear out bit for this buffer */
}

/* Queues up a DMA GET transfer and, if there is any data to transfer, sets
 * the appropriate bit in buffers_with_data to indicate that we are waiting
 * for data in this buffer */
inline void initiate_transfer(uint32 buf_idx, uint32 *buffers_with_data,
uint64 *current_ea_pointer, uint32 *remaining_data) {
	/* Setup buffer */
	buffers[buf_idx].size = MIN(*remaining_data, MAX_TRANSFER_SIZE);
	buffers[buf_idx].effective_address = *current_ea_pointer;

	/* Move Data Pointers */
	*remaining_data -= buffers[buf_idx].size;
	*current_ea_pointer += buffers[buf_idx].size;

	/* Initiate transfer (does nothing if there is no data). The barrier form
	 * keeps the refill ordered after the PUT queued earlier on the same tag. */
	mfc_getb(buffers[buf_idx].data, buffers[buf_idx].effective_address,
		buffers[buf_idx].size, buf_idx, 0, 0);

	/* Set the "Buffer Waiting for Data" bit only if there is data to read */
	*buffers_with_data |= (buffers[buf_idx].size > 0 ? (1<<buf_idx) : 0);
}

/* Waits for all of the given buffers to complete */
inline void wait_for_completion(uint32 mask) {
	mfc_write_tag_mask(mask);
	spu_mfcstat(MFC_TAG_UPDATE_ALL);
}

/* Loads information about the whole conversion process */
inline void load_conversion_info(uint64 conversion_info_ea, uint64 *current_ea_pointer,
uint32 *remaining_data) {
	conversion_structure conversion_info;

	mfc_get(&conversion_info, conversion_info_ea, sizeof(conversion_info), 0, 0, 0);
	wait_for_completion(1<<0);

	*remaining_data = conversion_info.length;
	*current_ea_pointer = conversion_info.data;
}

/* Returns the index of the first buffer with data available*/
inline uint32 get_next_buffer(uint32 buffers_with_data) {
	uint32 buffers_completed; /* This will contain a mask of buffers whose
	                           * transfers have completed */

	/* These are the buffers to look for */
	mfc_write_tag_mask(buffers_with_data);

	/* Wait for at least one buffer to come available */
	buffers_completed = spu_mfcstat(MFC_TAG_UPDATE_ANY);

	/* Use "count leading zeros" to determine the buffer index from
	 * the buffers_completed mask */
	return spu_extract(
		spu_sub(
			(int32)31,
			spu_cntlz(
				spu_promote((uint32)buffers_completed, 0)
			)
		),
		0
	);
}

/* Steps are numbered according to the description in this section */
int main(uint64 spe_id, uint64 conversion_info_ea) {
	uint32 remaining_data;
	uint64 current_ea_pointer;
	uint32 buffers_with_data = 0; /* This is the bit mask for each buffer waiting on data,
	                               * used for spu_mfcstat in the main loop */
	uint32 all_buffers = 0; /* This is used to wait on all remaining transfers at
	                         * the end of the program*/
	uint32 current_buffer_idx;

	load_conversion_info(conversion_info_ea, &current_ea_pointer, &remaining_data);

	/* Step 1: Get all buffers loading (because NUM_BUFFERS is a constant, the compiler
	 *         should unroll the loop all the way) */
	for(current_buffer_idx = 0; current_buffer_idx < NUM_BUFFERS; current_buffer_idx++) {
		initiate_transfer(current_buffer_idx, &buffers_with_data,
			&current_ea_pointer, &remaining_data);
		all_buffers |= 1<<current_buffer_idx;
	}

	/* Step 2: Continue while there are still buffers pending */
	while(buffers_with_data != 0) {
		/* Step 3: Get the next buffer that gets filled */
		current_buffer_idx = get_next_buffer(buffers_with_data);
		/* Steps 4 and 5: Process the buffer and queue up a DMA transfer back to main memory */
		process_and_put_back(current_buffer_idx, &buffers_with_data);
		/* Steps 6 and 7: Queue up a buffer reload, and mark the buffer as "filling"
		 *                (by setting the appropriate bit in buffers_with_data) */
		initiate_transfer(current_buffer_idx, &buffers_with_data,
			&current_ea_pointer, &remaining_data);
	}

	/* Wait for all PUTs to complete */
	wait_for_completion(all_buffers);
	return 0;
}

This code is geared specifically to make sure that the main loop and the function call to convert_buffer_to_upper are the only mandatory branches. The other possible branches in the code are either inline functions (which can, obviously, be inlined by the compiler) or are easily branch-eliminated by the compiler. Pretty much any branch that can be reduced to the ternary operator ? : using code without side effects or non-inline function calls can be branch-eliminated by the compiler (either GCC or XLC).

Now, the PPE program that we have been using so far for testing the SPE program only uses one buffer, so it doesn't make any use of our optimizations, and it is hard to see performance differences. To see how these programs perform with larger datasets, here is a version of the ppu_dma_main.c program that uses large data sets and times the SPU:


Listing 4. Driver program to test large datasets
                
#include <stdio.h>
#include <libspe.h>
#include <errno.h>
#include <string.h>
#include <sys/time.h>
#include <malloc.h>

/* Size of Buffer -- MUST be a multiple of 16 */
#define BUF_SIZE (16 * 200000)

/* embedspu actually defines this in the generated object file, 
we only need an extern reference here */
extern spe_program_handle_t convert_to_upper_handle;

/* This is the parameter structure that our SPE code expects */
/* Note the alignment on all of the data that will be passed to the SPE is 16-bytes */
typedef struct {
	int length __attribute__((aligned(16)));
	unsigned long long data __attribute__((aligned(16)));
} conversion_structure;

int main() {
	int status = 0;
	int i;
	struct timeval initial_time, final_time;

	/* Create the string on an aligned boundary */
	char *str = memalign(16, BUF_SIZE);

	/* Fill the string with data */
	for(i = 0; i < BUF_SIZE - 1 ; i++) {
		str[i] = 'a' + i % 26;
	}

	/* Null-terminate string */
	str[BUF_SIZE - 1] = '\0';

	/* Create conversion structure on an aligned boundary */
	conversion_structure conversion_info __attribute__((aligned(16)));

	/* Set the data elements in the parameter structure */
	conversion_info.length = BUF_SIZE; /* includes the terminating null byte */
	conversion_info.data = (unsigned long long)str;

	/* Check starting time */
	gettimeofday(&initial_time, NULL);

	/* Create the thread and check for errors */
	speid_t spe_id = spe_create_thread(0, &convert_to_upper_handle, 
	&conversion_info, NULL, -1, 0);
	if(spe_id == 0) {
		fprintf(stderr, "Unable to create SPE thread: errno=%d\n", errno);
		return 1;
	}

	/* Wait for SPE thread completion */
	spe_wait(spe_id, &status, 0);

	/* Check final time */
	gettimeofday(&final_time, NULL);

	/* Print SPU execution time */
	fprintf(stderr, "%lld microseconds\n",
		((long long)final_time.tv_sec * 1000000 + final_time.tv_usec) -
		((long long) initial_time.tv_sec * 1000000 + initial_time.tv_usec));

	/* Print out result - uncomment if you really want to see it*/
	//printf("The converted string is: %s\n", str);

	return 0;
}





Conclusion

This article looked at two techniques for buffer management on the SPE -- double-buffering and multibuffering. You saw how to extend the existing code to enable it to have several buffers active at the same time and let the MFC decide the order in which they are filled, making sure at every step of the way to structure the code so as not to introduce any unnecessary branches.






About the author

Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.






Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom. Other company, product, or service names may be trademarks or service marks of others.