This article shows how portable an application designed with Gedae technology can be. The example starts with an application running as a simulation on a PC, ports it to a working application on a Mercury Computer Systems AdapDev system, and then ports it to a working application on a Cell/B.E. system.
Along the way, you see the work required for each step, with close attention to the architectural considerations involved in porting, and you see the performance the application achieves on each system.
This article:
- Shows you the basic application
- Looks at the simulation
- Illuminates the multiprocessor and multicore implementations
The example application involves tracking a model train as it goes in a circular path around its track. The application uses input audio data from four microphones placed in a circle around the track to locate the train in the audio field. Using this location, the application pans and tilts a camera to point at the engine of the train. An illustration of this environment is shown in Figure 1.
Figure 1. The tracking algorithm targeting a train running on a track with four microphones as sensor inputs
The algorithm is based on RADAR technology. A beamformer correlates a linear array of RADAR sensors to identify a target based on a beam of high correlation. In this application, the array is circular, so the high intensity in the correlation of the four channels forms a spot, as shown in Figure 2.
Figure 2. The correlation of the four audio channels forming a spot of high intensity
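To make the idea of spot formation concrete, here is a minimal sketch in Python. It is not the Gedae implementation: it uses a plain delay-and-sum scan as a stand-in for the correlation step, and the microphone radius, sample rate, speed of sound, search grid, and every function name are assumptions chosen purely for illustration.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s (assumed)
SAMPLE_RATE = 48_000    # Hz (assumed)
MIC_RADIUS = 1.0        # m (assumed)

# Four microphones evenly spaced on a circle around the origin (assumed layout).
MICS = [(MIC_RADIUS * math.cos(k * math.pi / 2),
         MIC_RADIUS * math.sin(k * math.pi / 2)) for k in range(4)]


def delay_samples(point, mic):
    """Propagation delay from point to mic, rounded to whole samples."""
    return round(math.dist(point, mic) / SPEED_OF_SOUND * SAMPLE_RATE)


def simulate_channels(source_pos, n=1024, seed=0):
    """Synthesize four channels: one noise source, delayed per microphone."""
    rng = random.Random(seed)
    delays = [delay_samples(source_pos, mic) for mic in MICS]
    pad = max(delays)
    emitted = [rng.uniform(-1.0, 1.0) for _ in range(n + pad)]
    # Sample t at a microphone is the emitted sample from `d` steps earlier.
    return [[emitted[t + pad - d] for t in range(n)] for d in delays]


def spot_map(channels, grid, window=256):
    """Delay-and-sum energy for every candidate position in the grid."""
    energies = {}
    for point in grid:
        delays = [delay_samples(point, mic) for mic in MICS]
        energy = 0.0
        for t in range(window):
            # Undo each channel's delay; the channels add coherently
            # only when the candidate point matches the real source.
            aligned = sum(ch[t + d] for ch, d in zip(channels, delays))
            energy += aligned * aligned
        energies[point] = energy
    return energies


# Scan a 21 x 21 grid covering the area inside the microphone circle.
grid = [(i / 10, j / 10) for i in range(-10, 11) for j in range(-10, 11)]
channels = simulate_channels(source_pos=(0.4, -0.3))
energies = spot_map(channels, grid)
peak = max(energies, key=energies.get)
print(peak)
```

With a single clean noise source, the brightest cell of the map should coincide with the simulated source position. Real microphone data is far noisier, which is why the article's application adds filtering and feedback on top of this basic step.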
After spot formation produces this audio map, a detection algorithm identifies the high-intensity peak corresponding to the train. The pan and tilt angles needed to point the camera at that peak are then computed.
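The article does not give the geometry used to turn a detected position into camera angles. As a rough sketch, assuming the camera sits at a known point above the track plane and the train stays at height zero, the angles follow from two arctangents. The camera position, the axis conventions, and the function name `pan_tilt` are all hypothetical.

```python
import math


def pan_tilt(target_xy, camera_xyz=(0.0, 0.0, 1.0)):
    """Return (pan, tilt) in degrees for a target on the z = 0 plane.

    Pan is measured counterclockwise from the +x axis; tilt is negative
    when the camera looks down. Both conventions are assumptions.
    """
    dx = target_xy[0] - camera_xyz[0]
    dy = target_xy[1] - camera_xyz[1]
    dz = 0.0 - camera_xyz[2]  # the train is assumed to sit at track level
    pan = math.degrees(math.atan2(dy, dx))
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt


print(pan_tilt((1.0, 0.0)))
print(pan_tilt((0.0, 1.0)))
```

A real pan-tilt unit would also need its own zero offsets and angle limits applied before the command is sent over the serial port.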
Because this application must continue to work in noisy environments, several approaches are used to reduce jitter and ensure smooth tracking. The input channels are run through low-pass filtering to remove frequencies outside the desired band. In the detection algorithm, several peaks are identified and tested. Feedback is used to monitor the speed and direction of the train to help rule out spurious peaks in the correlation data.
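One common way to use speed and direction feedback is to predict where the train should be in the next frame and accept only peaks near that prediction. The sketch below illustrates that idea under stated assumptions; it is not taken from the Gedae application, and the gate radius, the constant-velocity prediction, and the name `select_peak` are hypothetical.

```python
import math


def select_peak(candidates, last_pos, velocity, dt, gate=0.3):
    """Pick the strongest candidate peak close to the predicted position.

    candidates: list of ((x, y), intensity) pairs from the detector.
    last_pos, velocity: previous position (m) and velocity estimate (m/s).
    dt: time between frames (s). gate: acceptance radius (m, assumed).
    """
    predicted = (last_pos[0] + velocity[0] * dt,
                 last_pos[1] + velocity[1] * dt)
    in_gate = [(pos, intensity) for pos, intensity in candidates
               if math.dist(pos, predicted) <= gate]
    if not in_gate:
        # No plausible peak this frame: coast on the prediction.
        return predicted
    return max(in_gate, key=lambda item: item[1])[0]


candidates = [((0.9, 0.9), 10.0),   # strong but far from the predicted path
              ((0.5, 0.1), 7.0)]    # weaker but where the train should be
chosen = select_peak(candidates, last_pos=(0.4, 0.0),
                     velocity=(0.3, 0.3), dt=1 / 3)
print(chosen)
```

Here the stronger peak is rejected because it lies well outside the gate, so the tracker follows the weaker but plausible one.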
Figure 3 shows how the application looks in Gedae.
Figure 3. How the application looks in Gedae
Notice that the same flow graph was used for the simulation, quad DSP board, and Cell/B.E. processor implementations.
The application was first developed as a simulation. The environment of the train and camera was simulated, and the four channels of audio data for the microphone array were read from files. To show the results of the simulation, a 3D rendering of the scene is presented from the view angle of the camera, as shown in Figure 4.
Figure 4. The simulation, including a 3D-model rendering of the environment
With the example application running as a Gedae simulation, experiments were done on multiprocessor implementations to prepare for the move to hardware that processes real-time data. Once created, the code of a Gedae application does not have to be changed in order to partition and map it to multiple processors. For the example application, several mappings to virtual processors were tried in different configurations, and the results were analyzed in the Gedae trace table.
Implementing multiprocessor DSP
To move the example application to real-world data, it was ported to the Mercury Computer Systems (MCS) AdapDev system. The MCS AdapDev system provides an Intel® Pentium host and two quad DSP boards where each DSP is a 500 MHz AltiVec processor (see Figure 5).
- Pentium III development host: 1.26GHz; 1GB SDRAM
- Quad PowerPC® 500MHz (MCP7410): AltiVec instruction set; 2 MB L2 cache; 256 MB SDRAM; DMA engines
- RACE++ switched-fabric architecture
Figure 5. The Mercury Computer Systems AdapDev system
Physical components for the camera, gimbal, microphones, and audio analog-to-digital converter (ADC) were assembled. The application was altered to remove the artificial audio source and scene rendering, which were replaced with interfaces to the ADC (using PCI), the gimbal (using a serial port), and the camera (using USB).
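The design point here is that only the edges of the application change. The following Python sketch illustrates that separation in general terms; it is not Gedae code, and the class names, the `read_frame` callback, and the placeholder processing function are all hypothetical.

```python
from typing import Callable, Iterator, List, Protocol

Frame = List[List[float]]  # one list of samples per microphone channel


class AudioSource(Protocol):
    def frames(self) -> Iterator[Frame]: ...


class RecordedSource:
    """Simulation source: replays frames captured earlier, such as from files."""

    def __init__(self, frames: List[Frame]) -> None:
        self._frames = frames

    def frames(self) -> Iterator[Frame]:
        yield from self._frames


class LiveSource:
    """Hardware source: pulls frames from a driver callback.

    The callback is a stand-in for whatever the ADC interface provides.
    """

    def __init__(self, read_frame: Callable[[], Frame], count: int) -> None:
        self._read_frame = read_frame
        self._count = count

    def frames(self) -> Iterator[Frame]:
        for _ in range(self._count):
            yield self._read_frame()


def channel_energies(source: AudioSource) -> List[List[float]]:
    """Placeholder processing step; it never needs to know the frame origin."""
    return [[sum(x * x for x in channel) for channel in frame]
            for frame in source.frames()]


frame = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
from_file = channel_energies(RecordedSource([frame]))
from_device = channel_energies(LiveSource(lambda: frame, count=1))
print(from_file == from_device)
```

Because the processing function depends only on the `frames()` interface, swapping the recorded source for a live one leaves it untouched.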
While the sources and sinks were replaced to use real-world data, the algorithms and their coding did not require changing to create a real-time implementation. For the example, a partitioning and mapping scheme was entered into the partition table and the map partition table, as shown in Figure 6.
Figure 6. The partitioning and mapping scheme
Notice that the application was mapped to multiple DSP processors without changing the code.
The communication protocols can be tweaked using the transfer table by picking direct schedule access transfers (equivalent to DMA) and removing the blocking of the host-to-DSP transfers.
Additionally, you can use automated strip mining to optimize vectorization and improve cache utilization. These changes, which include both changing the graph to use real-world data and setting the implementation parameters, take about one day of effort. The resulting implementation can process three frames per second, which is enough to track the train at its maximum speed with few errors and little jitter.
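Gedae performs strip mining automatically, so you never write it by hand. The Python sketch below shows only the general shape of the transformation: a chain of vector operations is applied to fixed-size strips so that the intermediate result stays small. The strip size and function names are assumptions.

```python
def scale_then_square(frame, gain):
    """Whole-vector version: the intermediate is as long as the frame."""
    scaled = [gain * x for x in frame]
    return [x * x for x in scaled]


def scale_then_square_stripped(frame, gain, strip=256):
    """Strip-mined version: the intermediate is at most `strip` elements."""
    out = []
    for start in range(0, len(frame), strip):
        chunk = frame[start:start + strip]
        scaled = [gain * x for x in chunk]   # small enough to stay in cache
        out.extend(x * x for x in scaled)
    return out


frame = [float(i) for i in range(1000)]      # length is not a multiple of 256
same = scale_then_square(frame, 0.5) == scale_then_square_stripped(frame, 0.5)
print(same)
```

Both versions produce the same output; the difference is how much temporary storage is live at once, which is what matters for cache use on the DSP boards.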
One of the most powerful features of Gedae is the ease of partitioning and mapping an application to run on multiple processing elements (not to mention repartitioning and remapping). Because an application is created as a flow graph, it is easy to partition the graph into sections and then map each section to hardware. Partitioning the graph also gives the Gedae compiler information that helps it plan the application's threads and adjust for the distribution you specified.
When a graph is ready to be distributed, you first partition it. The partition table lists all the components in the flow graph. Select the components that should be broken off into a new partition, and assign them a new partition name.
Mapping those partitions to target hardware is just as easy. The map partition table lists all the partitions you just made in the partition table. For each partition, select from a predefined list the processor number to which that partition will be mapped.
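In Gedae these two tables are filled in through the GUI. Conceptually, though, they amount to two lookups, which the Python sketch below models as dictionaries. The component names, partition names, and processor numbers are hypothetical, and the structure is an illustration, not Gedae's internal representation.

```python
from collections import defaultdict

# Partition table: flow-graph component -> partition name (names hypothetical).
partition_table = {
    "adc_source": "io",
    "lowpass_ch0": "filter0",
    "lowpass_ch1": "filter1",
    "lowpass_ch2": "filter2",
    "lowpass_ch3": "filter3",
    "spot_formation": "spot",
    "detection": "io",
}

# Map partition table: partition name -> processor number (0 = host, assumed).
map_partition_table = {
    "io": 0, "filter0": 1, "filter1": 2, "filter2": 3, "filter3": 4, "spot": 4,
}


def placement(partitions, mapping):
    """Return {processor: [components]}; reject partitions with no processor."""
    unmapped = set(partitions.values()) - set(mapping)
    if unmapped:
        raise ValueError(f"unmapped partitions: {sorted(unmapped)}")
    by_processor = defaultdict(list)
    for component, partition in partitions.items():
        by_processor[mapping[partition]].append(component)
    return dict(by_processor)


layout = placement(partition_table, map_partition_table)
print(layout)

# Remapping is a one-entry change; the component definitions are untouched.
remapped = placement(partition_table, {**map_partition_table, "spot": 1})
print(remapped[1])
```

The point of the sketch is that repartitioning or remapping edits only the tables, never the components themselves.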
Many transfer methods can be made available in Gedae, from DMA to shared memory to processor-specific protocols. Through the transfer table, you can select a transfer method for each communication. The transfer methods are fully parameterized, allowing for precise specification of buffer sizes and other parameters so that the most efficient transfer can be used.
Reviewing multicore implementation
You are probably already familiar with the Cell/B.E. architecture, but here is a brief refresher:
- Power Processing Element (PPE)
- Eight Synergistic Processing Elements (SPEs): VMX SIMD instruction set; DMA engines; 256 KB local storage (LS)
- System memory
- Element interconnect bus (EIB): Over 200 GBps
Figure 7. The Cell/B.E. architecture
To illustrate support for the Cell/B.E. platform, this application was ported to the Sony PlayStation 3 (PS3). The Cell/B.E. processor in the PS3 provides a dual-threaded PPE core and six enabled SPEs. The SPEs are very efficient vector processors, but they have strict memory restrictions: each has only 256 KB of local storage and no cache. Programming a Cell/B.E. system by hand requires careful management and planning of memory and of data movement between the SPEs.
Gedae addresses memory management and data movement directly, and automating them simplifies development for a Cell/B.E. system. After the application is altered to use a USB-based ADC (the PS3 does not have a PCI slot), it can be moved easily. Optimizing it for the multicore architecture takes about two hours.
To optimize the application, the compute-intensive signal-processing portion of the application is partitioned for the six SPEs. The memory footprint of the program and data is taken into account during this process using the Schedule Parameters dialog to analyze the size of the threads that will be created for each partition. To reduce the size of the memory footprint, the automated strip-mining capability is used, allowing a set of audio vectors to be processed independently on each SPE instead of simultaneously. Additionally, a primitive that performs a column-wise sum of a matrix is identified as pushing the thread memory size over the limit. To fix the issue, the primitive is replaced with one that integrates a series of row vectors.
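The memory argument behind that last replacement can be sketched in Python. A column-wise sum that holds the whole matrix needs storage for every row, whereas integrating row vectors as they arrive needs only one incoming row and the running totals. The matrix dimensions and the 4-byte sample size below are assumptions chosen to make the arithmetic visible; the article does not state the real sizes.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local storage, as stated in the article
BYTES_PER_SAMPLE = 4            # assuming single-precision samples


def column_sum_buffered(rows):
    """Hold the full matrix, then sum each column."""
    matrix = [list(row) for row in rows]
    return [sum(column) for column in zip(*matrix)]


def column_sum_streaming(rows):
    """Integrate row vectors one at a time into a running total."""
    totals = None
    for row in rows:
        if totals is None:
            totals = list(row)
        else:
            for i, value in enumerate(row):
                totals[i] += value
    return totals


def footprint_bytes(nrows, ncols, streaming):
    """Data resident at once: the whole matrix, or one row plus the totals."""
    resident_rows = 2 if streaming else nrows
    return resident_rows * ncols * BYTES_PER_SAMPLE


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(column_sum_buffered(rows), column_sum_streaming(iter(rows)))

# Hypothetical 64 x 1024 matrix: the buffered form alone fills local storage.
print(footprint_bytes(64, 1024, streaming=False), LOCAL_STORE_BYTES)
print(footprint_bytes(64, 1024, streaming=True))
```

With these assumed dimensions the buffered form would consume the entire local store before any code or other buffers are counted, while the streaming form needs only a few kilobytes.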
The Gedae trace table is used to analyze the performance. During this process, one primitive can be identified as being slow, and you can recode it to use a unit stride. Based on the processor load, you can also alter the distribution of the work. In the final implementation, reached after about two hours of optimization, four SPEs do most of the preprocessing (one SPE per audio channel), including the band filtering of the frequency spectrum. The other two SPEs combine the data in the correlation calculation of the spot formation. The PPE performs the detection algorithm and interfaces with the I/O devices. With this implementation, the application is able to process almost 15 frames each second on the Cell/B.E. system, providing much smoother tracking of the train.
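The article does not say which primitive was recoded, so the Python sketch below uses a made-up operation (the energy of each row of a row-major matrix) purely to show what moving to a unit stride means. Python lists do not expose the speed difference, but the index arithmetic is the same one that determines memory access patterns on an SPE.

```python
def row_energies_strided(flat, nrows, ncols):
    """Column-outer loop: consecutive reads are `ncols` elements apart."""
    energies = [0.0] * nrows
    for c in range(ncols):
        for r in range(nrows):
            x = flat[r * ncols + c]
            energies[r] += x * x
    return energies


def row_energies_unit_stride(flat, nrows, ncols):
    """Row-outer loop: consecutive reads are adjacent in memory."""
    energies = [0.0] * nrows
    for r in range(nrows):
        base = r * ncols
        for c in range(ncols):
            x = flat[base + c]
            energies[r] += x * x
    return energies


flat = [float(i) for i in range(12)]  # a 3 x 4 matrix stored row by row
print(row_energies_strided(flat, 3, 4))
print(row_energies_unit_stride(flat, 3, 4))
```

Both functions return the same values; only the order in which memory is touched differs, and contiguous access is what lets vector loads and DMA transfers work efficiently.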
This article started out with a compute-intensive application that was ported to these systems:
| System | Processors | Sensors | Output | UI |
|---|---|---|---|---|
| PC-based simulation | 1 | Datafile of 4 recorded channels | Constellation display | Rendered scene |
| Multiprocessor DSP board (MCS AdapDev) | 4 × 500 MHz PowerPC AltiVec, 1 Pentium | ICS 610 ADC PCI board, 4 microphones | Directed Perception D46-17 Pan-Tilt Unit | Matrix Vision BlueFOX USB camera displayed using video for Gedae |
| Multicore system (PS3-Cell/B.E.) | 1 PPE, 6 SPEs | M-Audio Quattro USB device, 4 microphones | Directed Perception D46-17 Pan-Tilt Unit | Matrix Vision BlueFOX USB camera displayed using video for Gedae |
Here is the breakdown of the amount of time these tasks took:
- Simulation: 4 weeks of programmer time (the baseline, so no change in performance)
- DSP: 6 hours of programmer time, yielding a frame rate of 3 Hz
- PS3: 4 hours of programmer time, yielding a frame rate of almost 15 Hz
By following this example, you can reach these conclusions:
- Gedae helps you easily move the application to new hardware.
- Changes to the implementation are handled by automation and simple GUIs, not changes to code.
- With only minimal effort, you can achieve relatively high performance gains.
Learn
- Get your own copy of "Gedae Portability: From Simulation to DSPs to Cell/B.E.," which is the original whitepaper from which this article was adapted.
- This article is part of the unofficial "partners" series:
- The first article in this series is "Core partners, Part 1: Build high-performance apps for multicore processors" (developerWorks, May 2007) about the RapidMind Development Platform, which provides a simple single-source mechanism to develop portable high-performance applications for multicore processors.
- The second article in this series is "Using DDT to clean up Cell/B.E. app bugs" (developerWorks, February 2008), which describes how to use Allinea Software's Distributed Debugging Tool (DDT) to debug complete Cell/B.E. applications, including multiple threads within a single Cell/B.E. processor and among clusters of Cell/B.E. processors.
A founding member of Gedae, Steed is the head of product development. Prior to joining Gedae, Steed worked with Gedae at Lockheed Martin, where he was primarily responsible for developing the embeddable library of functions, including testing and creating a database and search utility. Since helping to found Gedae, Steed has been responsible for new product development. His most prominent project is the development of Gedae's new RTL language. Steed earned a computer science degree from Cornell University and a master's degree in computer science from North Carolina State University.
A co-founder of Gedae, Inc., in 2001, William Lundgren is the President and CEO. Lundgren started his professional career at Corning Glass Works as a product development physicist. After leaving Corning, Major Lundgren was an active member of the US Air Force Institute of Technology and the USAF Research Laboratories, where he developed new speech and audio-processing technologies. Lundgren then moved to RCA Advanced Technology Laboratories (subsequently Lockheed Martin), where he spent 16 years leading the development of Gedae and acting as the program manager for eight different projects at ATL. He earned a BS degree in physics from Rensselaer and BS and MS degrees in electrical engineering from the USAF Institute of Technology. Lundgren is ABD for his PhD in electrical engineering from the University of Pennsylvania.
A founding member of Gedae, Inc., Barnes was a Principal Member of the Engineering Staff at Lockheed Martin Advanced Technology Laboratories. At ATL, Barnes was responsible for signal-processing systems software and hardware, single-chip FFT design, design and implementation of a direct digital frequency synthesizer, mapping of algorithms to parallel hardware, OQPSK modulation and demodulation on the Thinking Machines CM-2, and development of various software tools and applications. Barnes earned a degree in electrical engineering from Lehigh University and a master's degree in computer and information science from the University of Pennsylvania.




