Level: Intermediate
James Widmore Steed (jsteed@gedae.com), Director of Software Development, Gedae, Inc.
William Lundgren (wlundgren@gedae.com), President and CEO, Gedae, Inc.
Kerry B. Barnes (kbarnes@gedae.com), Chief Scientist, Gedae, Inc.
08 Apr 2008
This concise study examines the portability of
applications developed in Gedae by analyzing the work required to move an example
application from a simulation on a PC, to a DSP board (the
Mercury Computer System AdapDev system), and then to a multicore Cell Broadband
Engine™ (Cell/B.E.). The article illustrates how architecture considerations were taken into account
when porting the application to each system, how much work each port required,
and how the application performed on each system.
Introduction
This article takes you on a tour of how portable an application
designed with Gedae technology can be. The example takes an application running as
a simulation on a PC, ports it to a working application running on a Mercury Computer System AdapDev
system, and then ports it to a working application running on a
Cell/B.E. system.
Each step shows the work required for the port, with close attention to
the architecture considerations involved, and reports the performance of the
application on the target system.
This article:
- Shows you the basic application
- Looks at the simulation
- Illuminates the multiprocessor and multicore implementations
Introducing the application
The example application involves tracking a model train as it goes in a circular
path around its track. The application uses input audio data from four microphones
placed in a circle around the track to locate the train in the audio field. Using
this location, the application pans and tilts a camera to point at the engine of
the train. An illustration of this environment is shown in Figure 1.
Figure 1. The tracking algorithm
targeting a train running on a track with four microphones as sensor inputs
The algorithm is based on RADAR technology. A beamformer correlates a linear
array of RADAR sensors to identify a target based on a beam of high correlation.
In this application, the array is circular, so the high intensity in the
correlation of the four channels forms a spot, as shown in Figure 2.
Figure 2. The correlation of the
four audio channels forming a spot of high intensity
After spot formation produces this audio map, a detection algorithm identifies
the high-intensity peak corresponding to the train. The pan and tilt angles are
then computed from the peak location to reposition the camera.
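To make the idea of spot formation concrete, here is a minimal sketch in Python. It uses plain delay-and-sum beamforming as a stand-in for the correlation step; the article does not say which formulation Gedae's graph uses, so treat the technique, the function names, the sample rate, and the speed-of-sound constant as assumptions made for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed value for room-temperature air)
SAMPLE_RATE = 8000      # Hz (assumed; the article does not give the rate)

def mic_positions(radius, count=4):
    """Place microphones evenly on a circle around the track."""
    return [(radius * math.cos(2 * math.pi * k / count),
             radius * math.sin(2 * math.pi * k / count)) for k in range(count)]

def delay_samples(source, mic):
    """Propagation delay from a point to a microphone, in whole samples."""
    distance = math.hypot(source[0] - mic[0], source[1] - mic[1])
    return round(distance / SPEED_OF_SOUND * SAMPLE_RATE)

def spot_intensity(channels, mics, point):
    """Align the channels as if the sound came from `point`, then sum them.

    When the guess is right, all channels add coherently and the mean
    energy of the sum is high; elsewhere they add incoherently.
    """
    delays = [delay_samples(point, m) for m in mics]
    shifts = [d - min(delays) for d in delays]
    usable = len(channels[0]) - max(shifts)
    energy = 0.0
    for n in range(usable):
        s = sum(ch[n + sh] for ch, sh in zip(channels, shifts))
        energy += s * s
    return energy / usable

def audio_map(channels, mics, grid):
    """Evaluate the intensity at every candidate point of the grid."""
    return {point: spot_intensity(channels, mics, point) for point in grid}

def detect_peak(amap):
    """Return the grid point with the highest intensity."""
    return max(amap, key=amap.get)
```

Feeding `detect_peak` the map built from four delayed copies of the same signal returns the grid point nearest the source, which is the location the camera would then be steered toward.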
Because this application must continue to work in noisy environments, several
approaches are used to reduce jitter and ensure smooth tracking. The input
channels are run through low-pass filtering to remove frequencies outside the
desired band. In the detection algorithm, several peaks are identified and tested.
Feedback is used to monitor the speed and direction of the train to help rule
out spurious peaks in the correlation data.
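The feedback step can be pictured as a gate around the position the train is expected to reach next. The sketch below is one simple way to do that, assuming the train's position is expressed as an angle around the circular track; the function names, the gate width, and the fallback of coasting on the prediction are illustrative assumptions, not details taken from the Gedae application.

```python
import math

def angle_diff(a, b):
    """Smallest signed difference a - b, wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2 * math.pi) - math.pi

def select_peak(candidates, last_angle, angular_speed, dt, gate):
    """Pick the strongest candidate peak that is consistent with the motion.

    candidates    -- list of (angle, intensity) pairs from the detector
    last_angle    -- previous accepted position of the train (radians)
    angular_speed -- estimated speed and direction (radians per second)
    dt            -- time since the previous frame (seconds)
    gate          -- maximum allowed deviation from the prediction (radians)
    """
    predicted = last_angle + angular_speed * dt
    plausible = [(angle, intensity) for angle, intensity in candidates
                 if abs(angle_diff(angle, predicted)) <= gate]
    if not plausible:
        # No believable peak in this frame: coast on the prediction.
        return predicted % (2 * math.pi)
    return max(plausible, key=lambda c: c[1])[0]
```

With this gate, a loud but implausible peak on the far side of the track is ignored in favor of a weaker peak near the predicted position.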
Figure 3 shows how the application looks in Gedae.
Figure 3. How the application
looks in Gedae
Notice that the same flow graph was used for the simulation, quad DSP board, and
Cell/B.E. processor implementations.
Introducing the simulation
The application was first developed as a simulation. The environment of the
train and camera was simulated, and the four channels of audio data for the
microphone array were read from files. To show the results of the simulation, a 3D
rendering of the scene is presented from the view angle of the camera, as shown in
Figure 4.
Gedae trace table
Gedae collects trace information with low overhead in a circular buffer kept on each processor.
Accurate clock synchronization between processors and nanosecond resolution
enable you to correctly determine the causal relationships between processors and to quickly
solve blocking problems that span processors.
The Gedae trace table provides
the information needed to optimize performance. A summary timeline for each
processor enables you to make load-balancing decisions, while timelines for each
primitive enable you to identify slow primitives or primitives needing a
granularity increase. The capability to zoom and scroll through the timeline and to
collapse and reorder the location of hierarchical boxes in the trace table
simplifies navigating the trace table for large graphs. The timeline gives you the
information you need to choose the communication method that best meets the throughput
and latency requirements of your application. After you optimize performance, you
can rerun the graph and measure the improvements.
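If the idea of a circular trace buffer is new to you, the following sketch shows the general pattern in Python. It is not Gedae's implementation; the class name, the event layout, and the summary methods are hypothetical and only illustrate how a fixed-size buffer keeps the most recent events and how per-primitive timelines can be reduced to a summary.

```python
from collections import deque

class TraceBuffer:
    """Fixed-size circular buffer of (primitive, start_ns, end_ns) events."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full,
        # so recording never allocates beyond the fixed capacity.
        self._events = deque(maxlen=capacity)

    def record(self, primitive, start_ns, end_ns):
        self._events.append((primitive, start_ns, end_ns))

    def events(self):
        return list(self._events)

    def busy_time(self):
        """Total time spent in each primitive, in nanoseconds."""
        totals = {}
        for name, start, end in self._events:
            totals[name] = totals.get(name, 0) + (end - start)
        return totals

    def slowest(self):
        """Name of the primitive with the largest total time, or None."""
        totals = self.busy_time()
        return max(totals, key=totals.get) if totals else None
```

Summing the busy time per processor in the same way gives the kind of figure you would compare when making a load-balancing decision.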
Figure 4. The simulation, including
a 3D-model rendering of the environment
Using the Gedae simulation of the example application, experiments were run on multiprocessor implementations
of the application in preparation for moving it to hardware that processes real-time data.
Once created, the code of a Gedae application does not have to be changed in order
to partition and map it to multiple processors. For the example application, several
mappings to virtual processors were used in different configurations. The
results were analyzed in the Gedae trace table.
Implementing multiprocessor DSP
To transition the example application to use real world data, it was ported to the
Mercury Computer System AdapDev system. The MCS AdapDev system provides an
Intel® Pentium host and two quad DSP boards where each DSP is a 500MHz AltiVec processor
(see Figure 5).
- Pentium III development host: 1.26GHz; 1GB SDRAM
- Quad PowerPC® 500MHz (MCP7410): AltiVec instruction set; 2 MB L2 cache; 256 MB
SDRAM; DMA engines
- RACE++ switched-fabric architecture
Figure 5. The Mercury Computer
System AdapDev system
Physical components for the camera, gimbal, microphones, and analog-to-digital
converter (ADC) were assembled. The application was altered to remove the
artificial audio source and scene rendering, which were replaced with an interface to
the ADC (using PCI), the gimbal (using a serial port), and the camera (using USB).
While the sources and sinks were replaced to use real-world data, the algorithms
and their coding did not require changing to create a realtime implementation.
For the example, a partitioning and mapping scheme was
entered into the partition and map partition tables, as shown in Figure 6.
Partitioning and transferring
One of the most powerful features of Gedae is the ease of partitioning and mapping an
application to run on multiple processing elements (not to mention repartitioning
and remapping). By creating an application as a flow graph, it is easy to
partition the graph into sections, and then map each of those sections to hardware.
This act of partitioning the graph provides information to the Gedae compiler,
which helps it plan the application's threads and adjust for the distribution you specified.
When a graph is ready to be distributed, you must first partition the graph. In
the partition table, you are presented with a table that lists all the components in the flow graph.
Simply select which components should be broken off into a new partition, and
assign them to a new partition name.
Mapping those partitions to target
hardware is just as easy. In the map partition table, you see a table listing all
the partitions you have just made in the partition table.
For each partition, select the processor number to which that partition will
be mapped from a predefined list.
Many transfer methods are
available in Gedae, from DMA to shared memory to processor-specific protocols.
Through the transfer table, you can easily select a transfer method for
each communication. The transfer methods are fully parameterized, allowing for
precise specification of buffer sizes and other parameters so that the most
efficient transfer can be used.
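One way to see why repartitioning requires no code changes is to model the partition table and map partition table as plain data that sits beside the flow graph. The sketch below does that in Python; the function names and the component and partition names in the usage note are hypothetical, and Gedae's own tables are edited through its GUI rather than written like this.

```python
def place(partition_table, map_table):
    """Resolve each component to the processor its partition is mapped to.

    partition_table -- component name -> partition name
    map_table       -- partition name -> processor number
    """
    return {component: map_table[partition]
            for component, partition in partition_table.items()}

def transfers_needed(edges, placement):
    """List the graph edges that cross a processor boundary.

    Only these edges need an entry in a transfer table; edges whose two
    ends share a processor can pass data locally.
    """
    return [(src, dst) for src, dst in edges
            if placement[src] != placement[dst]]
```

For a graph with edges such as `("adc", "filter0")` and `("filter0", "spot")`, changing only the two tables changes which edges become inter-processor transfers, while the list of edges (the application itself) stays the same.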
Figure 6. The partitioning and
mapping scheme
Notice that the application was mapped to multiple DSP processors without
changing the code.
The communication protocols can be tweaked using the transfer table by picking
direct schedule access transfers (equivalent to DMA) and removing the blocking of
the host-to-DSP transfers.
Additionally, you can use automated strip mining to optimize vectorization and
improve cache utilization. These changes, which include both changing the graph
to use real-world data and setting the implementation parameters, take about one day
of effort, and the resulting implementation can process three frames per
second, enough to track the train at its maximum speed with few
errors and little jitter.
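Strip mining itself is a general loop transformation, and a hand-written version helps show what Gedae automates. In the sketch below, a two-stage computation (scale, then sum of squares) runs over one strip at a time so that the intermediate buffer is only as large as a strip; the function names and the choice of computation are assumptions for illustration.

```python
def strips(length, strip_size):
    """Yield (start, stop) index pairs that cover range(length) in strips."""
    for start in range(0, length, strip_size):
        yield start, min(start + strip_size, length)

def scaled_energy(samples, gain, strip_size):
    """Scale the samples, then sum their squares, one strip at a time.

    The intermediate `scaled` list never holds more than strip_size
    values, so the working set can be sized to fit the cache.
    """
    total = 0.0
    for start, stop in strips(len(samples), strip_size):
        scaled = [gain * x for x in samples[start:stop]]
        total += sum(v * v for v in scaled)
    return total
```

The result does not depend on the strip size; only the size of the working set does, which is why the strip size can be tuned per target without touching the algorithm.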
Reviewing multicore implementation
You are probably already familiar with the Cell/B.E. architecture, so here is
only a brief refresher:
- Power Processing Element (PPE)
- Eight Synergistic Processing Elements (SPEs): VMX SIMD instruction set; DMA
engines; 256 KB local storage (LS)
- System memory
- Element interconnect bus (EIB): Over 200 GBps
Figure 7. The Cell/B.E.
architecture
To illustrate support for the Cell/B.E. platform, this application was ported to
the Sony PlayStation 3 (PS3). The Cell/B.E. on the Sony PS3 provides a
dual-threaded PPE core, as well as six enabled SPEs. The SPEs are very efficient vector
processors, but they have strict memory restrictions, including only 256 KB of local storage and
no cache. Programming a Cell/B.E. system by hand requires careful management and
planning of memory and data movement between the SPEs.
Gedae addresses the issues of memory management and data movement directly.
Automating these tasks simplifies development for a Cell/B.E.
system. After altering the application to use a USB-based ADC (the PS3 does not
have a PCI slot), the application can be easily moved. The process of optimizing
it for the multicore architecture should take about two hours.
To optimize the application, the compute-intensive signal-processing portion of
the application is partitioned for the six SPEs. The memory footprint of the
program and data is taken into account during this process using the Schedule
Parameters dialog to analyze the size of the threads that will be created for each
partition. To reduce the size of the memory footprint, the automated strip-mining
capability is used, allowing a set of audio vectors to be processed independently
on each SPE instead of simultaneously. Additionally, a primitive that performs a
column-wise sum of a matrix is identified as pushing the thread memory size over
the limit. To fix the issue, the primitive is replaced with one that integrates a
series of row vectors.
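The memory saving from that substitution is easy to see in a small sketch. Both functions below produce the same column totals, but the first needs the whole matrix in memory while the second needs only the current row and an accumulator. The function names are hypothetical and the code illustrates the idea, not the Gedae primitives themselves.

```python
def column_sum(matrix):
    """Column-wise sum that requires the complete matrix to be resident."""
    return [sum(column) for column in zip(*matrix)]

def integrate_rows(rows):
    """Accumulate row vectors as they arrive.

    `rows` can be any iterable, including a stream, so memory use is one
    row plus the accumulator regardless of how many rows are processed.
    """
    acc = None
    for row in rows:
        if acc is None:
            acc = list(row)
        else:
            for i, value in enumerate(row):
                acc[i] += value
    return acc if acc is not None else []
```

On a target with 256 KB of local storage shared by code and data, dropping the full-matrix buffer in this way can be the difference between a thread that fits and one that does not.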
The Gedae trace table is used to analyze the performance. During this process,
one primitive is identified as being slow and is recoded to use a unit
stride. Based on the processor load, the distribution of the work is also
altered. After two hours of optimization, the final implementation uses four SPEs
for the majority of the preprocessing (one SPE per audio channel), including the band
filtering of the frequency spectrum. The other two SPEs combine the
data in the correlation calculation of the spot formation. The PPE performs the
detection algorithm and interfaces with the I/O devices. With this implementation,
the application is able to process almost 15 frames per second on the Cell/B.E.
system, providing much smoother tracking of the train.
Conclusion
This article started out with a compute-intensive application that was ported
to these systems:
| System | Processors | Sensors | Output | UI |
|---|---|---|---|---|
| PC-based simulation | 1 | Datafile of 4 recorded channels | Constellation display | Rendered scene |
| Multiprocessor DSP board (MCS AdapDev) | 4 500MHz PowerPC AltiVec, 1 Pentium | ICS 610 ADC PCI board, 4 microphones | Directed Perception D46-17 Pan-Tilt Unit | Matrix Vision BlueFOX USB camera displayed using video for Gedae |
| Multicore system (PS3-Cell/B.E.) | 1 PPE, 6 SPEs | M-Audio Quattro USB device, 4 microphones | Directed Perception D46-17 Pan-Tilt Unit | Matrix Vision BlueFOX USB camera displayed using video for Gedae |
Here is the breakdown of the programmer time each implementation took and the frame rate it achieved:
- Simulation: 4 weeks of programmer time, yielding no change in performance
- DSP: 6 hours of programmer time, yielding a frame rate of 3 Hz
- PS3: 4 hours of programmer time, yielding a frame rate of almost 15 Hz
By following this example, you can reach these conclusions:
- Gedae helps you easily move the application to new hardware.
- Changes to the implementation are handled by automation and simple GUIs, not
changes to code.
- With only minimal effort, you can achieve relatively high performance gains.
Resources
- Get your own copy of
"Gedae Portability: From Simulation to DSPs to Cell/B.E.,"
which is the original whitepaper from which this article was adapted.
About the authors
A founding member of Gedae, Steed is the head of product development. Prior to joining Gedae, Steed worked with Gedae at Lockheed Martin, where he was primarily responsible for developing the embeddable library of functions, including testing and creating a database and search utility. Since helping to found Gedae, Steed has been responsible for new product development. His most prominent project is the development of Gedae's new RTL language. Steed earned a computer science degree from Cornell University and a master's in computer science from North Carolina State University.
A co-founder of Gedae, Inc., in 2001, William Lundgren is the President and CEO. Prior to founding Gedae, Lundgren started his professional career at Corning Glass Works as a product development physicist. After leaving Corning, Major Lundgren was an active member of the US Air Force Institute of Technology and the USAF Research Laboratories, where he developed new speech and audio-processing technologies. Lundgren moved to RCA Advanced Technology Laboratories (subsequently Lockheed Martin), where he spent 16 years leading the development of Gedae and acting as the program manager for eight different projects at ATL. He earned a BS degree in physics from Rensselaer, and he earned BS and MS degrees in electrical engineering from the USAF Institute of Technology. Lundgren is ABD for his PhD in electrical engineering from the University of Pennsylvania.
A founding member of Gedae, Inc., Barnes was a Principal Member of the Engineering staff at Lockheed Martin Advanced Technology Laboratories. At ATL, Barnes was responsible for signal-processing systems software and hardware, single-chip FFT design, design and implementation of a direct digital frequency synthesizer, mapping of algorithms to parallel hardware, OQPSK modulation and demodulation on the Thinking Machines CM-2, and development of various software tools and applications. Barnes earned a degree in electrical engineering from Lehigh University and a master's degree in computer and information science from the University of Pennsylvania.