This article provides an overview and analysis of key hardware performance features of the IBM BladeCenter QS21. Although there is extensive literature about the hardware performance features of a single Cell Broadband Engine (Cell/B.E.) processor, as well as on the performance of a multitude of applications ported to it (see Resources), very little has been published on the specific hardware performance features of the IBM BladeCenter QS21, which combines a coherent SMP node of two Cell/B.E. processors with an elaborate I/O subsystem.
This article closes the gap by providing the following for an IBM BladeCenter QS21 system:
- Basic latencies and throughputs for the following:
  - Transferring data within the processor complex consisting of two Cell/B.E. processors coupled in coherent SMP mode
  - Networking using the two Southbridges and either the Gigabit Ethernet (GbE) controller or the InfiniBand (IB) daughter card
- Throughputs or (relative) execution times for some key computational benchmark kernels, such as Linpack and SPEC CPU2000.
Based on an analysis of these data, this article provides tips on how to optimize application performance on such a system. It contains an architectural overview, sections on the processor complex and I/O subsystem performance, and a glossary of related terms.
Understanding the architecture
The IBM BladeCenter QS21 basically consists of two subsystems, as shown in Figure 1.
Figure 1. High-level view of the QS21 architecture

The architecture is comprised of:
- The processor complex with two Cell/B.E. processors (BE0 and BE1) including attached XDR DRAM. These processors are connected using a coherent BIF bus. Details about the architecture of the Cell/B.E. processor can be found in Resources.
- The IO subsystem with two Southbridges (including attached DDR2 IO buffers), high-speed connectors (the HSDC and the HS connector), an InfiniBand daughter card (IB DC), and a Gigabit Ethernet (GbE) controller. These components are connected to the processor complex and among each other using various bus systems, such as the IOIF, PCI-E, and PCI-X buses.
The following sections address in detail the performance features of each of the two subsystems of the QS21 separately, as well as the impact each subsystem has on the performance of the other one.
Examining the processor complex performance
The performance of the processor complex (the blue boxes at the top of Figure 1) is largely determined by the latencies and throughputs for the following:
- Transferring data between the components of the processor complex
- The operations of the computational pipelines at the SPE and PPE cores of the Cell/B.E. processors.
The latencies and throughputs of transferring data are measured using a set of benchmarks specifically developed for this purpose. The computational power of the cores of a QS21 is assessed using the Linpack benchmark and some SPEC CPU2000 benchmarks.
A single SPE accessing XDR DRAM
This section focuses on the latencies and throughputs of DMA get and put requests initiated by a single SPE from or to the XDR DRAM attached to the two Cell/B.E. processors. The discussion always refers to the basic hardware DMA load or store requests with a payload size of 128 bytes. DMA requests with larger payload sizes are automatically unrolled by hardware into these basic 128-byte requests.
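The unrolling of larger transfers into basic 128-byte requests can be pictured with a small model. The following Python sketch is only an illustration of the idea, not a description of how the hardware actually works: the function name `split_into_basic_requests` is hypothetical, and the model simply assumes that no basic request crosses a 128-byte boundary. It also shows why an unaligned buffer can cost an extra request.

```python
LINE = 128  # basic DMA payload size in bytes, as stated in the text


def split_into_basic_requests(address, size):
    """Split [address, address + size) into pieces that never cross a
    128-byte boundary. Simplified model, hypothetical helper name."""
    if size < 0:
        raise ValueError("size must be non-negative")
    pieces = []
    end = address + size
    while address < end:
        # Next 128-byte boundary strictly above the current address.
        boundary = (address // LINE + 1) * LINE
        chunk_end = min(boundary, end)
        pieces.append((address, chunk_end - address))
        address = chunk_end
    return pieces


aligned = split_into_basic_requests(0x1000, 512)    # starts on a boundary
unaligned = split_into_basic_requests(0x1040, 512)  # starts 64 bytes past one

print(len(aligned), "requests for the aligned transfer")
print(len(unaligned), "requests for the unaligned transfer")
```

In this model the aligned 512-byte transfer maps to four full 128-byte requests, while the same transfer shifted by 64 bytes touches five lines, two of them only partially.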
Benchmark scenarios and results
The benchmark scenarios are SPE-initiated DMA get or put requests from or to the XDR DRAM attached to either the local or the remote processor. Sample data paths for the requests are shown in Figure 2.
Figure 2. Sample request paths

Table 1 shows DMA get latencies and DMA get and put throughputs for an SPE at the Cell/B.E. processor BE0 (SPE@BE0) or BE1 (SPE@BE1) accessing the XDR DRAM attached to BE0 (XDR0) or BE1 (XDR1).
Table 1. Appropriate benchmark results
| Scenario | DMA get latencies (cycles) | DMA get throughput (GBps) | DMA put throughput (GBps) |
|---|---|---|---|
| SPE@BE0-XDR0 | 601 | 11.8 | 12.5 |
| SPE@BE0-XDR1 | 969 | 6.9 | 11.8 |
| SPE@BE1-XDR0 | 979 | 6.8 | 7.3 |
| SPE@BE1-XDR1 | 939 | 7.3 | 7.6 |
Store latencies are not reported, because measuring them only accounts for the time it takes to issue the request to the EIB bus, so they are not very meaningful. All data accesses were aligned to 128 bytes and stayed within a single page, so no TLB misses occurred.
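To relate the latency and throughput columns of Table 1 to each other, the cycle counts can be converted to time and combined with the throughputs using Little's law (data in flight = throughput x latency). The sketch below is a rough estimate under two assumptions that are not stated in the article: a core clock of 3.2 GHz, and 1 GBps = 10^9 bytes per second.

```python
CLOCK_HZ = 3.2e9   # assumption: 3.2 GHz clock, not stated in the article
LINE_BYTES = 128   # basic DMA payload size from the text

# Scenario -> (DMA get latency in cycles, DMA get throughput in GBps), Table 1
TABLE1 = {
    "SPE@BE0-XDR0": (601, 11.8),
    "SPE@BE0-XDR1": (969, 6.9),
    "SPE@BE1-XDR0": (979, 6.8),
    "SPE@BE1-XDR1": (939, 7.3),
}


def latency_ns(cycles):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / CLOCK_HZ * 1e9


def requests_in_flight(cycles, gbps):
    """Little's law: bytes in flight = throughput x latency,
    expressed as a number of 128-byte requests."""
    bytes_in_flight = gbps * 1e9 * (cycles / CLOCK_HZ)
    return bytes_in_flight / LINE_BYTES


for scenario, (cycles, gbps) in TABLE1.items():
    print(f"{scenario}: {latency_ns(cycles):6.1f} ns, "
          f"~{requests_in_flight(cycles, gbps):.1f} requests in flight")
```

Under these assumptions the estimate comes out at roughly 16 to 17 outstanding 128-byte requests in all four rows, which suggests that the get throughput of a single SPE is bounded by a roughly constant number of requests in flight divided by the latency.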
To understand the latency and throughput benchmark results presented in Table 1, it is important to look at a breakdown of a DMA get request that an SPE generates into a sequence of elementary operations:
- The DMA get request is created at the SPE.
- The request is moved through a hierarchy of so-called address concentrators from the SPE to the root address concentrator AC0, which is only enabled on the Cell/B.E. processor BE0. (This asymmetry results in two additional crossings, or hops, of the BIF bus for requests generated at BE1, which impacts BE1 latencies and throughputs.)
- AC0 reflects this request to all units attached to the EIBs on both Cell/B.E. processors for snooping. At this point, AC0 might additionally trigger a speculative execution of the load request to be executed concurrently to the ongoing reflection.
- All units send back a response to the root node of the hierarchy of the snoop response combiners SRC0, which is also only enabled on the Cell/B.E. processor BE0.
- SRC0 processes all the responses and sends out a combined response to all units attached to the EIBs on both Cell/B.E. processors using a hierarchy of snoop response combiners. This combined response contains the request to be executed, which is the request to retrieve data from XDR DRAM by the appropriate MIC.
- The request is executed. In other words, the appropriate MIC retrieves the data from the XDR DRAM and sends it to the requesting SPE along the data path outlined in Figure 2 (using the EIB bus and BIF bus if the SPE is located on the other Cell/B.E. processor).
To summarize, a DMA get request generates additional requests required to implement the cache coherency protocol, namely requests to snoop the caches of the other processing cores on the same and on the remote Cell/B.E. processor. These additional requests increase the latency of the original DMA get request. The increase can be especially significant when BIF bus crossings lie on its critical path.
Analysis of benchmark results
The scenario with the least latency and highest throughput is the one with an SPE at BE0 accessing data at XDR0, its local memory (see the first row in Table 1). This is because this scenario involves the fewest BIF hops, namely only two. Furthermore, speculatively retrieving data from XDR0 can also improve performance here. Most of the latency is spent retrieving the data from the XDR DRAM and executing the cache coherency protocol.
All other scenarios of Table 1 involve a component located at BE1 (either an SPE or XDR DRAM) and have significantly higher latencies and lower throughputs, caused by two additional BIF hops on the critical execution path, either for data or for requests associated with the cache coherency protocol. Note that this is also the case for an SPE at BE1 accessing its local XDR DRAM attached to BE1.
The difference in DMA put throughputs between scenarios with an SPE at BE0 initiating the put requests and scenarios with an SPE at BE1 initiating the put requests is caused by an asymmetric treatment of so-called retry requests, which in turn results from the concentrators AC0 and SRC0 being enabled only on BE0. Details are beyond the scope of this article.
Some tips
- Because of the distinctive NUMA behavior of the processor complex, ensure locality of data accesses whenever possible by using the Linux® numactl facility to bind each process to the processor next to its data.
- Align all data accesses to 128 bytes.
- Use the DMA engines of the SPEs for concurrently transferring data between the local stores of the SPEs and the XDR DRAM and executing calculations at the SPEs. This can be implemented using multi-buffering schemes.
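The multi-buffering tip can be sketched in its simplest form, double buffering: while the current block is being processed, the transfer of the next block is already under way. The Python code below only illustrates the control flow. The functions `fetch` and `compute` are hypothetical stand-ins for a DMA get and an SPE kernel, and a worker thread plays the role of the DMA engine. Real SPE code would be written against the DMA facilities of the SDK instead.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(chunk_id):
    """Hypothetical stand-in for a DMA get of one data block."""
    return [chunk_id * 4 + i for i in range(4)]


def compute(block):
    """Hypothetical stand-in for the computational kernel."""
    return sum(x * x for x in block)


def process_double_buffered(num_chunks):
    """Process all chunks while prefetching the next one in the background."""
    results = []
    if num_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, 0)            # fill the first buffer
        for chunk_id in range(num_chunks):
            block = pending.result()              # wait for the current buffer
            if chunk_id + 1 < num_chunks:
                # Start the next transfer before computing on this block.
                pending = dma.submit(fetch, chunk_id + 1)
            results.append(compute(block))        # overlaps with the transfer
    return results


print(process_double_buffered(3))
```

The key design point is the order inside the loop: the next transfer is issued before the computation on the current block starts, so transfer time is hidden behind compute time whenever the kernel takes at least as long as the transfer.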
Several SPEs concurrently accessing XDR DRAM
Now look at the latencies and throughputs of DMA
get and put requests
concurrently initiated by several SPEs from or to the XDR DRAM attached to the two
Cell/B.E. processors.
Benchmark scenarios and results
Figure 3 shows sample data paths for four SPEs concurrently accessing their local memory.
Figure 3. Sample data paths

Figure 4 shows the benchmark results.
Figure 4. Benchmark results

Analysis of benchmark results
The graphs representing the throughput as a function of the number of the SPEs posting requests have the typical knee as expected. The throughput grows linearly as a function of the number of requests (SPEs) for a small number of SPEs. The throughput becomes constant for a large number of requests (SPEs), with this constant throughput being determined by the service center with the highest latency, namely the MIC. In the example, it takes approximately three SPEs at BE0 concurrently posting requests against their local XDR DRAM to max out the available bandwidth. In the case of BE1, it takes approximately four SPEs due to the increased latency of local memory requests that an SPE at BE1 initiates.
The maximum aggregate memory throughput of approximately 50 GBps is very close to the maximum possible throughput of 51.2 GBps (about 98 percent of it).
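The knee described above can be approximated with a very simple saturation model: throughput grows linearly with the number of SPEs until it hits the memory limit. The sketch below uses the single-SPE DMA get throughputs for local memory from Table 1 and assumes, as a simplification not stated in the article, that the 51.2 GBps peak is split evenly between the two XDR memories. The function names are hypothetical.

```python
import math

PEAK_PER_PROCESSOR_GBPS = 51.2 / 2  # assumption: even split of the 51.2 GBps peak


def aggregate_throughput(num_spes, per_spe_gbps, peak_gbps=PEAK_PER_PROCESSOR_GBPS):
    """Linear growth with the number of SPEs, capped by the memory peak."""
    return min(num_spes * per_spe_gbps, peak_gbps)


def spes_to_saturate(per_spe_gbps, peak_gbps=PEAK_PER_PROCESSOR_GBPS):
    """Smallest number of SPEs whose combined demand reaches the peak."""
    return math.ceil(peak_gbps / per_spe_gbps)


# Single-SPE DMA get throughputs for local memory, taken from Table 1.
LOCAL_GET_GBPS = {"BE0": 11.8, "BE1": 7.3}

for processor, per_spe in LOCAL_GET_GBPS.items():
    print(processor, "saturates with about", spes_to_saturate(per_spe), "SPEs")
```

With these inputs the model predicts saturation at about three SPEs on BE0 and about four on BE1, in line with the numbers quoted in the analysis. It ignores contention effects, so it should be read as a back-of-the-envelope estimate only.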
A tip
- Avoid scenarios with all SPEs concurrently accessing XDR DRAM, because memory bandwidth might be maxed out already by four SPEs concurrently posting DMA get or put requests.
A single SPE accessing the local store of another SPE
This section focuses on the latencies and throughputs of DMA
get and put requests initiated
by a single SPE from or to the local store of another SPE of its own or the other
Cell/B.E. processor.
Benchmark scenarios and results
Sample data paths for this scenario are presented in Figure 5.
Figure 5. Sample data paths for SPEs initiating DMA requests from/to the LS of another SPE

The appropriate benchmark results are presented in Table 2.
Table 2. Appropriate benchmark results
| Scenario | DMA get latencies (cycles) | DMA get throughput (GBps) | DMA put throughput (GBps) |
|---|---|---|---|
| SPE@BE0-SPE@BE0 | 235 | 22.1 | 22.3 |
| SPE@BE0-SPE@BE1 | 954 | 5.9 | 5.9 |
| SPE@BE1-SPE@BE0 | 969 | 4.1 | 3.7 |
| SPE@BE1-SPE@BE1 | 234 | 23 | 22.8 |
Table 2 shows the DMA get latencies and the DMA get and put throughputs for an SPE at the Cell/B.E. processor BE0 (SPE@BE0) or BE1 (SPE@BE1) accessing the local store of another SPE at BE0 or BE1. Store latencies are not reported, because measuring them only accounts for the time it takes to issue the request to the EIB bus, so they are not very meaningful.
Analysis of benchmark results
The scenarios with the least latencies and highest throughputs are the ones with an SPE accessing data in the local store (LS) of an SPE at the same BE (see the first and last rows in Table 2). Compared with the benchmark results of Table 1 related to an SPE accessing XDR DRAM, there are some remarkable differences. The latencies for accessing data in the LS of an SPE at the same BE are:
- Significantly smaller than the latencies for accessing local XDR DRAM
- Identical for SPEs at BE0 and SPEs at BE1 (up to statistical measurement errors < 5 percent)
This is the case because for benchmark scenarios involving DMA operations only between SPEs at the same BE, the global coherency protocol is deactivated and therefore there are no (expensive) BIF hops involved.
All other scenarios of Table 2 (rows 2 and 3) involve BIF hops, and the latencies are similar to the latencies of the analogous scenarios of Table 1. The decrease of throughputs in these scenarios is caused by so-called retry requests and their asymmetric handling caused by the concentrators AC0 and SRC0 being enabled on only BE0 (among other things). Details about this are beyond the scope of this article.
A tip
- Minimize communication between remote SPEs. Communicate between the two Cell/B.E. processors as you would in a cluster, if feasible.
A single PPE accessing XDR DRAM
This section focuses on the latencies and throughputs of load and store requests the PPE initiates from or to the XDR DRAM attached to the two Cell/B.E. processors.
Benchmark scenarios and results
Sample data paths for the requests under consideration in this section are shown in Figure 6.
Figure 6. Sample data paths

The appropriate benchmark results are presented in Table 3.
Table 3. Appropriate benchmark results
| Scenario | Load latencies (cycles) | Load throughput (GBps) |
|---|---|---|
| PPE@BE0-XDR0 | 561 | 4.2 |
| PPE@BE0-XDR1 | 929 | 2.5 |
| PPE@BE1-XDR0 | 947 | 2.5 |
| PPE@BE1-XDR1 | 899 | 2.7 |
Table 3 shows PPE load latencies and throughputs for accessing the local or the remote XDR DRAM. Store latencies are not reported, because measuring them only accounts for the time it takes to issue the request to the EIB bus, so they are not very meaningful. The store throughputs are comparable to the load throughputs in most cases.
Analysis of benchmark results
The scenario with the least latency and highest throughput is the one with the PPE at BE0 accessing data at XDR0, at its local memory. In this case, there are the fewest BIF hops involved, namely only two. Furthermore, speculatively retrieving data from XDR0 can also improve performance in this scenario. Most of the latency is spent here for retrieving the data from the XDR DRAM and for executing the cache coherency protocol.
All other scenarios involve a component located at BE1 (either the PPE or XDR DRAM). These scenarios have significantly increased latencies and decreased throughputs, caused by two additional BIF hops on the critical execution path, either for data or for command requests associated with the cache coherency protocol. Note that this is also the case for the PPE at BE1 accessing its local XDR DRAM attached to BE1.
Some tips
- Because of the distinctive NUMA behavior of the processor complex, ensure locality of data accesses whenever possible by using the Linux numactl facility to bind each process to the processor next to its data.
- To achieve the best possible throughput, ensure that as many load or store requests as possible (up to 6) are concurrently in flight between the L2 cache and the XDR DRAM by using dcbt or dcbz instructions, either directly in assembler or by way of an optimizing compiler.
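The benefit of keeping up to six requests in flight can be estimated with Little's law: the achievable throughput is at most the data in flight divided by the latency. The sketch below is a rough check against Table 3 under assumptions that are not stated in the article: a 3.2 GHz clock, 128 bytes transferred per request, and 1 GBps = 10^9 bytes per second.

```python
CLOCK_HZ = 3.2e9    # assumption: 3.2 GHz clock, not stated in the article
LINE_BYTES = 128    # assumption: each request moves one 128-byte line
MAX_IN_FLIGHT = 6   # upper limit quoted in the tip above


def bound_gbps(latency_cycles, in_flight=MAX_IN_FLIGHT):
    """Upper bound on throughput: (requests in flight x bytes) / latency."""
    latency_s = latency_cycles / CLOCK_HZ
    return in_flight * LINE_BYTES / latency_s / 1e9


# Scenario -> (load latency in cycles, measured load throughput in GBps), Table 3
TABLE3 = {
    "PPE@BE0-XDR0": (561, 4.2),
    "PPE@BE0-XDR1": (929, 2.5),
    "PPE@BE1-XDR0": (947, 2.5),
    "PPE@BE1-XDR1": (899, 2.7),
}

for scenario, (cycles, measured) in TABLE3.items():
    print(f"{scenario}: bound {bound_gbps(cycles):.2f} GBps, measured {measured} GBps")
```

Under these assumptions the bound with six requests in flight lies within a few percent of every measured value in Table 3, while a single outstanding request would reach only one sixth of it. This is why prefetching with several concurrent requests matters so much for PPE throughput.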
Now look at such computational kernels as Linpack and SPEC CPU2000.
Linpack
The Linpack benchmark measures the throughput (in GFLOPS) achieved by solving a random dense system of linear equations and is used as a basis for assessing the floating-point performance of various computer systems. (For more, see Resources.)
Benchmark results
Using an IBM internal single-precision version of the Linpack benchmark (the double-precision results will be provided in the future for the IBM BladeCenter QS22), the measurement yielded a throughput of 342 GFLOPS (out of a maximum of 409.6 GFLOPS) on an IBM BladeCenter QS21 using 16 SPEs.
Analysis of benchmark results
This rather outstanding throughput was achieved by using a highly optimized matrix multiplication for the SPEs and by minimizing the traffic over the BIF bus. It demonstrates that throughputs close to the theoretical peak performance can well be achieved on an IBM BladeCenter QS21.
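The Linpack figures quoted above can be put into perspective with a short calculation of the per-SPE peak and the achieved efficiency. The decomposition of the per-SPE peak into clock rate, SIMD width, and fused multiply-add is an assumption added here for illustration. Only the 342 GFLOPS, 409.6 GFLOPS, and 16 SPEs come from the article.

```python
NUM_SPES = 16
PEAK_GFLOPS = 409.6      # theoretical single-precision peak from the text
MEASURED_GFLOPS = 342.0  # measured Linpack throughput from the text

per_spe_peak = PEAK_GFLOPS / NUM_SPES
efficiency = MEASURED_GFLOPS / PEAK_GFLOPS

# Assumed decomposition of the per-SPE peak (not stated in the article):
# 3.2 GHz x 4 single-precision SIMD lanes x 2 flops per fused multiply-add.
assumed_per_spe_peak = 3.2 * 4 * 2

print(f"Peak per SPE:        {per_spe_peak:.1f} GFLOPS")
print(f"Assumed composition: {assumed_per_spe_peak:.1f} GFLOPS")
print(f"Linpack efficiency:  {efficiency:.1%}")
```

The measured result therefore corresponds to roughly 83 to 84 percent of the theoretical peak.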
SPEC CPU2000
SPEC CPU2000 is an industry-standard CPU-intensive benchmark suite that runs on the PPE only. It does not make use of the SPEs, and therefore this benchmark suite is not well-suited to assess the benefits of the architecture of the Cell/B.E. processor.
Benchmark results
- SPECint2000 base: 423 (using the gcc 4.1.1 for the PPE)
- SPECfp2000 base: 387
Analysis of benchmark results
The SPECint2000 base results can be explained by the in-order, non-superscalar nature of the PPE architecture.
Some tips
Take as much load as possible from the PPE by:
- Carefully optimizing performance critical PPE code
- Offloading work to the SPEs
- Using the IBM BladeCenter QS21 as an accelerator for another system with a more advanced general purpose processor core either using InfiniBand (see Examining networking performance) or in the form of a Triblade (see Roadrunner in Resources).
Examining networking performance
Now look at the networking performance between two IBM BladeCenter QS21 blades within a common BladeCenter using the appropriate switches for IB or GbE.
The performance of the I/O subsystem of a BladeCenter QS21 (the green boxes in the blue ellipse in Figure 7) is largely determined by the latencies and throughputs for transferring data between the midplane of the BladeCenter and the processor complex using the Southbridges and either the IB daughter card or the GbE controller. Keep in mind that networking requests can also consume resources of the processor complex, especially resources of the two PPEs.
Figure 7. QS21 architecture

Now look at the networking performance of the BladeCenter QS21 using key networking benchmarks.
This section looks at the InfiniBand (IB) latencies and throughputs at the remote direct memory access (rDMA) level, as well as at the MPI level using the Open MPI implementation of the Open Fabrics Enterprise Distribution (OFED-1.2.5). (The Wikipedia reference on the message-passing interface and the Linux open fabrics wiki in the Resources section provide more details on MPI and on the OFED software stack.)
Benchmark scenarios and results
Figure 8 shows sample data paths for transferring data to and from the network and to and from the XDR DRAM attached to the BE0 processor of the IBM BladeCenter QS21 processor complex using the BE0, BIF bus, BE1, Southbridge1, and the IB daughter card.
Figure 8. Sample data paths

The DMA engine on the IB daughter card initiates the data transfer requests and bypasses the processing cores of the Cell/B.E. processors. A Cell/B.E. processor only performs the setup and control of these requests at the DMA engine of the IB daughter card.
The associated benchmark results are presented in Table 4. These are minimum latencies and maximum throughputs for communication between two IBM BladeCenter QS21 blades using IB. The parameter n1/2 stands for the payload size (in bytes) required to achieve 50 percent of the maximum available bandwidth.
Table 4. Associated benchmark results
| Scenario | Minimum latencies (microseconds) | Maximum throughput (MiB per second*) | n1/2 (bytes) |
|---|---|---|---|
| rDMA | 6.32 | 916 | ~870 |
| MPI PingPong | 8.34 | 726 | ~37,000 |
| MPI Streaming | N/A | 916 | ~12,200 |
*1MiB = 2^20B, 1MB = 10^6B, 1MiB ~ 1.05MB
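The parameter n1/2 can be read off a measured throughput curve by finding the payload size at which the throughput first reaches half of the maximum. The following sketch does this by linear interpolation between neighboring measurement points. The function name and the sample curve are hypothetical and serve only as an illustration. They are not the data behind Table 4.

```python
def half_bandwidth_size(samples, max_throughput=None):
    """Estimate n1/2 from (payload_bytes, throughput) pairs sorted by size.

    Returns the interpolated payload size at which the throughput reaches
    50 percent of the maximum, or None if the curve never crosses it.
    """
    if max_throughput is None:
        max_throughput = max(t for _, t in samples)
    target = 0.5 * max_throughput
    for (n0, t0), (n1, t1) in zip(samples, samples[1:]):
        if t0 < target <= t1:
            # Linear interpolation between the two bracketing measurements.
            return n0 + (target - t0) * (n1 - n0) / (t1 - t0)
    return None


# Hypothetical throughput curve in MiB per second (illustration only).
curve = [(256, 200.0), (512, 350.0), (1024, 500.0), (4096, 800.0), (65536, 916.0)]

n_half = half_bandwidth_size(curve)
print(f"n1/2 is approximately {n_half:.0f} bytes")
```

A small n1/2 means that the interface already delivers a large share of its bandwidth for short messages, which is why the parameter is a convenient single-number summary next to minimum latency and maximum throughput.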
Analysis of benchmark results
The measured throughputs at the rDMA level as well as for the MPI benchmarks are quite close to the highest possible bandwidth of the 4x SDR IB daughter card. Because the latency is larger at the MPI level than at the rDMA level, the MPI PingPong benchmark needs significantly more load (generated by using larger payload sizes) to approach the highest possible throughput. This explains the increase in the parameter n1/2, the payload size (in bytes) required to achieve 50 percent of the maximum available bandwidth, for the MPI benchmarks.
Some tips
Using a low-level IB protocol such as rDMA provides you with the following advantages:
- Low latencies
- Maximum throughput (when using large data packets > 400KB)
- Low utilization of the PPEs
Therefore, using a low-level protocol for IB (such as rDMA) is the best choice for doing high-performance networking with the IBM BladeCenter QS21.
Now look at Gigabit Ethernet (GbE) latencies and throughputs as measured using the standard NetPipe and Netperf benchmarks.
Benchmark scenarios and results
Figure 9 shows sample data paths for transferring data to and from the network and to and from the XDR DRAM attached to the BE0 processor of the BladeCenter QS21 processor complex using the BE0, Southbridge0, and the GbE controller.
Figure 9. Sample data paths

The DMA engine on the GbE controller initiates the data transfer requests and bypasses the processing cores of the Cell/B.E. processors. A Cell/B.E. processor only performs the setup and control of these requests at the DMA engine of the GbE controller.
The associated benchmark results are presented in Table 5.
Table 5. Associated benchmark results
| Scenario | Minimum latencies (microseconds) | Maximum throughput (MiB per second*) | Approx. n1/2 message size (bytes) |
|---|---|---|---|
| NetPipe PingPong | 45.2 | 118 | ~25,000 |
| Netperf (2 ports) | N/A | 171 | |
*1MiB = 2^20B, 1MB = 10^6B, 1MiB ~ 1.05MB
Analysis of benchmark results
The measured throughput for the NetPipe PingPong benchmark is close to the maximum possible bandwidth of 1 Gbps. The latency is impacted by the TCP/IP protocol overhead, and it is significantly larger than the one measured for IB at the rDMA level.
The use of two ports results in a significantly improved throughput, but it is ultimately limited by a high PPE utilization.
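To compare the throughputs in Table 5 (given in MiB per second) with the nominal line rate (given in Gbps), the units have to be converted using the footnote of the table. The sketch below performs this conversion. The nominal rate of 2 Gbps for the two-port case is an assumption (two ports at 1 Gbps each) and is not stated in the article.

```python
MIB = 2 ** 20  # bytes per MiB, as defined in the table footnote


def mib_per_s_to_gbps(mib_per_s):
    """Convert MiB per second to gigabits per second (10^9 bits per second)."""
    return mib_per_s * MIB * 8 / 1e9


ping_pong_gbps = mib_per_s_to_gbps(118)  # NetPipe PingPong, one port
two_port_gbps = mib_per_s_to_gbps(171)   # Netperf, two ports

one_port_share = ping_pong_gbps / 1.0    # nominal 1 Gbps from the text
two_port_share = two_port_gbps / 2.0     # assumption: 2 x 1 Gbps nominal

print(f"PingPong:  {ping_pong_gbps:.3f} Gbps ({one_port_share:.0%} of one port)")
print(f"Two ports: {two_port_gbps:.3f} Gbps ({two_port_share:.0%} of two ports)")
```

Under this assumption a single port runs at about 99 percent of its nominal rate, while the two-port result reaches only about 72 percent of the combined nominal rate, which is consistent with the PPE utilization limit mentioned above.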
Some tips
To achieve maximum throughput when using GbE for networking:
- Use large data packets (> 1MB).
- Use a large MTU size (such as 9,000 bytes).
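The effect of the MTU size on the achievable payload throughput can be estimated from the fixed per-frame overhead. The following sketch assumes standard Ethernet framing overhead (preamble, header, frame check sequence, and inter-frame gap) and IPv4 and TCP headers without options. These overhead values are general networking figures added for illustration and are not taken from the article.

```python
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # preamble/SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20             # IPv4 and TCP headers without options


def payload_efficiency(mtu):
    """Fraction of the line rate available for TCP payload at a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.1%} of the line rate is payload")
```

In this model the payload share rises from about 95 percent at an MTU of 1,500 bytes to about 99 percent at 9,000 bytes. A larger MTU also reduces the number of frames per transferred byte, which lowers the per-frame processing load.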
Still, the use of the GbE interface for networking should be restricted to tasks with modest networking performance requirements (such as many administrative tasks). For tasks with high bandwidth and low latency requirements, the IB interface should be used.
This article provided an in-depth introduction to the performance features of the IBM BladeCenter QS21, focusing on the performance of transferring data within the processor complex, as well as between two BladeCenter QS21 systems using the available networking interfaces. It also offered insights into the performance of important computational kernels, along with tips on how to get the best performance out of the BladeCenter QS21.
The author would like to thank Daniel Hackenberg (TU Dresden) for optimizing the Linpack benchmark for a BladeCenter QS21; Thomas Staudt and Dominik Klein (IBM Boeblingen) for providing their SPEC CPU2000 benchmark results; Dave Krolak, Abe Ouda, and Scott Clark (IBM Rochester) for explaining the design details of the Cell/B.E. bus subsystem; the firmware and infrastructure teams at IBM Boeblingen for their enduring support of the performance efforts, especially during the early phases of development; and finally management for their enduring support of the performance team.
Learn
- Refer to Analyzing Computer Systems Performance with Perl: PDQ by Neil Gunther (SpringerVerlag, 2005, pp. 92-93) for an explanation of the
typical knee in
the graphs represented in the section "Several SPEs concurrently accessing XDR DRAM".
- Read
"The LINPACK Benchmark: Past, Present, and Future"
to clear up any confusion and mystery surrounding the LINPACK benchmark and some
of its variations.
- Look at the scheduling brochure
Cell/B.E.-Opteron hybrid supercomputer at LANL, Roadrunner.
- Explore the
Linux openfabrics.org Wiki
for information about the Open MPI implementation of the Open Fabrics Enterprise
Distribution (OFED).
- Check out
NetPipe (a protocol independent
performance tool that visually represents the network performance under a variety
of conditions) and Netperf (a
benchmark that can be used to measure the performance of many different types of
networking, including testing for both unidirectional throughput and end-to-end
latency).
- Find out more about best practices for Cell/B.E.
development in the IBM Redbook™ draft
"Programming the Cell Broadband Engine Examples and Best Practices"
(IBM Redbooks, February 2008).
- Get answers to Cell/B.E. SDK 3.0 installation
questions in the original installation document,
"Installation Guide for the SDK for Multicore Acceleration v3.0."
- Read
"Introduction to the Cell Multiprocessor"
(IBM Journal of Research and Development, 2005) for an introductory
overview of the Cell/B.E. multiprocessor's history, the program objectives and
challenges, the design concept, the architecture and programming models, and the
implementation. Also of interest from early Cell/B.E. Architecture efforts is
"Cell Broadband Engine Architecture and its first implementation"
(developerWorks, November 2005).
- Find the
"Cell/B.E.
SDK 3.0 tools: Using performance tools" tutorial series
(developerWorks, April 2008) for a tour of six performance tools for use
with the Cell/B.E. SDK 3.0 and for Cell/B.E. system performance best
practices.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- The
Cell Broadband Engine/Power Architecture notebook
is a blog-based resource that hosts
news,
as well as two instructional features -- the
"Forum watch"
of interesting questions and hot topics from the forum and the
"Infobomb"
series (short, precise, task-specific, quick-read knowledge bombs gleaned from
Cell/B.E. documentation).
Get products and technologies
- SPEC CPU2000 is an industry-standardized CPU-intensive benchmark suite designed to provide a comparative measure of compute-intensive performance across the widest practical range of hardware. The benchmarks are provided as source code developed from real user applications, and they measure the performance of the processor, memory, and compiler on the tested system.
- Check out the
alphaWorks
Interactive Ray Tracer for Cell
Broadband Engine, which is
a proof-of-technology visual demonstration of the graphics power of the Cell
Broadband Engine for realistic real-time animation on the Playstation 3 or QS21
platforms.
- Get your copy of the
IBM SDK for Multicore Acceleration 3.0
or browse through the library of Cell/B.E. documentation.
- Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things Cell/B.E.
- Contact IBM about custom
Cell/B.E.-based or custom-processor based solutions.
Discuss
- Participate in the discussion forum.
- Check out the Cell Broadband
Engine Architecture forum to get your technical questions about the processor answered.
Juicy problems and answers from the forums are rounded up periodically and highlighted
in the "Forum watch" blog series.
- Go to the Power Architecture blog for news, downloads,
instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find
the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read
technology introductions.
Dr. Peter Altevogt is a performance architect in the IBM Systems and Technology Group at the IBM Laboratory Boeblingen (Germany). He built the performance team for the IBM Blade computer using the Cell/B.E. processor. His other responsibilities include performance analysis and modeling of future IBM processors and systems. Dr. Altevogt holds degrees in Mathematics and Physics from the University of Heidelberg, and he holds a doctorate in theoretical physics from the University of Karlsruhe. He joined the IBM Scientific Center in Heidelberg in 1991, and he moved to the IBM Laboratory Boeblingen in 1998.
Hans Boettiger works in IBM Systems and Technology Group at the IBM Germany Development Lab. He joined IBM in 1973. He has held various technical leadership positions in software, operating systems, and hardware development for mainframes, as well as in performance analysis for BI systems, compilers, and blade computers. He currently works as a performance architect on next generation systems.
Tibor Kiss is a performance engineer at the IBM Laboratory Boeblingen (Germany). Since 2005, he has been a member of the IBM Systems and Technology Group performance team, responsible for the performance of the IBM Blades using the Cell/B.E. processor. He holds a Bachelor of Science degree in Computer Engineering. His interests include performance analysis and modeling.
Zvonko Krnjajic is a software engineer at the IBM Laboratory Boeblingen (Germany) working on performance analysis of Cell/B.E.-based blades. His other interests include graphics on the Cell/B.E processor (he did his diploma thesis on implementing graphics algorithms on the Cell/B.E. processor at the IBM Laboratory Boeblingen). He holds a Bachelor's degree from the University of Esslingen, and he is currently working on his master's thesis in the area of Distributed Systems Engineering with a focus on general purpose computing on GPUs. He is also interested in High Performance Computing and Cryptography.





