Complex technologies often require complex tools for maximum performance. The Cell BE processor is a "heterogeneous" processor in that it has a PowerPC® core (also known as the Power Processor Element, PPE) and eight Synergistic Processing Elements (SPEs) that have a completely different instruction set. To get the maximum performance out of the Cell BE processor, you must put a great deal of forethought into modeling the problem and into implementing the solution.
We recently finished working on a series of tutorials about the XL C Alpha download (see Resources), and were lucky enough to catch up with the XL C compiler to ask some of the questions that came up along the way. Readers should view this interview as a companion to the tutorials.
developerWorks: Before we start talking about optimizing code, can I ask you to introduce yourself and just say a few words about what you do?
<standard input>, line 1.1: 1506-166 (S) Definition of function Before requires parentheses.
<standard input>, line 1.8: 1506-046 (S) Syntax error.
setenv LOCALE en_US
Sorry about that. Before we start talking about optimizing code, can I ask you to introduce yourself and just say a few words about what you do?
ppuxlc: Certainly. I'm the XL C download, sort of a demo or trial version, but I come from a large family of compilers, built from a single source tree. Compiler researchers are adding new features, while the commercial releases are focused more on stability. We primarily target Power Architecture™ chips, such as the POWER6™, or the PowerPC 970MP, but I also target the SPE units of the Cell Broadband Engine processor.
dW: Do you only run on Power Architecture systems, then?
ppuxlc: No, in fact, I run on x86 or x86_64 Linux® systems. Other members of the family are available for a range of architectures, and we're quite portable. We even have a few Fortran compilers, although some might say they're the black sheep of the family. <laughs> There are basically two branches to the family tree: the commercial branch is more stable and polished, the research branch is a little more experimental. Features that work out well in testing get moved into commercial releases over time.
dW: What's the relationship between you and the research compiler?
ppuxlc: My uncle, the Octopiler, is a little eccentric at times, being a research project. I have prototype extensions to support the pipelines of the PPE and SPE. The automatic SIMDization support for the SPE is a new feature, but it's based on the fairly mature VMX support in the production compiler.
dW: What do you mean, automatic SIMDization?
ppuxlc: Auto-SIMDization is automatic conversion of code written in terms of scalars and loops into instructions that run natively on SIMD hardware like the SPEs on Cell BE or the VMX registers on the PowerPC (including the PPE of Cell BE). Auto-SIMDization helps to parallelize operations within a single SPE, or on the PPE, but you will still need manual intervention to partition code and data, and to exploit the parallelism of the eight SPEs.
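As a rough illustration of what "scalars and loops" versus SIMD-shaped code means, here is a minimal sketch in portable C. The first function is the kind of loop a programmer would normally write. The second is a hand-written picture of the four-at-a-time shape that SIMDization aims for, since four single-precision floats fit in one 128-bit register. The names `add_scalar` and `add_blocked4` are hypothetical, and this is not the compiler's actual output: real SIMD code would use vector registers and vector instructions rather than four scalar statements.

```c
#include <stddef.h>

/* Scalar form: one element per loop iteration. */
void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Blocked form: four elements per iteration, mirroring the four floats
 * that fit in one 128-bit SIMD register. A scalar loop at the end
 * handles the leftover elements when n is not a multiple of four. */
void add_blocked4(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)          /* remainder loop */
        dst[i] = a[i] + b[i];
}
```

Both functions compute the same result; the point of automatic SIMDization is that the programmer writes only the first form.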
dW: So I need to do some hand-tuning work to exploit the Cell BE fully?
ppuxlc: Yes. The current combination of the XL PPE and SPE compilers allows you to exploit the features of the Cell BE, but still requires a high degree of manual intervention, as described in Part 4 of the five-part tutorial series, Introduction to compiling for the Cell Broadband Engine architecture.
dW: What's this we hear about a tutorial series?
ppuxlc: Well, I guess there's a tutorial series about me. I'm a little embarrassed about all the attention; I'm mostly a very output-oriented worker, more comfortable in labs and development environments than with media exposure. The tutorials are based on slides from the PACT 2005 presentations by the compiler team and are primarily about targeting the Cell Broadband Engine, and I guess they go into some of the technology used in the research branch of the family.
dW: What topics does the tutorial series cover?
ppuxlc: There's an overview, then there are more detailed pieces on parallelizing code, converting it automatically to run on SIMD units, partitioning it across memory spaces, and doing automatic memory management. It's pretty thorough.
dW: So, the tutorial describes the research compiler?
ppuxlc: Mostly, yes. The solutions discussed in the first three tutorials are implemented in Uncle Octopiler, but the techniques are straightforward enough that even hand-rolled code will benefit. The first three tutorials are also great primers for understanding the fundamentals of Cell BE architecture.
dW: You talk about "the first three". What about the others?
ppuxlc: The last two tutorials deal with more esoteric aspects, such as cloning stacks for parallelization and software cache memory management for accessing irregular data. This is more specific to a particular implementation in many cases. So, unlike the first three, the last two describe work that is currently "under construction," and give insight into how far a completely automatic compiler can go and the problems encountered getting there, and -- provided all goes to plan -- a glimpse of XL's immediate future. The work described there is prototype work in the Research compiler to have the compiler manage resource code and data transfers. It's under active development, and will be merged into production-ready code once it is indeed production-ready.
dW: Would it be correct to say that the compiler team's goal, as presented in the tutorial series, is to achieve a degree of compiler assistance to where it is not necessary for a developer to differentiate between PPE code and SPE code? You want the developer to just be able to type code and have it optimized automatically, is that correct?
ppuxlc: Yes and no... I am targeted at programmers with a wide range of expertise, and I want to support all of them. People that really want to squeeze the last bits of performance out of a chip, such as game developers, will need to hand-tune code; no compiler will be able to do this automatically.
dW: How do you target a broad range of expertise?
ppuxlc: I like to think of it as sort of a spectrum of approaches. If you hand-tune code, I won't have to do much with it. If you give me untuned C, though, I can put in some work analyzing and optimizing it. In the research branch, there's more support for significant behind-the-scenes work to take generic code and make it run efficiently; that's sort of the "we'll do it all for you" approach, where even basic loops might be automatically spun off to an SPE without a hint from the programmer. The idea is to give you good performance right away, and let you decide where to focus your tuning efforts.
You can hand-tune each program for each part of the architecture, or you can use explicit SIMD coding, using, for example, intrinsics, or you can take advantage of full access to the primitives that let you move data to and from the different SPEs as well as global memory. But if you don't want to dive in that deep, or if you are for instance in the initial phase of a project, where you want to have a proof-of-concept or proto-code working quickly, then my automatic tools are really very good.
So it's really a wide range. There is not a single goal, [but rather] a wide range of goals.
dW: How does a developer choose between XL C and GCC?
ppuxlc: Over the course of development, a common ABI was developed by the members of the Sony, Toshiba, and IBM teams, for the SPE environment. The ABI standard defines things like register usage, linkage conventions, stack layout... Both the XL and GCC SPE compilers are compliant with this ABI standard, and thus, we are able to interoperate on the platform. Our goal is to be as interchangeable as possible. We're very good friends, and we agree completely about a lot of things, such as register usage conventions.
dW: As in, all the developer has to do is substitute? If you want to start with GCC, but then decide you want to go XL C, all you've got to do is substitute the libraries you're linking against?
ppuxlc: Yes. One should be able to swap even within the same program, with some parts compiled by one compiler and some parts by the other. We also share the same assembler and linker, which are GNU tools, so essentially I am integrated with the GNU toolchain as well. The developer may choose to compile with GCC or with XL, and the resulting binaries -- since both compilers adhere to the ABI -- can be linked together and will execute correctly.
dW: That's something I think a lot of people would like to know.
ppuxlc: My team wants to make sure that all the libraries, the function calls -- like the runtime environment -- are similar, so that code compiled with either compiler is fully compatible. That's a huge priority. So, for instance, code I generate links with code from the version of GCC in the Cell BE SDK download.
dW: So, somebody just getting started on Cell BE and moving towards building on the Cell BE platform can go and get the GCC toolchain, and we know that that's a fairly mature environment, relatively speaking, but if they do find limitations, they can move straight over to XL.
ppuxlc: Absolutely, and vice versa. But you should be careful not to confuse the GCC compiler with the GNU toolchain: there is a single toolchain, within which the two of us, GCC and I, can be used interchangeably. So they could go with XL for a single program or a single piece of a program, and if GCC were to do a better job somewhere, they could go that route instead.
dW: Are there differences in approach or features between the compilers?
ppuxlc: Yes. I support automatic SIMDization, and the XL compiler family supports automatic parallelization and partitioning; GCC plans to go as far as some level of automatic vectorization support, but there are currently no plans for automatic partitioning and parallelization.
dW: Does this mean that XL C does not have an integrated debugger, linker, and disassembler? Or, is it just that the Alpha [evaluation] edition doesn't include them?
ppuxlc: The GNU toolchain is used to target the SPE; for instance, the GNU assembler and linker are used to build and link object modules.
dW: Okay. Does this swappability apply to regular old, unoptimized code only, though? Once a developer invests time in handrolling code for GCC, or writing code based on the approaches implemented in XL C -- it seems like it would become much less portable or interchangeable. Is this correct, and how much effort is being made to allow independent developers to maintain a single source code tree?
ppuxlc: The more carefully code is tuned, the less it will generally matter which compiler you use. Most optimizations are not going to be specific to one compiler or another, and code built with one can always be linked together with code built with the other.
dW: [Turning back to the tutorial series,] a lot of the techniques in the first three parts of the tutorials, specifically in Parts 2 and 3, can be implemented by hand by a programmer. In particular, instruction starvation is a constant concern. In the original PACT slides, this is dealt with by using "ifetch." What is ifetch?
ppuxlc: <laughs> Oh, "ifetch" is just what I call it. Sort of short for "instruction fetch". As far as the actual instruction, it's a kind of branch hint, the hbr instruction with the P flag, an [...]
dW: Let me ask you about alignment for the dual instruction issue; say you have an indication that no-ops and HBRs should be inserted. Is there a particular preference for which side a no-op should go on and which side an HBR should go on?
ppuxlc: HBR operations are memory operations, so they always go to the memory unit. There are two or three situations for bundling or the dual issue. Sometimes it's just right, so there is nothing to do. Sometimes the two instructions are swapped in memory, so there I need to un-swap them. The problem is that I cannot always swap them or un-swap them, so there are some conditions. So what I do is I transform the dependency graph, especially what my scheduler sees, so that the scheduler will never put two things together that cannot be swapped. That's a high-level explanation. This is in the second tutorial, on SPE optimizations.
Then there's a third situation where essentially the two instructions that want to go together happen to be on two different instruction bundles, like a group of two instructions. There, the only way to make them go together is to insert a no-op prior to that. So that means there are really three situations, and in one of them we need no-op, and in the other one we just need to swap them.
dW: So no-ops can go to either of the functional units?
ppuxlc: That's true. So what I do is try to stick in a no-op so that this no-op will dual-issue with another instruction. For example, if in a prior cycle I see that in the execution record there would be just one arithmetic, but no memory, then I would insert a memory type no-op. If it was the reverse, for instance if there was only a memory instruction in the prior cycle, then I would put a no-op on the arithmetic side.
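The no-op rule described above can be sketched as a toy model in C. This is only an illustration under simplifying assumptions, not the XL scheduler: instructions are reduced to two kinds (arithmetic and memory), an arithmetic instruction followed by a memory instruction is assumed to dual-issue only when the arithmetic one sits in an even slot, and the swapping case is ignored entirely. The names `Kind`, `pad_for_dual_issue`, `NOP_ARITH`, and `NOP_MEM` are hypothetical.

```c
#include <stddef.h>

/* Toy instruction kinds: real instructions plus the two no-op flavors. */
typedef enum { ARITH, MEM, NOP_ARITH, NOP_MEM } Kind;

static int is_arith_side(Kind k) { return k == ARITH || k == NOP_ARITH; }
static int is_mem_side(Kind k)   { return k == MEM   || k == NOP_MEM;   }

/* Copy `in` to `out`, inserting a no-op whenever an arithmetic
 * instruction that is followed by a memory instruction would land in an
 * odd slot. The no-op is the opposite kind of the instruction just
 * before it, so that it can pair with that instruction.
 * Returns the padded length, or 0 if `out` is too small. */
size_t pad_for_dual_issue(const Kind *in, size_t n, Kind *out, size_t cap)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        int wants_pair = i + 1 < n
                      && is_arith_side(in[i])
                      && is_mem_side(in[i + 1]);
        if (wants_pair && len % 2 == 1) {
            if (len >= cap)
                return 0;
            /* len is odd here, so out[len - 1] always exists. */
            out[len] = is_arith_side(out[len - 1]) ? NOP_MEM : NOP_ARITH;
            len++;
        }
        if (len >= cap)
            return 0;
        out[len++] = in[i];
    }
    return len;
}
```

For the input sequence arithmetic, arithmetic, memory, this model produces arithmetic, memory no-op, arithmetic, memory: the no-op fills the empty memory slot next to the first instruction, and the last two instructions now start on an even slot.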
dW: Does an HBR or ifetch make it through the instruction issue, or do you just execute it and take it out of the stream?
ppuxlc: It goes in the memory pipe. It looks very much like a memory issue, so it prevents any other instruction to local store from the memory pipes, thus freeing a slot in the instruction buffer for the hint, but it's explicit.
dW: Could you substitute an ifetch for a lnop?
dW: How many cycles is it to fill an instruction buffer?
ppuxlc: All memory operations are pipelined, so it does take one cycle from a pipeline perspective. It takes about 15 cycles for the actual data to be transferred.
dW: So the memory unit can take new instructions in that time and possibly even execute them, but the instruction fetch is still happening.
ppuxlc: Absolutely, exactly. It's truly pipelined, especially when it is an ifetch. If you look at a perfectly scheduled thing, you'll have lots of memory operations. Each of those takes six cycles to complete, but only takes one cycle of resources from memory. When it is an ifetch, it also takes only one cycle from the local store, meaning that the cycle after that, memory operations can proceed immediately. But there is latency for the operation to complete -- six cycles.
dW: So there is latency on the instruction fetch, but it's not blocking other operations.
ppuxlc: Yes. For one cycle, it prevents other operations; after that, they can continue, even though the fetch isn't complete.
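The difference between latency and issue cost can be put into numbers with a small idealized model in C. It assumes a new operation can issue every cycle and ignores dependencies and stalls; the six-cycle latency is the figure quoted above, and the function names are hypothetical.

```c
/* Cycles until the last of n operations completes when a new operation
 * can issue every cycle and each takes `latency` cycles to finish. */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency)
{
    return n == 0 ? 0 : (n - 1) + latency;
}

/* Cycles for the same work if each operation had to finish before the
 * next one could start. */
unsigned long serialized_cycles(unsigned long n, unsigned long latency)
{
    return n * latency;
}
```

With ten operations and a six-cycle latency, the pipelined model finishes in 15 cycles, while the serialized model needs 60.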
dW: While we're on the ifetch, can you talk a little bit about the phrase "compiler scheduler?" Is this basically something that's counting how many instructions you have executed and it sticks in an ifetch?
ppuxlc: All compilers have a scheduler -- a scheduler, if you will, is a separate phase of the compilation process. A typical scheduler normally does not have to worry about code layout, so that's why my scheduler is a little different.
Sometimes, in some other compilers, you will see a scheduler and a bundler -- it is sometimes two different phases: one which is dealing with the scheduling, which means deciding which instruction goes in which order, and then a later phase which is the bundling. That would be trying to address, for example, these dual-issue ordering rules, or the instruction fetch -- things like that. In XL, those two phases are merged, so the instruction scheduling and bundling happen at the same time.
dW: So you have a smart bundler.
ppuxlc: Yes, and also a smart scheduler. That's because sometimes a decision in the scheduler will impact the bundler, and sometimes a decision in the bundler will impact the scheduler. So, to get the best performance, the two were integrated.
dW: I'd like to talk a little bit, if we can, about the benefits of inlining versus looping. My argument has been that on register-starved architectures like x86, and even PowerPC without proper optimizations that, yes, not inlining can have performance benefits, but with proper optimizations inlining can eliminate a lot of the overhead that comes from the prolog and epilog of a function. Is this correct?
ppuxlc: Potentially, yes. The prolog and epilog are especially large on the Cell BE, because there are so many registers on the SPE, and you have to save and restore a lot of them: four times as many registers as a regular PowerPC architecture (128 instead of 32), and all of them are 128 bits. Inlining removes the save/restore code, which makes it more likely to reduce code size on the SPE than it would be on other processors.
dW: It seems to me you have to find this balance between the proper length of what you might inline versus when you might actually go ahead and do a function call.
ppuxlc: Absolutely, [but] if you do too much inlining on this machine you run out of local memory.
dW: Right. So it can get too large and not be able to fit into local store.
ppuxlc: Absolutely, or having too little space for the buffers, and since you have [to] DMA the data in and out, that would also lead to a [slowdown of the] program.
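The trade-off between code size and buffer space can be sketched as a simple budget calculation in C. It relies on the 256 KB size of an SPE's local store, which the interview itself does not state; the rounding of each buffer down to a multiple of 128 bytes is an assumption made here for DMA-friendly sizes, and `buffer_bytes_each` is a hypothetical helper.

```c
#include <stddef.h>

/* Size of one SPE's local store, shared by code, data, stack, and
 * DMA buffers. */
#define LOCAL_STORE_BYTES (256u * 1024u)

/* Bytes available for each of `nbuffers` equally sized DMA buffers
 * after code, static data, and stack are accounted for, rounded down
 * to a multiple of 128 bytes (an assumption for this sketch).
 * Returns 0 if nothing is left or nbuffers is 0. */
size_t buffer_bytes_each(size_t code, size_t data, size_t stack,
                         size_t nbuffers)
{
    size_t used = code + data + stack;
    if (nbuffers == 0 || used >= LOCAL_STORE_BYTES)
        return 0;
    size_t each = (LOCAL_STORE_BYTES - used) / nbuffers;
    return each - (each % 128u);
}
```

In this model, growing the code from 96 KB to 200 KB (with 16 KB each of data and stack) shrinks each of two buffers from 64 KB to 12 KB, which means more and smaller DMA transfers for the same data.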
dW: Which is going to have a greater effect, then? Say you have a branch outside the other half of your instruction buffer, so your hint misses. Is that going to have a higher overhead or latency than going in and out of local store, or is local store a greater concern?
ppuxlc: Local store is a greater concern. I want to say that the hints are not a problem with the function call, because the function call is typically hinted, which means that if you have a program that currently is going to do a call, we can insert a hint ahead of time, so that essentially the change of direction of the program will happen seamlessly with no penalty.
dW: Where would a developer want to look in order to read up on what kind of modeling will help to make a good decision on whether to inline or to function call?
ppuxlc: I think essentially you want to inline where it gives you more performance from the code perspective. For example, if you have an innermost loop that is executed repeatedly and very actively, you definitely do not want any function call in it, because by removing the function call you expose more code. There will be more invariants that can be removed from the innermost loop [during automatic optimizations], and you will get essentially a lot of performance for a small increase in code size. However, if you inline functions that are rarely called, you pay all the costs of having a larger program with very little benefit.
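The point about inlining exposing loop invariants can be shown with a small hand-written sketch in C. The function names are hypothetical, and the second loop only illustrates by hand what an optimizer may be able to do once the call is inlined; whether a particular compiler actually does so depends on the compiler and its options.

```c
#include <stddef.h>

/* Out-of-line helper: the bias is recomputed on every call, even
 * though it does not depend on x. */
float scale_one(float x, float gain, float bias_num, float bias_den)
{
    float bias = bias_num / bias_den;
    return x * gain + bias;
}

/* Innermost loop with a function call per element. */
void scale_with_calls(float *v, size_t n, float gain, float num, float den)
{
    for (size_t i = 0; i < n; i++)
        v[i] = scale_one(v[i], gain, num, den);
}

/* The same loop after inlining by hand: with the body visible, the
 * division is clearly loop-invariant and can be hoisted out. */
void scale_inlined(float *v, size_t n, float gain, float num, float den)
{
    const float bias = num / den;   /* computed once, not n times */
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * gain + bias;
}
```

Both versions produce the same values; the inlined one avoids the per-element call overhead and performs the division once instead of once per element.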
dW: On that note, we'll wrap up. Thank you again for coming.
<standard input>, line 1.1: 1506-059 (S) Comment that started on line 1 must end before the end of file.
*Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
This interview introduces a tutorial series on developerWorks; the tutorial series in turn introduces you to compiling for Cell Broadband Engine processor. The tutorial series is based on a presentation originally given at PACT 2005 by members of the IBM Research
Learn more about XL C and
If you liked "interview with the compiler," you may also enjoy these other interviews with these (much anthropomorphized pieces of software: SatireWire featured an with the search engine; and The Onion's Ask a... advice column in 1999 featured Ask a
The IBM Semiconductor Solutions Technical Library Cell Broadband Engine documentation section lists specifications, user manuals, and more.
Find all Cell BE-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell BE.
Keep abreast of all the Cell BE -- and other Power Architecture-related news: subscribe to the Power Architecture Community Newsletter.
Get products and technologies
Get Cell BE: Contact IBM E&TS for custom Cell BE-based or custom-processor based
Get the alphaWorks Cell Broadband Engine SDK -- including the IBM Full System Simulator, software samples and library code, and -- of course -- the alphaWorks XL C Alpha edition.
downloads on one page.
- Participate in the discussion forum. Take part in the IBM developerWorks Power Architecture Cell Broadband Engine discussion forum.
Send a letter to the editor.
The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at firstname.lastname@example.org.