ALF for Hybrid-x86 is an implementation of the ALF API specification in a system configuration with an Opteron x86_64 system connected to one or more Cell/B.E. processors. In this implementation, the Opteron system serves as the host, the SPEs on the Cell/B.E. BladeCenters act as accelerators, and the PPEs on the Cell/B.E. processors act as facilitators only.
From the ALF application programmer's perspective, the application interaction as defined by the ALF API is between the Hybrid-x86 host and the SPE accelerators.
This implementation of the ALF API uses the Data Communication and Synchronization (DaCS) library as the process management and data transport layer.
To manage the interaction between the ALF host runtime on the Opteron system and the ALF accelerator runtime on the SPE, this implementation starts a PPE process (ALF PPE daemon) for each ALF runtime. The PPE program is provided as part of the standard ALF runtime library.
Installing and configuring
The ALF for Hybrid-x86 library should be installed as a component of the SDK 3.0 (for more information, refer to the SDK 3.0 Installation Guide). The following packages are provided for the ALF for Hybrid-x86 library.
alf-hybrid-3.0.0-*.x86_64.rpm. ALF for Hybrid-x86 runtime. This contains the optimized ALF host library for the x86_64 system.
alf-hybrid-3.0.0-*.ppc64.rpm. This package contains the static SPE accelerator runtime library and the ALF PPE daemon program.
alf-hybrid-devel-3.0.0-*.ppc64.rpm. ALF for Hybrid-x86 development package for PPC64 -- contains the SPE accelerator header files and error-checking enabled SPE accelerator libraries.
alf-hybrid-devel-3.0.0-*.x86_64.rpm. ALF for Hybrid-x86 development package for x86_64 -- contains the header files, optimized static x86_64 host libraries, and x86_64 error-checking enabled libraries.
alf-hybrid-trace-3.0.0-*.ppc64.rpm. ALF for Hybrid-x86 trace-enabled runtime for PPC64 -- contains the trace-enabled SPE accelerator runtime library and the ALF PPE daemon program.
alf-hybrid-trace-3.0.0-*.x86_64.rpm. ALF for Hybrid-x86 trace-enabled runtime for the x86_64 host -- contains the trace-enabled host runtime library.
alf-hybrid-trace-devel-3.0.0-*.ppc64.rpm. ALF for Hybrid-x86 trace-enabled development package -- contains the trace-enabled header files for SPE.
alf-hybrid-trace-devel-3.0.0-*.x86_64.rpm. ALF for Hybrid-x86 trace-enabled development package -- contains the trace-enabled header files for the x86_64 host and the trace-enabled x86_64 static host library.
alf-hybrid-cross-devel-3.0.0-*.noarch.rpm. ALF Hybrid-x86 Cross development package -- contains the header files and libraries needed for cross-architecture development.
alfman-3.0-*.noarch.rpm. ALF for Hybrid-x86 man pages.
alf-hybrid-examples-source-3.0.0-*.noarch.rpm. ALF for Hybrid-x86 example sources.
Building an application
Three versions of the ALF for Hybrid-x86 libraries are provided with the SDK:
- Optimized: This library has minimal error checking on the SPEs and is intended for production use.
- Error-check enabled: This version performs much more extensive error checking on the SPEs and is intended for application development.
- Traced: These are the optimized libraries with performance and debug trace hooks in them. These are intended for debugging functional and performance problems associated with ALF.
Additionally, both static and shared libraries are provided for the ALF host libraries. The ALF SPE runtime library is only provided as static libraries.
An ALF for Hybrid-x86 application must be built as two separate binaries:
- The first binary is for the ALF host application; you need to do the following:
- Compile the x86_64 host application with the -D_ALF_PLATFORM_HYBRID define and specify the /opt/cell/sdk/prototype/usr/include include directory.
- Link the x86_64 host application with the ALF x86_64 host runtime library, alf_hybrid, found in the /opt/cell/sdk/prototype/usr/lib64 directory and the DaCS x86_64 host runtime library, dacs_hybrid, also found in the /opt/cell/sdk/prototype/usr/lib64 directory.
- The second binary is for the ALF SPE accelerator computational kernel; you need to do the following:
- Compile the application's SPE code with the -D_ALF_PLATFORM_HYBRID define and specify the /opt/cell/sysroot/usr/spu/include and /opt/cell/sysroot/opt/cell/sdk/prototype/usr/spu/include include directories.
- Link the application's SPE code with the ALF SPE accelerator runtime library, alf_hybrid, found in the /opt/cell/sysroot/opt/cell/sdk/prototype/usr/spu/lib directory.
- Use the ppu-embedspu utility to embed the SPU binary into a PPE ELF image. The resulting PPE ELF object must then be linked as a PPE shared library.
For reference, Makefiles are provided for all of the samples in the package alf-hybrid-examples-source-3.0.0-*.noarch.rpm.
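The two-binary build described above can be sketched as the following commands. This is a minimal sketch, not a definitive build recipe: the source file names (host.c, kernel_spu.c), the embed handle name (my_kernel), and the plain gcc invocations are illustrative assumptions; the sample Makefiles in the examples package show the authoritative flags.

```shell
# Host binary (x86_64): compile, then link against the ALF and DaCS host runtimes.
# host.c is a placeholder name for your host application source.
gcc -D_ALF_PLATFORM_HYBRID -I/opt/cell/sdk/prototype/usr/include \
    host.c -o my_appl \
    -L/opt/cell/sdk/prototype/usr/lib64 -lalf_hybrid -ldacs_hybrid

# Accelerator binary: compile the SPE computational kernel.
# kernel_spu.c is a placeholder name for your SPE kernel source.
spu-gcc -D_ALF_PLATFORM_HYBRID \
    -I/opt/cell/sysroot/usr/spu/include \
    -I/opt/cell/sysroot/opt/cell/sdk/prototype/usr/spu/include \
    kernel_spu.c -o kernel_spu \
    -L/opt/cell/sysroot/opt/cell/sdk/prototype/usr/spu/lib -lalf_hybrid

# Embed the SPU binary into a PPE ELF object, then link that object
# as a PPE shared library.
ppu-embedspu my_kernel kernel_spu kernel_spu-embed.o
ppu-gcc -shared kernel_spu-embed.o -o my_appl.so
```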
Running an application
Before you do the following steps to run an ALF application, you need to ensure that the dynamic libraries libalf_hybrid and libdacs_hybrid are accessible. You can set this through LD_LIBRARY_PATH. For example:
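A typical setting, assuming the shared libraries live in the SDK library directory named in the build section (adjust the path if your installation differs):

```shell
# Make the shared ALF and DaCS host runtimes visible to the dynamic loader.
export LD_LIBRARY_PATH=/opt/cell/sdk/prototype/usr/lib64:$LD_LIBRARY_PATH
```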
The steps to running the application are
- Build the ALF for Hybrid-x86 application: the host application as an executable, my_appl, and the accelerator computational kernel as a PPE shared library, my_appl.so.
- Copy the PPE shared library with the embedded SPE binaries from the host where it was built to a selected directory on the Cell/B.E. where it is to be executed. For example: scp my_appl.so <CBE>:/tmp/my_directory.
- Set the environment variable ALF_LIBRARY_PATH to the directory selected in step 2 on the Cell/B.E. For example: export ALF_LIBRARY_PATH=/tmp/my_directory.
- Set the processor affinity on the Hybrid-x86 host. For example: taskset -p 0x00000001 $$.
- Run the x86_64 host application in the host environment. For example: ./my_appl.
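Put together, a run might look like the following shell session. The names my_appl, my_appl.so, <CBE>, and /tmp/my_directory are the illustrative placeholders used in the steps above; substitute your own binary names and Cell/B.E. hostname.

```shell
# Step 2: copy the PPE shared library to the Cell/B.E. system.
scp my_appl.so <CBE>:/tmp/my_directory

# Step 3: tell the ALF runtime where to find it on the Cell/B.E. side.
export ALF_LIBRARY_PATH=/tmp/my_directory

# Step 4: pin the host shell (and its children) to one processor.
taskset -p 0x00000001 $$

# Step 5: run the host application.
./my_appl
```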
Linking to the correct library
Make sure that ALF applications are linked with the correct library for the ALF implementation intended (CBEA or Hybrid-x86). Linking against the wrong library does not produce a link error, but can result in endian problems with input or output data (wrong results) or address translation issues on the SPU (typically DMA errors from SPU initiated DMAs -- the SPU returns a fatal error and the ALF runtime exits).
Optimizing an application
To optimize your ALF applications, we're going to look at the following topics: accelerator data partitioning, multi-use work blocks, and data sets.
Using accelerator data partitioning. If the application operates in an environment where the host has many accelerators to manage and the data partition schemes are particularly complex, it is generally more efficient for the application to partition the data and generate the data transfer lists on the accelerators instead of on the host.
Using multi-use work blocks. If there are many instances of the task running on the accelerators and the amount of computation per work block is small, the ALF runtime can become overwhelmed with moving work blocks and associated data in and out of accelerator memory. In this case, multi-use work blocks can be used in conjunction with accelerator data partitioning to further improve performance for an ALF application.
Based on the first two topics, what should you consider for data layout design? Efficient data partitioning and data layout design are the key to a well-performing ALF application. Data partitioning and layout are closely coupled with compute kernel design and implementation and should be considered together. Keep the following in mind for your data layout and partition design:
- Use the correct size for the data partitioned for each work block. Often the local memory of the accelerator is limited. Performance can degrade if the partitioned data cannot fit into the available memory.
- Minimize the amount of data movement. A large amount of data movement can cause performance loss in applications. Improve performance by avoiding unnecessary data movements.
- Simplify data movement patterns. Although the data transfer list feature of ALF enables flexible data gathering and scattering patterns, it is better to keep the data movement patterns as simple as possible. Some good examples are sequential access and using contiguous movements instead of small discrete movements.
- Avoid data reorganization. Data reorganization requires extra work. It is better to organize data in a way that suits the usage pattern of the algorithm than to write extra code to reorganize the data when it is used.
Using data sets. The data set is the primary mechanism for optimizing an ALF application on a hybrid system. Using data sets significantly improves ALF application performance in the hybrid environment. The ALF on Hybrid implementation uses the data set to speed up data read/write access time by migrating the data closer to the accelerators. That is, when the task is ready and is scheduled for execution on a specific PPE, the associated data set is transferred from the Hybrid host to the PPE. This makes the data set available for the accelerator's data access. All read-only and read-write buffers are transferred at that point.
Similarly, when you issue the alf_task_wait function and the task completes, the data set is transferred from the PPE back to the Hybrid host. This makes the data set available for the host's data access. All write-only and read-write buffers are transferred at that point.
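In code, attaching a data set to a task looks roughly like the sketch below. The function names (alf_dataset_create, alf_dataset_buffer_add, alf_task_dataset_associate) come from the ALF API, but the exact signatures and the access-mode flag names shown here are assumptions sketched from the API reference; check alf.h in your SDK installation for the authoritative prototypes.

```c
/* Sketch only -- requires the ALF for Hybrid-x86 SDK headers and runtime. */
#include <alf.h>

void submit_with_dataset(alf_handle_t alf, alf_task_handle_t task,
                         void *in, unsigned long long in_size,
                         void *out, unsigned long long out_size)
{
    alf_dataset_handle_t dataset;

    /* Describe the host buffers the task will touch.  At task start the
     * read-only/read-write buffers migrate from the Hybrid host to the PPE;
     * at alf_task_wait the write-only/read-write buffers migrate back. */
    alf_dataset_create(alf, &dataset);
    alf_dataset_buffer_add(dataset, in, in_size, ALF_DATASET_READ_ONLY);
    alf_dataset_buffer_add(dataset, out, out_size, ALF_DATASET_WRITE_ONLY);

    /* Associate the data set with the task before enqueuing work blocks. */
    alf_task_dataset_associate(task, dataset);
}
```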
There are several platform-specific constraints for the ALF implementation on the Hybrid-x86 architecture, including SPE accelerator memory constraints, data transfer list limitations, data set constraints, and others. Let's look at each of them.
SPE accelerator memory constraints. The size of local memory on the SPE accelerator is 256KB and is shared by code and data. Memory is not virtualized and is not protected. In the typical memory map of an SPU program, the runtime stack sits above the global data section and grows from higher addresses toward lower addresses until it reaches the global data section.
Because of limitations in the programming languages and in the compiler and linker tools, you cannot predict the maximum stack usage when you develop the application or when the application is loaded. If the stack requires more memory than was allocated, you do not get a stack overflow exception (unless stack checking was enabled by the compiler at build time); you get undefined results such as a bus error or an illegal instruction.
When a stack overflow is detected, the SPU application is shut down and a message is sent to the PPE. ALF allocates the work block buffers directly from the memory region above the runtime stack; this is implemented by moving the stack pointer (or, equivalently, by pushing a large amount of data onto the stack).
The larger these buffers are, the better ALF can optimize the performance of a task by using techniques such as double buffering, so it is better to let ALF allocate as much memory as possible from the runtime stack region. However, if the remaining stack is then too small at runtime, a stack overflow occurs and causes unexpected failures such as incorrect results or a bus error.
Data transfer list limitations. Data transfer information is used to describe the five types of data movement operations for one work block as defined by ALF_BUF_TYPE_T. The ALF implementation on Cell/B.E. has the following internal constraints:
- Data transfer information for a single work block can consist of up to eight data transfer lists for each type of transfer as defined by ALF_BUF_TYPE_T. For programmers, the limitation is that alf_wb_dtl_begin can be called no more than eight times for each type of ALF_BUF_TYPE_T for each work block. An ALF_ERR_NOBUFS error is returned in this case. Because of limitation items 2, 3, and 4 in this list, it is possible to reach this limitation without explicitly calling alf_wb_dtl_begin eight times.
- Each data transfer list consists of up to 2048 data transfer entries. The alf_wb_dtl_entry_add call automatically creates a new data transfer list of the same type when this limitation is reached. Limitation item 1 in this list still applies in this case.
- Each entry can describe up to 16KB of data transfer between a contiguous area in host memory and accelerator memory. The alf_wb_dtl_entry_add call automatically breaks an entry larger than 16KB into multiple entries. Limitation items 1 and 2 in this list still apply in this case.
- All of the entries within the same data transfer list share the same high 32 bits of the effective address. This means that when a data transfer entry crosses a 4GB address boundary, it must be broken up and put into two different data transfer lists. In addition, if two successive entries use different high 32-bit addresses, they must be put into two lists. The alf_wb_dtl_entry_add call automatically creates a new data transfer list in both situations. Limitation items 1, 2, and 3 in this list still apply in this case.
- The local store area described by each entry within the same data transfer list must be contiguous. You can use the local buffer offset parameter offset_to_accel_buf to address within the local buffer when alf_wb_dtl_begin is called to create a new list.
- The transfer size and the low 32 bits of the effective address for each data transfer entry must be 16-byte aligned. The alf_wb_dtl_entry_add call does NOT automatically deal with alignment issues for you. An ALF_ERR_INVAL error is returned if there is an unaligned address. The same limitation also applies to the offset_to_accel_buf parameter of alf_wb_dtl_begin.
Data set constraints. The ALF for Hybrid-x86 data set implementation is limited to eight buffers. Providing a data set with no data buffers or buffers with a size of zero returns an error. The API specification does not specify the behavior in this case.
Other limitations. Other known limitations for the ALF for Hybrid-x86 implementation include
- The current implementation does not support the registration of an error handler. A call to the API function alf_error_handler_register results in a no-op. ALF does not provide a default error handler; the runtime exits when it encounters an error.
- Multiple invocations of alf_init() within the same program are not supported. A second call to alf_init() results in an error.
- Accelerator data partitioning only works if there is a data set associated with the specified task.
- The ALF for Hybrid-x86 implementation currently does not support data coherency across multiple Cell/B.E. BladeCenters. Using a data set with one Hybrid x86 host and multiple Cell/B.E. BladeCenters leads to data corruption. However, host data partitioning without data sets works across multiple Cell/B.E. BladeCenters.
- The ALF API supports the concept of heterogeneous accelerators; however, in this hybrid implementation of ALF, only the SPE is supported as an accelerator. The accel_type parameter in the API function alf_task_desc_create is ignored.
- If strict checking is enabled (by providing a definition for _ALF_CFG_CHECK_STRICT_), all the buffer sizes provided through alf_task_desc_set_int32 need to be multiples of 16 bytes; otherwise, the SPE returns a fatal error.
Taken from the Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference.