Programming the Cell Broadband Engine - Embedded.com

Programming the Cell Broadband Engine

With nine processor cores, a single Cell processor chip (called a Cell Broadband Engine or CBE) often performs an order of magnitude more work than a traditional single-core chip at the same clock rate. Cell's parallel configuration and performance are seldom seen in traditional CPU architectures for any market, much less the cost-sensitive consumer electronics business.

The CBE, jointly created by Sony, Toshiba, and IBM, distributes its huge computational capacity over two different kinds of processor cores, so its development environment is quite different from that of comparatively conventional homogeneous multiprocessor architectures. Cell programmers need special facilities to help them harvest such computation resources effectively.

In this article, I'll introduce a new object format called CBE Embedded SPE Object Format (CESOF). Programmers in the Sony, Toshiba, and IBM Cell Broadband Engine Design Center (STIDC) created the CESOF specification to help Cell programmers integrate the interacting programs for these two different types of processor cores. I'll also introduce the design concept, the structure, and a simple usage sample of CESOF.

Multiple cores in the Cell processor
Using heterogeneous (that is, different kinds of) processor cores in a multi-core system has become a popular practice in the embedded systems space. For a particular algorithm or application with expected regularity, a specialized and highly optimized circuit usually provides better performance within a smaller chip area and with lower power consumption than general-purpose cores. In fact, embedded systems designers recognize that many of their applications or workloads can benefit from specialized cores such as a single instruction, multiple data (SIMD) engine, a floating-point accelerator, or a direct memory access (DMA) controller.

High-performance computation workloads, modern media-rich applications, and many algorithms in other domains all exhibit a lot of regularity in their tasks. Replacing one or more of the generic processor cores with specialized circuits will likely give a better performance/cost ratio for these applications.

Cell's chip design, shown in Figure 1, strikes a balance by using one generic Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs) to provide a better performance/cost ratio (in terms of chip area and power consumption), particularly for high-performance computing and media processing. The eight SPEs are specialized SIMD cores, each with its own private local memory. The performance/cost ratio is particularly impressive when an algorithm can be distributed over all eight engines at the same time with properly staged data traffic.

Each SPE can run independently from the others. Its instruction set is designed to execute SIMD instructions efficiently. All SIMD instructions handle 128-bit vector data in different element configurations: byte, half word, word, and quad-word sizes. For example, one SIMD instruction can perform 16 character operations at the same time.
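To make the "16 character operations at the same time" idea concrete, here is a minimal sketch in portable C that models one 128-bit register as 16 byte lanes. The type vec128_u8 and the helper functions are hypothetical names used only for illustration; they are not the SPU intrinsics, and a real SPE performs the whole addition in a single instruction rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit SIMD register viewed as 16 byte lanes. */
typedef struct {
    uint8_t lane[16];
} vec128_u8;

/* Fill every lane with the same byte value. */
static vec128_u8 vec_splat_u8(uint8_t value)
{
    vec128_u8 r;
    for (size_t i = 0; i < 16; i++) {
        r.lane[i] = value;
    }
    return r;
}

/* Add two vectors lane by lane; each lane wraps modulo 256 on its own. */
static vec128_u8 vec_add_u8(vec128_u8 a, vec128_u8 b)
{
    vec128_u8 r;
    for (size_t i = 0; i < 16; i++) {
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
    }
    return r;
}

/* Read one lane, checking that the index is within the 16 lanes. */
static uint8_t vec_lane_u8(vec128_u8 v, size_t index)
{
    assert(index < 16);
    return v.lane[index];
}
```

The point of the model is that lanes never interact: a carry out of one byte does not spill into its neighbor, which is what lets the hardware treat the 16 additions as one operation.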

Another important design aspect is the use of the on-chip local memory located next to each SPE. This closeness reduces the distance, and thus the latency, from a processor core to its execution memory space. The address space of the SPE instructions spans only its own local memory; the SPE fetches instructions from the local store, loads data from the local store, and stores data to the local store. SPE instructions cannot “see” the rest of the chip's (or the system's) address space.

The simplicity of this local memory design improves memory-access time, memory bandwidth, chip area, and power consumption, but it does require extra steps for an SPE program to bring external data into the local store. An SPE can't load or store data to/from the system memory directly. Instead, it uses a DMA operation to transfer data between the system memory and its local store. This is quite different from the general-purpose PPE core. The PPE load and store instructions access the data directly from the effective address backed by off-chip physical system memory.
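The extra steps can be sketched with a small host-side model. Everything here is an assumption made for illustration: spe_model, dma_get, dma_put, and spe_uppercase are hypothetical names, the local store size is arbitrary, and memcpy stands in for the DMA engine. The only property the sketch tries to capture is that the "SPE program" touches nothing but its local store, so data must be moved in and out explicitly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LS_SIZE 1024u /* illustrative size only, not the real local store size */

/* Hypothetical model of one SPE: just its private local store. */
typedef struct {
    uint8_t local_store[LS_SIZE];
} spe_model;

/* Model of a DMA "get": system memory -> local store. */
static void dma_get(spe_model *spe, uint32_t ls_offset, const void *ea, uint32_t size)
{
    assert(ls_offset <= LS_SIZE && size <= LS_SIZE - ls_offset);
    memcpy(&spe->local_store[ls_offset], ea, size);
}

/* Model of a DMA "put": local store -> system memory. */
static void dma_put(const spe_model *spe, uint32_t ls_offset, void *ea, uint32_t size)
{
    assert(ls_offset <= LS_SIZE && size <= LS_SIZE - ls_offset);
    memcpy(ea, &spe->local_store[ls_offset], size);
}

/* Stand-in for SPE code: it reads and writes the local store only. */
static void spe_uppercase(spe_model *spe, uint32_t ls_offset, uint32_t size)
{
    assert(ls_offset <= LS_SIZE && size <= LS_SIZE - ls_offset);
    for (uint32_t i = 0; i < size; i++) {
        uint8_t *c = &spe->local_store[ls_offset + i];
        if (*c >= 'a' && *c <= 'z') {
            *c = (uint8_t)(*c - ('a' - 'A'));
        }
    }
}
```

A typical sequence is dma_get, then the computation, then dma_put; until the final put, the copy in system memory is unchanged.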

As a side note, the internal core of an SPE, without the DMA engine, is called a Synergistic Processor Unit (SPU). The use of SPU in the naming convention of software code is sometimes intermixed with the use of SPE where a distinction may not be as important.

Connecting these nine cores (one PPE and eight SPEs) with the physical memory is a high-speed bus called Element Interconnect Bus (EIB). Through this bus, an SPE DMA engine (not the SPE load and store instructions) transfers data between the system memory and its local store memory.

Developing and combining the code modules for cores with different instruction sets and memory spaces presents a big challenge to conventional programming tools. Programmers need an additional facility, such as CESOF, to glue these heterogeneous code modules together. In the remainder of this article I'll introduce the design concept, the structure, and a simple usage example of CESOF.

The role of a symbol
Programmers understand that symbols are key to the entire development activity. They provide the fundamental abstraction mechanism for programmers to represent and reuse all kinds of programming elements. For example, symbols name and represent functions and data variables. A programmer writes code in a programming language to define a symbol that can be used in other places.

The definition of a symbol in a module and its use in other modules represent how the module interacts with the others. The use of the symbol effectively captures the dependency relationship of the interactions. As those interactions grow more complicated, the tool chain can help the programmer sort out the interactions by using the symbol information of the interdependent modules.

However, not all the interactions can be represented by symbols. For example, the dynamic side effects of a function call can't be represented by the use of symbols or its function interfaces. When the interaction information isn't captured inside of the code module, a tool can't help the programmer manage such dependencies. A programmer needs to programmatically handle these dynamic dependencies during runtime and pay a runtime overhead.

The role of a linker
Figure 2 shows that source code is compiled into a linkable code module, which a linker subsequently combines with other linkable modules into an executable image. The executable image can now be loaded into system memory for the processor to execute.

You use the linker to combine interdependent code modules (linkables) into an executable binary image for its target processor. The entire linked image is stored in a file or a set of closely related files called executables. The linker arranges the executable image so that an operating system can load it efficiently into the system memory for execution.

A linker can help you find the dependencies among the code modules. It combines the code modules together based on where a particular symbol is defined (implemented) and where the symbol is used.

A symbol usually has an associated value. For a function or data object, the symbol value is usually the effective address where the function or the data object is placed in the final execution image. (Different symbol types have different kinds of values. Detailed definitions can be found in the tool interface standard for the specific target architecture.)

For the purposes of this article, I'll call the process of finding a symbol's definition symbol resolution, and the process of connecting the use of a symbol to its matching definition symbol binding. The linker resolves and binds the symbols of the code modules ahead of time so that an application doesn't repeatedly pay the linking overhead during run time. Linking also rearranges the machine code pieces in a way that a loader can load easily into system memory. This shortens the initialization time of the program.
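The two terms can be illustrated with a toy model. The sketch below is not how a real ELF linker works (there are no relocation records or sections); symbol_def, symbol_ref, resolve_symbol, and bind_symbols are hypothetical names chosen for this example. Resolution is the lookup of a definition by name, and binding is the write of the resolved address into the place that uses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A symbol definition: a name and the address assigned to it. */
typedef struct {
    const char *name;
    uint64_t address;
} symbol_def;

/* A symbol reference: a name and the slot that must receive its address. */
typedef struct {
    const char *name;
    uint64_t *slot;
} symbol_ref;

/* Symbol resolution: find the definition of a name, or NULL if undefined. */
static const symbol_def *resolve_symbol(const symbol_def *defs, size_t ndefs,
                                        const char *name)
{
    for (size_t i = 0; i < ndefs; i++) {
        if (strcmp(defs[i].name, name) == 0) {
            return &defs[i];
        }
    }
    return NULL;
}

/* Symbol binding: store each resolved address into its reference slot.
 * Returns the number of references that could not be resolved. */
static size_t bind_symbols(const symbol_def *defs, size_t ndefs,
                           const symbol_ref *refs, size_t nrefs)
{
    size_t unresolved = 0;
    for (size_t i = 0; i < nrefs; i++) {
        assert(refs[i].slot != NULL);
        const symbol_def *def = resolve_symbol(defs, ndefs, refs[i].name);
        if (def != NULL) {
            *refs[i].slot = def->address;
        } else {
            unresolved++;
        }
    }
    return unresolved;
}
```

Doing this once at link time is what saves the application from repeating the lookup every time it runs.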

For a CBE application, execution usually involves PPE and SPE programs interacting and running at the same time. It's also generally desirable to allow a programmer to represent the inter-architectural (PPE versus SPE) interactions symbolically in the code module so that a linker can possibly handle such dependencies. The inter-architectural linking facility will similarly improve the initialization and execution performance of the final executable image.

The whole execution image of the application consists of code modules from two different instruction sets. The traditional tool chain infrastructure and Executable and Linking Format (ELF), as defined in the Tool Interface Standard, is not set up for inter-architectural linking. It assumes that the symbols and their resolution mechanism are defined and used in the same architectural space.

As shown in Figure 3, early Cell programmers were forced to use two independent sets of tool chains. They used two different compilers and two different linkers, one set for the PPE executables and one set for the SPE executables. The linkables and executables for different architectures did not intermix. They had to be developed and managed separately.

Because the PPE and SPE code modules were developed separately from one another, the programmers had no way of specifying inter-architectural symbols for inter-architectural interactions. Their inter-architectural dependencies thus could not be recognized and handled by either linker. Before CESOF, a CBE programmer had to write code to handle these inter-architectural module interactions and their dependencies. This additional step not only cost performance during runtime but was also a tedious and error-prone task for the programmer.

Inter-architectural program interactions
It turns out that the majority of Cell's inter-architectural module interactions can be mapped to different effective addresses. The effective addresses can be logically represented by symbols in the effective-address space (on the PPE side). CESOF is a usage convention of the current ELF. It allows you to associate these PPE symbols with the corresponding SPE modules. The new convention thus allows the SPE modules to “participate” in a meaningful way in the module interactions of the PPE linking and execution process.

So far, I have identified two types of commonly seen inter-architectural program interactions that could be logically represented by PPE symbols. The representation allows the PPE linker and runtime environment to effectively resolve and bind such dependencies on behalf of the programmer. I indicated these interactions using the blue lines in Figure 3. Note that there are other kinds of inter-architectural interactions, like mailbox access and interrupts, that can't be mapped to effective addresses or symbol representations reasonably well. You still have to handle those interactions manually.

PPE program accessing SPE programs
The first type of interaction relates to the role of a PPE program in a CBE application. Due to the generic nature of the PPE core, a PPE program usually acts as a controller. It starts and stops other PPE tasks as well as the interacting SPE programs.

You can see in Figure 3 that before an SPE program can be loaded into the SPE local store for execution, the SPE executable image must first be mapped (either by the operating system at run time or by the PPE program itself) to a global effective-address memory. The mapped address can then be made available to the controlling PPE program. Through an SPE loader, the SPE executable image is then transferred (via DMA) from the effective-address memory to the SPE's local store for execution. For any SPE programs needed by a CBE application, their mapped effective addresses must eventually be given to the SPE loader. You can logically assign a PPE symbol to represent this SPE executable image, and the symbol value will be its mapped effective address.

SPE program accessing PPE symbols
The second type of interaction is when an SPE program needs the effective address of a certain data object residing in the effective-address space. In most cases, the global address is used as the source or target of a DMA operation.

In conventional single-architecture programming, you can obtain the effective address directly through the symbol value of a statically defined data object. The SPE programming environment, however, doesn't have any such symbol representation in the shared symbol space. You must use other runtime means, such as an explicit DMA operation, to pass an effective address to an SPE program. Setting up this type of interaction can be tedious and error-prone for a programmer without the help of a linker. It also costs run-time overhead during program initialization.

For statically defined EA (effective-address) data objects, CESOF introduces a new type of SPE variable to hold effective addresses in the SPE program, as shown in Figure 4. Such a variable is called an effective-address reference (EAR). Each EAR variable keeps one resolved effective address of a PPE symbol. (Note that the EAR variable is not the EA data object itself.) CESOF logically assigns the EAR variable an SPE symbol “mangled” from the associated PPE symbol name: it prefixes the _EAR_ string to the original PPE symbol, so the PPE symbol foo corresponds to the SPE symbol _EAR_foo.
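The naming rule is simple enough to sketch in a few lines of C. The function ear_symbol_name below is a hypothetical helper written for this illustration; it is not part of the CESOF tools, and it only shows the string transformation described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EAR_PREFIX "_EAR_"

/* Build the SPE-side EAR symbol name for a PPE symbol.
 * Returns 0 on success, or -1 if the output buffer is too small. */
static int ear_symbol_name(const char *ppe_symbol, char *out, size_t out_size)
{
    assert(ppe_symbol != NULL && out != NULL);

    size_t prefix_len = strlen(EAR_PREFIX);
    size_t symbol_len = strlen(ppe_symbol);

    /* Room is needed for the prefix, the symbol, and the terminating NUL. */
    if (prefix_len + symbol_len + 1 > out_size) {
        return -1;
    }
    memcpy(out, EAR_PREFIX, prefix_len);
    memcpy(out + prefix_len, ppe_symbol, symbol_len + 1);
    return 0;
}
```

Because the rule is a fixed prefix, a tool can recover the PPE symbol name from the SPE symbol name without any additional table.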

I'll show later how providing such an association enables a linker or similar tool to use the symbol information to bind the EAR variables to their corresponding effective-address values and thus effectively pass to the SPE program these effective addresses.

On the other hand, a PPE program may also need to access the locally stored data objects represented by the SPE symbols defined in the SPE code module. Because the resolved value of an SPE symbol doesn't change in different contexts when the SPE executable is used by different PPE programs, the resolved symbol value can be extracted directly from the symbol information of the SPE executable. The extracted values can be reused directly by the PPE linker (per linker script). There's still some discussion over whether the CESOF specification should be extended to cover this.

Structure of CESOF linkable
By representing logical dependencies with the use of PPE symbols, the PPE linker can resolve the symbol dependencies and bind the references for the programmer.

CESOF defines how a programmer can wrap an SPE executable into a PPE linkable with the additional PPE symbols representing the inter-architectural dependency. In order to allow the PPE linker to handle such objects, CESOF must be compatible with the existing Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification for the PPE architecture. Thus, once wrapped, the new linkable is the same as any other PPE linkable that the PPE linker can properly link into a CBE executable image.

Figures 5 and 6 illustrate how the CESOF linkable allows an embedded SPE executable to participate in the PPE linking process, and, subsequently, piggyback on the loading process of the entire CBE executable image into the effective-address memory.

In Figure 5, a tool called embedspu wraps an SPE executable file into a CESOF linkable file. The CESOF file contains the image of the original SPE executable plus additional PPE symbol information representing the two types of inter-architectural interactions previously discussed.

The CESOF linkable, which is itself a PPE linkable, can now be linked with other PPE linkables to form a PPE executable as shown in Figure 6. The PPE executable image contains not only the PPE code modules but also the embedded SPE executable image. I'll call a PPE executable with embedded SPE executable images a CBE executable. The PPE loader can load the CBE executable including the embedded SPE executable image(s) just like any other PPE executable into the effective-address space. From there, an SPE loader can load the SPE executable image into the target local store.

Elements in a CESOF linkable
A CESOF file is an ELF file targeted for PPE architecture and is itself a PPE linkable. It uses several ELF sections to include the embedded SPE executable image with additional PPE symbol information.

Figure 7 shows the structure of a CESOF linkable, including the three CESOF elements created by embedspu.

1. Embedded SPE ELF image: A special section called .spe.elf contains a whole copy of the SPE ELF image (an executable image) that can be loaded by an SPE loader into an SPE local store for execution. This SPE ELF image is generated by the SPE linker. The embedspu utility creates a PPE symbol to represent this .spe.elf section. This part represents the first type of inter-architectural interaction previously mentioned.

Because the SPE executable image might be shared by several CBE applications running on the same machine, this section is marked read-only and, if shared, the operating system should keep only one copy of the CESOF linkable image in the system memory.

2. Shadow SPE toe area: CESOF uses this shadow area to represent the second type of interaction. Because the effective-address symbols are not applicable to the SPE programming space, instead of using a real effective-address symbol in the SPE code, the SPE programmer declares a special structure variable to hold a 64-bit effective-address value. Recall that such a variable is called an effective-address reference (EAR). These EARs are collected in a special section named .toe (table of EARs). When the SPE linkables are eventually linked into an SPE executable image, all .toe sections are tiled into the TOE segment of the SPE executable.

The Shadow SPE toe area is a space mirroring the TOE segment of the embedded SPE executable. For each EAR in the TOE segment, embedspu creates a matching PPE symbol reference in the shadow area. The PPE linker resolves and binds these PPE symbol references in the shadow area when the CESOF linkable is linked with the other PPE linkables.

To execute the embedded SPE executable image, the SPE loader first loads the SPE executable image into the local store. Then, it copies the resolved shadow area into the local store over the SPE TOE segment. This copy operation effectively binds all the EARs in the SPE program with the matching effective-address values.
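The two loader steps can be sketched as follows. This is a host-side model under stated assumptions, not the SPE loader from the SDK: spe_image_model, ear_entry, and load_and_bind are hypothetical names, the position of the 64-bit value inside the 128-bit entry is assumed, and memcpy stands in for the DMA transfers into the local store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One EAR entry: a 128-bit slot holding a 64-bit effective address.
 * Which half holds the address is an assumption of this sketch. */
typedef struct {
    uint64_t ea;
    uint64_t reserved;
} ear_entry;

/* Hypothetical description of an embedded SPE image and its TOE segment. */
typedef struct {
    const uint8_t *image;  /* read-only image, possibly shared */
    uint32_t image_size;   /* size of the image in bytes */
    uint32_t toe_offset;   /* offset of the TOE segment within the image */
    uint32_t toe_count;    /* number of EAR entries in the TOE segment */
} spe_image_model;

/* Step 1: copy the image into the local store.
 * Step 2: copy the resolved shadow area over the TOE segment there. */
static void load_and_bind(uint8_t *local_store, uint32_t ls_size,
                          const spe_image_model *img, const ear_entry *shadow)
{
    uint32_t toe_bytes = (uint32_t)(img->toe_count * sizeof(ear_entry));

    assert(img->image_size <= ls_size);
    assert(img->toe_offset <= img->image_size);
    assert(toe_bytes <= img->image_size - img->toe_offset);

    memcpy(local_store, img->image, img->image_size);
    memcpy(local_store + img->toe_offset, shadow, toe_bytes);
}
```

Note that the binding happens only in the local store copy. The embedded image itself is never written, which matters because that image may be shared.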

Different processes may have different effective-address mappings for the same effective-address symbol. The shadow area must therefore be placed in a .data section, which each process owns privately and does not share with other processes. The linker should not bind the values directly into an SPE executable image because the same image is shared by different processes running on the same system.

3. SPE program handle: CESOF provides a run-time data structure to keep the related CESOF information together. This data structure contains two pointers. One points to the section containing the embedded SPE executable image and the other points to the shadow area. These two pointer values are only resolved when the PPE linker links the CESOF linkable into the final executable image.

The embedspu tool also defines a symbol for this handle structure. The programmer can pass the handle to the SPE loader that will need the embedded SPE program image and the shadow area with the resolved EAR values.

CESOF programming example
Let's use some code examples to illustrate how CESOF helps the Cell programmer resolve and bind interacting SPE and PPE code modules.

I'll start with an SPE program called spe_hello. The SPE programmer uses the effective address of an array named foo, defined in a PPE code module, as the target of a DMA operation. The DMA operation simply transfers a “Hello World!” string into the foo array.

The SPE compiler and linker, included in the SDK released by IBM, generate SPE programs spanning the local store. Recall that the foo symbol, which resides in the effective-address space, can't be used directly by the SPE program.

Instead, CESOF requires the SPE program to declare a special kind of variable to hold the effective-address value of foo, not the foo object itself. The CESOF specification requires that such variables be represented by symbols having the _EAR_ prefix. Such variables can be declared “unsigned long long” for 64-bit effective-address values. (Actually, an EAR entry reserves a 128-bit memory space.) The special prefix signature allows an SPE compiler or a post-processing tool to generate the EAR entries. These EAR entries reside only in the .toe section. However, their shadows in the effective-address space in the CESOF linkable will later be resolved by the PPE linker with the effective addresses of the corresponding effective-address symbols.

In the SPE code example shown in Listing 1, _EAR_foo represents an EAR entry to be bound with the effective address of foo.


Listing 1 SPE code example

/*************************/
/* filename: spe_hello.c */
/*************************/
#include
/* here we declare an EAR for foo's EA */
extern unsigned long long _EAR_foo;

char hello_string[] = "Hello World!";

int main(long long spuid, char** argp, char** envp)
{
    /* Here we copy the string to the foo array in EA */
    copy_from_ls(_EAR_foo, hello_string, sizeof(hello_string));

    return 0;
}

/* this section may be generated automatically by an SPE compiler */
/*****************************/
/* filename: spe_hello_toe.s */
/*****************************/
.section .toe, "a", @progbits
.align 4
.global _EAR_foo
_EAR_foo:
.octa 0x0


These two files can be compiled and linked by the SPE compiler into an SPE executable file. Let's name it spe_hello.

$ spu-gcc -c spe_hello.c
$ spu-gas spe_hello_toe.s -o spe_hello_toe.o
$ spu-gcc -o spe_hello spe_hello.o spe_hello_toe.o

Note that, without the help of a higher-level SPE compiler, we would have to manually define an EAR entry and its associated EAR symbol (_EAR_foo) in the special .toe section by using the assembly file spe_hello_toe.s. This mechanical task can be done by a post-processing tool or, ideally, by a future higher-level SPE compiler with a special EAR keyword or attribute declaration.

In order for this spe_hello to participate in the linking and execution with the PPE code modules, let's embed it into a CESOF linkable.

$ embedspu spe_hello_handle spe_hello spe_hello_csf.o

We tell the embedspu tool to associate the symbol spe_hello_handle with the embedded SPE executable spe_hello. The tool creates a CESOF linkable called spe_hello_csf.o.

In addition to creating the embedded SPE executable image and the shadow area, the embedspu tool strips the _EAR_ prefix from the SPE EAR symbol names (in the spe_hello executable) to recover the matching PPE symbol names. These PPE symbol names are used to create the PPE symbol references in the shadow area. The order of the PPE symbol references in the shadow area must follow that of the matching EAR entries in the SPE TOE segment. Keeping the same order ensures that the EAR variables are bound to the correct effective-address values.
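The prefix stripping and the ordering requirement can be sketched together. The code below is only a model of the idea, not the embedspu implementation; ppe_symbol_from_ear and build_shadow_refs are hypothetical names, and the "shadow references" are represented simply as an array of PPE symbol names in TOE order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EAR_PREFIX "_EAR_"

/* Recover the PPE symbol name from an SPE EAR symbol name.
 * Returns a pointer into ear_symbol, or NULL if the name is not
 * of the form _EAR_<name>. */
static const char *ppe_symbol_from_ear(const char *ear_symbol)
{
    assert(ear_symbol != NULL);

    size_t prefix_len = strlen(EAR_PREFIX);
    if (strncmp(ear_symbol, EAR_PREFIX, prefix_len) != 0 ||
        ear_symbol[prefix_len] == '\0') {
        return NULL;
    }
    return ear_symbol + prefix_len;
}

/* Fill shadow_refs[i] with the PPE symbol for the i-th TOE entry, so the
 * shadow area keeps exactly the same order as the TOE segment.
 * Returns the number of references written, or -1 on a malformed name. */
static int build_shadow_refs(const char *const *toe_symbols, size_t count,
                             const char **shadow_refs)
{
    for (size_t i = 0; i < count; i++) {
        const char *ppe_name = ppe_symbol_from_ear(toe_symbols[i]);
        if (ppe_name == NULL) {
            return -1;
        }
        shadow_refs[i] = ppe_name;
    }
    return (int)count;
}
```

Since the loader later copies the shadow area over the TOE segment as one block, index i of the shadow must describe the same symbol as index i of the TOE; the sketch preserves that by construction.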

Listing 2 shows a piece of the PPE code that uses the new symbol spe_hello_handle.


Listing 2 PPE code that uses the symbol spe_hello_handle

/************************/
/* filename: ppe_main.c */
/************************/
#include <stdio.h>
#include <libspe.h>

/* the symbol "spe_hello_handle" is defined in spe_hello_csf.o */
extern spe_program_handle_t spe_hello_handle;

/* an EA symbol representing an array object */
char foo[512]; /* this is the foo array that spe_hello accesses */

int main()
{
    int rc, status;
    speid_t spe_id;

    /* load & start the spe_hello program on an SPE */
    spe_id = spe_create_thread(0, &spe_hello_handle, 0, NULL, -1, 0);

    /* wait for spe prog. to complete and return its status */
    rc = spe_wait(spe_id, &status, 0);

    printf("string from spe_hello: %s\n", foo);

    return status;
}


In the call to the spe_create_thread function, we use the symbol spe_hello_handle to refer to the wrapped SPE executable. The SPE loader then uses the handle to find the effective address of the real SPE executable image and loads that image into the target local store. Afterward, it copies the resolved shadow area over the TOE segment in the local store. This operation completes an SPE executable image with all the EAR entries bound to their corresponding symbol values (which are the effective addresses of the represented data objects).

After the SPE loader loads the bound SPE executable image into the local store, the run-time environment starts the execution of the SPE program.

Here are the steps used to create the final CBE application called cbe_main .

$ ppu-gcc -c ppe_main.c
$ ppu-gcc -o cbe_main ppe_main.o spe_hello_csf.o

Superior linking
CESOF defines a usage convention on top of the existing ELF specification so that an SPE executable can be wrapped into a PPE linkable in a logical and standardized manner. The CESOF linkable then allows the embedded SPE executable to participate in the linking and execution with other PPE linkables.

The CBE Linux Reference Implementation ABI v1.0 contains the specification of CESOF. The reference implementation of the tool chain (the embedspu tool, SPE linker, and PPE linker) and the CESOF examples are available in the CBE SDK published by IBM. Their implementation fully supports the current CESOF specification.

This standard establishes a foundation to build a higher-level SPE compiler capable of spanning the entire effective-address range. However, we will need additional language syntax to specify the location of a data object. We'll also need additional compiler run-time support to transfer the data to and from the local store for processing.

One possible extension to the current C language for SPE programming would be the support of effective-address variables. Such support can hide all the EAR details and DMA operations from the programmers. The two C compilers in the CBE SDK don't support such higher-level effective-address abstraction yet. However, an IBM XLC prototype making use of this mechanism has been working.

The standard allows other CBE programming frameworks to represent inter-architectural connections with the EAR entries; for example, the connectors for a streaming programming framework.

Other higher-level abstractions over the whole chip, for example single-source compilers, SPE software controlled cache, or SPE code overlay, can similarly benefit from this common infrastructure.

Alex Chunghen Chow is a senior programmer and a software development manager in IBM's Cell Processor Design Center (Austin, Texas). He leads a group of programmers developing the kernel, toolchain, workloads, libraries, demos, and samples for the Cell processor chip bring-up. He received BS and MS degrees in electronic engineering from National Chiao-Tung University, Taiwan, and a PhD in electrical and computer engineering from the University of Arizona. His fields of work include discrete event simulation, distributed and parallel simulation/programming, and object-oriented frameworks. His current focus is the framework and programming model of multicore programming environments.

Acknowledgements:
The author would like to thank Michael Day, Stan Gowen, Ulrich Weigand, Kevin K. O'Brien, Daniel Brokenshire, Taka Uchikawa, and Jimi Xenidis for their contributions to the CESOF specification and development, and, especially, Daniel Brokenshire and Ted Maeurer for providing their valuable comments toward this article.

For more on the Cell processor, see Jim Turley's “A glimpse inside the Cell processor”.

Reader Response


Great idea! Even the DMAs must be hidden from the programmers to make it more elegant.

The explanation and illustration are very good. This is a good article.

We did not know so much background and details about CESOF, though we were using it from the SDK.

– Vaidyanathan Srinivasan
IBM
Bangalore, India

