In August 2006, I attended the Girvan Workshop for the Cell Broadband Engine, and it's an experience I'll never forget. For two solid days, IBM engineers explained the processor's architecture, tools, and the many software libraries available for building Cell applications.
I was stunned, not only by the processor's extraordinary capabilities, but also by how much there was to learn: spulets and CESOF files, AltiVec and Single Instruction, Multiple Data (SIMD) math, debuggers and simulators. I did my best to comprehend it all, but most of the material flew over my head in a fierce whoosh.
When Sony released the PlayStation 3, I grabbed the first console off the shelf and ran home for a more thorough investigation. It was daunting at first. Then, as now, IBM's Software Development Kit provided a vast number of documents that covered three essential subjects: development tools, software libraries, and the processor itself. The docs were helpful, but there was no overlap or coordination between them. This is a serious problem because any practical Cell developer needs to understand these subjects as an integrated whole.
It took time before the whooshing sound dissipated, but when it did, I genuinely understood how to program the Cell's PowerPC Processor Unit (PPU) and Synergistic Processor Units (SPUs).
It wasn't that hard, really: just regular C/C++ and a set of communications mechanisms. Yet the blogs and discussion groups disagreed: to them, Cell programming was much too complex for normal developers to understand. However, they hadn't really given the Cell a chance; they saw the disjointed pieces, but not how they fit together.
Programming the Cell Processor is my best attempt to reduce the whoosh associated with Cell development. My goal is to tie together the Cell's tools, architecture, and libraries in a straightforward progression that appeals to intuition. And I've included many code examples so that you can follow the material in a hands-on fashion. To download the examples, go to http://informit.com/title.
Editor's Note: Reproduced by permission of the book's publisher, Pearson Education, Inc., this series of five articles describes how developers can use a collection of open source GCC development tools, the Linux operating system, and the Eclipse IDE to do development on the Cell multicore architecture. Part 1 starts on the next page and describes the basics of the Cell Processor architecture, introducing the developer briefly to the Cell Software Development Kit. Part 2 is on Building Applications for the Cell processor; Part 3 covers debugging the Cell processor; Part 4 covers simulating applications; and Part 5 is about the Cell SDK IDE, including Eclipse and the C/C++ development tooling, as well as detailing how to manage an SPU project with the Cell IDE.
In Randall Hyde's fine series of books, Write Great Code, one of his fundamental lessons is that, for optimal performance, you need to know how your code runs on the target processor. Nowhere is this truer than when programming the Cell processor.
It isn't enough to learn the C/C++ commands for the different cores; you need to understand how the elements communicate with memory and one another. This way, you'll have a bubble-free instruction pipeline, an increased probability of cache hits, and an orderly, nonintersecting communication flow between processing elements. What more could anyone ask?
Figure 1.1 below shows the primary building blocks of the Cell: the Memory Interface Controller (MIC), the PowerPC Processor Element (PPE), the eight Synergistic Processor Elements (SPEs), the Element Interconnect Bus (EIB), and the Input/Output Interface (IOIF). Each of these is explored in greater depth throughout the book, but for now, it's a good idea to see how they function individually and interact as a whole.
|Figure 1.1 The top-level anatomy of the Cell processor|
The Memory Interface Controller (MIC)
The MIC connects the Cell's system memory to the rest of the chip. It provides two channels to system memory, but because you can't control its operation through code, the discussion of the MIC is limited to this brief treatment. However, you should know that, like the PlayStation 2's Emotion Engine, the first-generation Cell supports connections only to Rambus memory.
This memory, called eXtreme Data Rate Dynamic Random Access Memory, or XDR DRAM, differs from conventional DRAM in that it makes eight data transfers per clock cycle rather than the usual two or four. This way, the memory can provide high data bandwidth without needing very high clock frequencies. The XDR interface can support different memory sizes; the PlayStation 3, for example, uses 256MB of XDR DRAM as its system memory.
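The trade-off between transfers per clock and clock frequency comes down to simple arithmetic, sketched here in C. The helper name `peak_bandwidth_bytes_per_sec` and the example figures in the note that follows (a 400MHz clock and a 4-byte-wide channel) are assumptions chosen for illustration, not values taken from the text above.

```c
#include <stdint.h>

/* Peak bandwidth = clock rate x transfers per clock x bytes per transfer.
 * Hypothetical helper; every input is an illustrative assumption. */
uint64_t peak_bandwidth_bytes_per_sec(uint64_t clock_hz,
                                      unsigned transfers_per_clock,
                                      unsigned bus_width_bytes)
{
    return clock_hz * transfers_per_clock * bus_width_bytes;
}
```

With the assumed figures, `peak_bandwidth_bytes_per_sec(400000000, 8, 4)` gives 12.8 billion bytes per second. A memory making only two transfers per clock over the same channel width would need a 1.6GHz clock to match it.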
The PowerPC Processor Element (PPE)
The PPE is the Cell's control center. It runs the operating system, responds to interrupts, and contains and manages the 512KB L2 cache. It also distributes the processing workload among the SPEs and coordinates their operation. Comparing the Cell to an eight-horse coach, the PPE is the coachman, controlling the cart by feeding the horses and keeping them in line.
As shown in Figure 1.2 below, the PPE consists of two operational blocks. The first is the PowerPC Processor Unit, or PPU. This processor's instruction set is based on the 64-bit PowerPC 970 architecture, used most prominently as the CPU of Apple Computer's Power Mac G5. The PPU executes PPC 970 instructions in addition to other Cell-specific commands, and is the only general-purpose processing unit in the Cell. This is why Linux is installed to run on the PPU and not on the other processing units.
|Figure 1.2 Structure of the PPE|
But the PPU can do more than just housekeeping. It contains IBM's VMX engine for Single Instruction, Multiple Data (SIMD) processing. This means the PPU can operate on groups of numbers (e.g., multiply two sets of four floating-point values) with a single instruction. The PPU's SIMD instructions are the same as those used in Apple's image processing applications, and are collectively referred to as the AltiVec instruction set.
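To make the "two sets of four floating-point values" example concrete, here is a minimal sketch in C. It is a portable stand-in, not PPU code: it relies on the GCC/Clang vector extension (`vector_size`) rather than the AltiVec types and intrinsics a real PPU program would use, and the names `vec4f`, `mul4`, and `mul4_scalar` are hypothetical.

```c
/* Four 32-bit floats packed into one 16-byte (128-bit) vector.
 * Uses the GCC/Clang vector extension as a stand-in for AltiVec. */
typedef float vec4f __attribute__((vector_size(16)));

/* Element-wise product of two four-float vectors. On hardware with a
 * suitable SIMD unit, the compiler can emit one vector multiply. */
vec4f mul4(vec4f a, vec4f b)
{
    return a * b;
}

/* Scalar equivalent for comparison: four separate multiplies. */
void mul4_scalar(const float a[4], const float b[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
}
```

Both functions compute the same four products; the difference is that the vector form expresses them as one operation on a 128-bit value, which is the shape of work the SIMD engine is built for.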
Another important aspect of the PPU is its capacity for simultaneous multithreading (SMT). The PPU allows two threads of execution to run at the same time, and although each receives a copy of most of the PPU's registers, they have to share basic on-chip execution blocks.
This doesn't provide the same performance gain as if the threads ran on different processors, but it allows you to maximize usage of the PPU resources. For example, if one thread is waiting on the PPU's memory management unit (MMU) to complete a memory write, the other can perform mathematical operations with the vector and scalar unit (VXU).
The second block in the PPE is the PowerPC Processor Storage Subsystem, or PPSS. This contains the L2 cache along with registers and queues for reading and writing data. The cache plays a very important role in the Cell's operation: not only does it perform the regular functions of an L2 cache, it's also the only shared memory bank in the device. Therefore, it's important to know how it works and maintains coherence.
The Synergistic Processor Element (SPE)
The PPU is a powerful processor, but it's the Synergistic Processor Unit (SPU) in each SPE that makes the Cell such a groundbreaking device. These processors are designed for one purpose only: high-speed SIMD operations. Each SPU contains two parallel pipelines that execute instructions at 3.2GHz.
In only a handful of cycles, one pipeline can multiply and accumulate 128-bit vectors while the other loads more vectors from memory. SPUs weren't designed for general-purpose processing and aren't well suited to run operating systems. Instead, they receive instructions from the PPU, which also starts and stops their execution.
The SPU's instructions, like its data, are stored in a unified 256KB local store (LS), shown in Figure 1.3 below. The LS is not cache; it's the SPU's own individual memory for instructions and data. This, along with the SPU's large register file (128 128-bit registers), is the only memory the SPU can directly access, so it's important to have a deep understanding of how the LS works and how to transfer its contents to other elements.
|Figure 1.3 Structure of the SPE|
The Cell provides hardware security (or digital rights management, if you prefer) by allowing users to isolate individual SPUs from the rest of the device. While an SPU is isolated, other processing elements can't access its LS or registers, but it can continue running its program normally. The isolated processor will remain secure even if an intruder acquires root privileges on the PPU.
Figure 1.3 above shows the Memory Flow Controller (MFC) contained in each SPE. This manages communication to and from an SPU, and by doing so, frees the SPU for crunching numbers. More specifically, it provides a number of different mechanisms for inter-element communication, such as mailboxes and channels.
The MFC's most important function is to enable direct memory access (DMA). When the PPU wants to transfer data to an SPU, it gives the MFC an address in system memory and an address in the LS, and tells the MFC to start moving bytes.
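One way to picture a mailbox is as a small first-in, first-out queue of 32-bit messages passed between processing elements. The sketch below is a software model only and is not the MFC's interface: the type `mailbox_t`, the functions `mbox_write` and `mbox_read`, and the depth of four entries are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MBOX_DEPTH 4  /* assumed queue depth, for illustration only */

/* Hypothetical software model of a mailbox: a tiny ring buffer. */
typedef struct {
    uint32_t slots[MBOX_DEPTH];
    unsigned head;   /* index of the oldest unread message */
    unsigned count;  /* number of messages currently waiting */
} mailbox_t;

/* Queues a message; returns false if the mailbox is full. */
bool mbox_write(mailbox_t *m, uint32_t msg)
{
    if (m->count == MBOX_DEPTH)
        return false;
    m->slots[(m->head + m->count) % MBOX_DEPTH] = msg;
    m->count++;
    return true;
}

/* Removes the oldest message; returns false if the mailbox is empty. */
bool mbox_read(mailbox_t *m, uint32_t *msg)
{
    if (m->count == 0)
        return false;
    *msg = m->slots[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->count--;
    return true;
}
```

The point of the model is the division of labor: the sender drops a short message and moves on, and the receiver picks it up when ready, so neither side has to stop computing to manage the exchange.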
Similarly, when an SPU needs to transfer data into its LS, it can not only initiate DMA transfers, but also create lists of transfers. This way, an SPU can access noncontiguous sections of memory efficiently, without burdening the central bus or significantly disturbing its processing.
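The idea behind a list of transfers can be sketched as a gather operation: each list entry names one region of system memory, and the regions are packed back to back into the local store. This is a plain-C simulation that uses `memcpy` in place of real DMA; the type `xfer_entry_t` and the function `gather_list` are hypothetical names, not part of any Cell library.

```c
#include <stddef.h>
#include <string.h>

/* One entry of a hypothetical transfer list. */
typedef struct {
    size_t offset; /* start of the region in simulated system memory */
    size_t size;   /* number of bytes to copy */
} xfer_entry_t;

/* Copies each listed region of mem into ls, one after another.
 * Returns the total bytes written, or 0 if an entry reads past the end
 * of mem or would overflow ls (earlier entries may already be copied). */
size_t gather_list(unsigned char *ls, size_t ls_size,
                   const unsigned char *mem, size_t mem_size,
                   const xfer_entry_t *list, size_t n)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        if (list[i].offset > mem_size ||
            list[i].size > mem_size - list[i].offset)
            return 0;  /* source region out of range */
        if (list[i].size > ls_size - written)
            return 0;  /* destination buffer too small */
        memcpy(ls + written, mem + list[i].offset, list[i].size);
        written += list[i].size;
    }
    return written;
}
```

A single call describes every region up front, which is what lets the hardware version proceed without the SPU issuing and waiting on each piece separately.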
The Element Interconnect Bus (EIB)
The EIB serves as the infrastructure underlying the DMA requests and inter-element communication. Functionally, it consists of four rings, two that carry data in the clockwise direction (PPE > SPE1 > SPE3 > SPE5 > SPE7 > IOIF1 > IOIF0 > SPE6 > SPE4 > SPE2 > SPE0 > MIC) and two that transfer data in the counterclockwise direction. Each ring is 16 bytes wide and can support three data transfers simultaneously.
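The ring ordering above determines how far apart two elements are in each direction. The following C sketch counts hops around the ring; it assumes the ring is closed, so that the MIC at the end of the list sits next to the PPE at the start. The names `eib_ring`, `eib_index`, `eib_hops_clockwise`, and `eib_hops_shortest` are hypothetical.

```c
#include <string.h>

#define EIB_NODES 12

/* Clockwise ordering of the elements, as listed in the text. */
static const char *const eib_ring[EIB_NODES] = {
    "PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
    "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"
};

/* Returns the ring position of a named element, or -1 if unknown. */
int eib_index(const char *name)
{
    for (int i = 0; i < EIB_NODES; i++)
        if (strcmp(eib_ring[i], name) == 0)
            return i;
    return -1;
}

/* Hops from src to dst when travelling clockwise. */
int eib_hops_clockwise(int src, int dst)
{
    return (dst - src + EIB_NODES) % EIB_NODES;
}

/* Hops from src to dst in whichever direction is shorter. */
int eib_hops_shortest(int src, int dst)
{
    int cw = eib_hops_clockwise(src, dst);
    int ccw = EIB_NODES - cw;
    return cw < ccw ? cw : ccw;
}
```

Under that assumption, the PPE and the MIC are eleven hops apart clockwise but only one hop apart counterclockwise, which illustrates why having rings in both directions matters.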
Each DMA transfer can hold payload sizes of 1, 2, 4, 8, and 16 bytes, and multiples of 16 bytes up to a maximum of 16KB. Each DMA transfer, no matter how large or small, consists of eight bus transfers (128 bytes).
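The payload-size rule is easy to get wrong in practice, so it helps to see it as a check. This is a minimal sketch of the rule as stated above; `dma_size_is_valid` and `dma_chunks_needed` are hypothetical helper names, not SDK functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define DMA_MAX_BYTES (16 * 1024)  /* 16KB upper limit from the text */

/* True if n is 1, 2, 4, or 8 bytes, or a multiple of 16 up to 16KB. */
bool dma_size_is_valid(size_t n)
{
    if (n == 1 || n == 2 || n == 4 || n == 8)
        return true;
    return n >= 16 && n <= DMA_MAX_BYTES && n % 16 == 0;
}

/* Number of transfers needed to move total bytes when each transfer
 * carries at most DMA_MAX_BYTES. */
size_t dma_chunks_needed(size_t total)
{
    return (total + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}
```

For instance, a 24-byte payload fails the check because it is neither one of the small sizes nor a multiple of 16, and a 40KB buffer has to be split into three transfers.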
The Input/Output Interface (IOIF)
As the name implies, the IOIF connects the Cell to external peripherals. Like the memory interface, it is based on Rambus technology: FlexIO. The FlexIO connections can be configured for data rates between 400MHz and 8GHz, and with the high number of connections on the Cell, its maximum I/O bandwidth approaches 76.8GB/s.
In the PlayStation 3, the I/O is connected to Nvidia's RSX graphics processor. The IOIF can be accessed only by privileged applications, and for this reason, interfacing with the IOIF lies beyond the scope of this book.
The CBE Software Development Kit
This book uses a hands-on approach to teach Cell programming, so the development tools are very important. The most popular toolset is IBM's Software Development Kit (SDK), which runs exclusively on Linux and provides many different tools and libraries for building Cell applications.
IBM provides the SDK free of charge, although some of the tools have more restrictive licensing than others. For the purposes of this book, the most important aspect of the SDK is the GCC-based toolchain for compiling and linking code.
The two compilers, ppu-gcc and spu-gcc, compile code for the PPU and SPU, respectively. They provide multiple optimization levels and can combine scalar operations into more efficient vector operations.
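As an illustration of scalar code that a vectorizing compiler has a good chance of combining into vector operations, consider the loop below. It is ordinary C99 with a hypothetical function name, `scale_add`. Whether a particular compiler actually vectorizes it depends on the compiler version and the optimization flags; in stock GCC the relevant options are `-O3` or `-ftree-vectorize`, and it is an assumption here that the Cell compilers behave the same way.

```c
#include <stddef.h>

/* Computes out[i] = k * a[i] + b[i] for n elements.
 * The restrict qualifiers promise that the three arrays do not overlap,
 * which removes an aliasing concern that can block vectorization. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = k * a[i] + b[i];
}
```

Each iteration is independent of the others, so the compiler is free to process several elements per instruction instead of one.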
The SDK also includes IBM's Full-System Simulator, tailored for Cell applications. This impressive application runs on a conventional computer and provides cycle-accurate simulation of the Cell processor, keeping track of every thread and register in the PPU and SPUs. In addition to basic simulation and debugging, it provides many advanced features for responding to processing events.
The SDK contains many code libraries to ease the transition from traditional programming to Cell development. It provides most standard C/C++ libraries for both the PPU and SPU, POSIX commands for the PPU, and a subset of the POSIX API on the SPU. Many of the libraries are related to math, but others can be used to profile an SPU's operation, maintain a software cache, and synchronize communication between processing units.
All of these tools and libraries can be accessed through the Cell SDK integrated development environment (IDE). This is an Eclipse-based graphical user interface for managing, editing, building, and analyzing code projects. It provides a powerful text editor for code entry, point-and-click compiling, and a feature-rich interface to the Cell debugger. With this interface, you can watch variables as you step through code and view every register and memory location in the Cell.
Some time ago, I had the pleasure of programming assembly language on a multicore digital signal processor, or DSP. The DSP performed matrix operations much, much faster than the computer on my desk, but there were two problems: I had to write all the routines for resource management and event handling, and there was no file system to organize the data. And without a network interface, it was hard to transfer data in and out of the device.
The Cell makes up for these shortcomings and provides many additional advantages. With SIMD processing, values can be grouped into vectors and processed in a single cycle. With Linux running on the PPE, memory and I/O can be accessed through a standard, reliable API. Most important, when all the SPEs crunch numbers simultaneously, they can process matrices at incredible speed.
The goal is to enable you to build applications with similar performance. As with the DSP, however, it's not enough just to know the C/C++ functions. You have to understand how the different processing elements work, how they're connected, and how they access memory. But first, you need to know how to use the tools.
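Having all the SPEs work on a matrix at once starts with deciding which rows each one gets. The sketch below shows one simple way to split the rows as evenly as possible; it is an illustration of the idea rather than a method taken from the book, and `row_range_t` and `rows_for_worker` are hypothetical names.

```c
#include <stddef.h>

/* A contiguous block of matrix rows assigned to one worker. */
typedef struct {
    size_t first_row;
    size_t row_count;
} row_range_t;

/* Splits total_rows across workers so that block sizes differ by at
 * most one row; the first (total_rows % workers) workers get the extra.
 * id must be in the range 0..workers-1 and workers must be nonzero. */
row_range_t rows_for_worker(size_t total_rows, unsigned workers, unsigned id)
{
    size_t base = total_rows / workers;
    size_t extra = total_rows % workers;
    row_range_t r;
    r.row_count = base + (id < extra ? 1 : 0);
    r.first_row = id * base + (id < extra ? id : extra);
    return r;
}
```

With 100 rows and eight workers, the first four workers receive 13 rows each and the remaining four receive 12, so no worker is left idle while another finishes a much larger share.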
Next in Part 2, Building Applications for the Cell Processor.
Matthew Scarpino lives in the San Francisco Bay area and develops software to interface embedded devices. He has a master's degree in electrical engineering and has spent more than a decade in software development. His experience includes computing clusters, digital signal processors, microcontrollers, field-programmable gate arrays, and, of course, the Cell Processor.
This series of articles is reproduced from the book “Programming the Cell Processor”, Copyright © 2009, by permission of Pearson Education, Inc. Written permission from Pearson Education, Inc. is required for all other uses.