Embedded system design is a dance between software and hardware. The question is, which of the two gets to call the tune? Who leads? Who controls the relationship?
On the one hand, it is the hardware that really does the work. On the other hand, the ultimate functionality of the system tends to be in the software, with the hardware supporting that effort. Hardware and software engineers may have differing opinions as to which of those spins is more accurate.
In reality, system hardware and software are defined together at a high level. Board content, memory size, I/O, and other such pieces of support hardware are provided to address the needs of the functions to be executed by the software. But at a lower level, once a processing engine is picked, the on-chip computational resources are fixed.
Any changes to this fixed structure have to be managed through patching with separate accelerator chips. As soon as the processor decision is concrete, there is a shift from a process of designing hardware to meet the software needs to a process of working the software to ensure it fits within the constraints of the processor or processors (in the case of a multi-core implementation).
This is intensified in embedded design, where resources are relatively scarce and performance may be mission-critical. From here on, the hardware leads the dance, and the software follows. The software becomes an enabler in a codependent relationship that caters to the sometimes unreasonable needs of the hardware.
To minimize this pain, it is tempting to over-provision the hardware to avoid being trapped later; this can result in a system with more hardware resources than required, raising costs.
Alternatively, if the hardware is not over-provisioned, there is a risk that the needs of the software will exceed the available hardware and a larger processor must be swapped in later; the software will get tired of having its feet stepped on and will call for a new dance partner.
Most specialized processors like network processors have few family members, and the next increment up could be much larger than the small extra increment needed, raising the cost and wasting the extra resources. The alternative is to add more chips for external acceleration, adding an I/O bottleneck to processing.
An FPGA with multiple soft embedded processors provides a way to avoid this power imbalance. Because it's programmable, hardware resources can be tuned much more precisely to the needs of the software.
And because FPGAs have such wide applicability, the economics allow the offering of a fuller set of sizes, so that running out of room in one device means a more modest jump to the next device size. This can be easily illustrated by looking at packet processing subsystems, which often use specialized multi-core processing elements to meet the needs of multi-gigabit line rate processing.
In a discussion of multi-core processing, words like “core”, “processor”, and “engine” tend to get used a lot, sometimes synonymously, sometimes not. For clarity, this article will use the word “core” to refer specifically to a microprocessor, along with support logic or memory, that gets implemented multiple times in a multi-core configuration. The word “engine” will refer to the structure that is made up out of one or more cores. The word “processor” will be used in the traditional sense of a single CPU.
Picking apart a multicore packet processor
A packet processing engine commandeers several different kinds of hardware: ports, the fast path engine, the control plane processor, code and data store, accelerators, external memory controllers, and peripherals.
All of these elements can be built in an FPGA. In some cases the hardware cost is in logic gates, in some cases in memory, and in some cases in pins. Many cases involve a combination of these.
There is always the constraint that the sum total of the resources used must fall within what a given device provides, but the power comes in being able to dedicate lower-level resources like gates to one function or another, rather than having them locked to a function that may never be used.
In a multi-core or multiprocessor-based design, especially in packet processing systems, the benefits of a soft-core FPGA approach are most apparent in the fast path engine, in the control plane processor, and in how accelerators are deployed in such environments.
The fast path engine
The fast path of a packet processor is typically implemented in a multi-core fabric for highest performance. The number of cores required depends on the amount of functionality required, and varies from application to application. Even for a given application, such parameters as line utilization, the packet size for which performance is specified, and the mix of packets can affect the number of cores needed.
A fixed architecture allows access to a fixed set of cores, usually in a fixed configuration. The task for the user is then to partition the code in a way that balances the use of the cores while remaining within pipeline and code store limitations.
If the performance requirements are light, then some cores are un- or under-utilized. On the other hand, if it turns out that the number of cores provided is insufficient for the necessary performance and margin, then a larger device must be selected. Often this device will have double the number of cores.
Unless the initial core count estimates were grossly off, only one or perhaps two additional cores should be required to make performance. Doubling from four to eight cores, for example, would be significant overkill if only one extra core is needed.
A flexible multi-core fabric opens up a completely new way of approaching the design problem. Rather than trying to shoehorn software into a fixed number of processors, one can assign the number of cores according to the requirements of the software. Those requirements can be determined by understanding the cycle count profile of the code and the overall cycle budget.
Given that information, the number of required cores can be accurately estimated. Even if an extra core is needed later, it is a straightforward matter to add one to the design of the FPGA, given adequate resources in the device. And if the device has to be upgraded, it can be done without doubling the resource count.
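The core-count estimate described above amounts to simple arithmetic on the cycle profile and the line rate. The following is a minimal sketch of that calculation; the function name, the 20% margin default, and all of the numbers in the example are hypothetical, not taken from any particular design.

```python
import math

def cores_required(cycles_per_packet: int, packet_rate_mpps: float,
                   core_clock_mhz: float, margin: float = 0.2) -> int:
    """Estimate the number of cores needed: total cycle demand at line rate,
    divided by the cycles one core supplies, with headroom for margin."""
    demand = cycles_per_packet * packet_rate_mpps * 1e6  # cycles/second required
    supply = core_clock_mhz * 1e6                        # cycles/second per core
    return math.ceil(demand * (1 + margin) / supply)

# Hypothetical figures: 400 cycles of work per packet, 1.5 Mpps line rate,
# 200 MHz cores, 20% margin
print(cores_required(400, 1.5, 200))  # 4
```

With a fixed architecture this number would be rounded up to the next available device; with a soft-core fabric, this is simply the number of cores instantiated.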
The other benefit of a programmable fabric is that the pipeline can be created in a variety of configurations to balance the stages of the pipeline. It is hard to create stages that exactly split the cycle counts evenly.
The result is that some stages will require more cycles than others, putting the stages out of balance. In reality, a slow stage may require three parallel cores to meet cycle budget; an adjacent faster stage may only require two. As long as the interconnecting communication blocks can provide load balancing, this can be easily accommodated in an FPGA.
|Figure One. FPGAs allow irregular pipelines using load-balancing|
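The irregular-pipeline idea above reduces to sizing each stage independently against the cycle budget between packet arrivals. This sketch illustrates the arithmetic with hypothetical stage names and cycle counts; a real design would feed in measured cycle profiles.

```python
import math

def stage_cores(stage_cycles: list[int], budget: int) -> list[int]:
    """For each pipeline stage, compute the number of parallel cores needed
    so the stage's per-packet work fits within the cycle budget between
    packet arrivals (load balancing assumed in the interconnect)."""
    return [math.ceil(c / budget) for c in stage_cycles]

# Hypothetical stages: parse (150 cycles), classify (290), modify (180);
# a new packet arrives every 100 cycles.
print(stage_cores([150, 290, 180], 100))  # [2, 3, 2]
```

Note how the slow middle stage gets three parallel cores while its neighbors get two, exactly the kind of uneven allocation a fixed pipeline cannot provide.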
The control plane processor
Most network and communications processors have a processor that is intended for use as the control plane processor. Control plane code must be adapted to that processor.
While some FPGAs have built-in processors (the Xilinx Virtex 4 FX family includes one or two PowerPC processors), any FPGA can be used, and any processor can be connected to the FPGA.
An external control plane processor will take some board space and require I/O between chips, but the FPGA then doesn't have to include the processor, so this becomes a true tradeoff. With a fixed architecture, you get the control plane processor whether used or not, and there's no provided way to hook in an external processor.
Accelerators
Hardware acceleration is essential for pieces of many algorithms. There are two kinds of function that may have to be accelerated: (1) computationally intensive functions, like encryption or checksum calculation; and (2) long-latency items, like external memory access. The intent is to speed up slow items. With fixed-architecture solutions, however, there are a number of challenges:
* Any existing on-chip accelerators are there whether needed or not
* Accelerators typically have to be shared
* Accelerators not built-in must be generated on a separate chip
The first case is one of potential wasted silicon. As an example, some network processors will provide a checksum accelerator. If a particular application doesn't require checksums, then that bit of silicon adds cost but doesn't add value.
Assuming more than one core computes checksums in parallel, then all of those cores need access to the accelerator, so the accelerator is shared among them. Sharing an accelerator can be fine if the “duty cycle” of the accelerator with respect to the cycle budget is short.
But if that duty cycle is long, then overall performance will suffer due to contention between different cores trying to access the same accelerator. The engineer needs the flexibility to make decisions as to whether and how to share the accelerators.
|Figure Two. Shared accelerators can result in reduced performance due to contention|
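A rough first-order check for whether sharing is viable is to compare the accelerator's total demand against its capacity: if each of N cores occupies the shared unit for some fraction of its cycle budget, contention becomes inevitable once the summed duty cycles exceed 100%. This sketch formalizes that back-of-the-envelope test; the function name and figures are hypothetical.

```python
def accelerator_oversubscribed(num_cores: int, accel_cycles: int,
                               budget_cycles: int) -> bool:
    """A shared accelerator that is busy for accel_cycles out of each core's
    budget_cycles is over-subscribed once total demand exceeds 100% of its
    capacity; beyond that point cores stall waiting for access."""
    duty_cycle = accel_cycles / budget_cycles
    return num_cores * duty_cycle > 1.0

# Short duty cycle: 4 cores each using the accelerator 10 of every 100 cycles
print(accelerator_oversubscribed(4, 10, 100))  # False: sharing is fine
# Long duty cycle: the same 4 cores each needing it for 40 of every 100 cycles
print(accelerator_oversubscribed(4, 40, 100))  # True: contention will throttle throughput
```

In the over-subscribed case, the FPGA approach lets the designer simply instantiate a second (or private) accelerator instead of accepting the stalls.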
Another situation is one where the application requires checksums but at a rate faster than what the built-in accelerator can provide. Or perhaps the application doesn't require checksums but requires some other kind of acceleration. One must then go off-chip for acceleration, either using an off-the-shelf accelerator or building one in an FPGA.
Because of the overhead of going off-chip, such an accelerator typically must be shared, meaning that the performance takes a double hit due both to contention and to the overhead of going off-chip. Meanwhile the built-in accelerator goes unused.
With an FPGA, each of these issues can be addressed simply through the inherent flexibility of the technology, in that (1) only accelerators that are needed are created; (2) accelerators can be shared or not; and (3) accelerators exist on-chip, alongside the cores that use them.
So if an application requires no checksum, then no checksum accelerator is created. If an application requires checksums and can withstand sharing, sharing is possible; if sharing hurts performance, then multiple accelerators can be created, with each core having a private accelerator. No off-chip delays are required since the accelerator is created on the same FPGA.
|Figure Three. FPGAs allow any combination of the above multicore configurations|
FPGAs allow one more level of flexibility: the scheduling of the accelerated function. “Accelerators” in the traditional sense are synchronous: the core that called them waits for the result before continuing.
An alternative is a coprocessor, which is asynchronous: the calling core can work on other tasks while the coprocessor executes. There is room for confusion in the term “coprocessor”, however, in that it may appear to be a software-oriented unit. Therefore Teja uses the term “offload” to indicate a function that is handed from a core to a hardware unit. Offloads in FPGAs can be designed as synchronous or asynchronous; this is an engineering decision made at design time.
|Figure Four. FPGA offloads can be synchronous or asynchronous|
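The synchronous-versus-asynchronous distinction above can be modeled in software terms: a synchronous accelerator is an ordinary blocking call, while an asynchronous offload is a hand-off whose result is collected later. This sketch uses Python's `concurrent.futures` purely as an analogy for the two scheduling styles; the `checksum` function is a hypothetical stand-in for a hardware offload block.

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(payload: bytes) -> int:
    """Stand-in for a hardware checksum offload (hypothetical)."""
    return sum(payload) & 0xFFFF

offload = ThreadPoolExecutor(max_workers=1)  # models the offload block
packet = b"\x01\x02\x03"

# Synchronous accelerator: the calling core blocks until the result is ready.
result = checksum(packet)

# Asynchronous offload: the core hands off the work and keeps going.
future = offload.submit(checksum, packet)
# ... the core does other per-packet work here ...
result_async = future.result()  # collect the result only when it is needed

assert result == result_async
```

In the FPGA, the same choice is made structurally: the handshake between the core and the offload block is designed as blocking or non-blocking at design time.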
Hardware flexibility on a wide variety of fronts allows the best performance/cost trade-offs to be made. The control of those decisions lies in the hands of the engineer designing the system.
By keeping the bulk of the processing algorithm in software, and combining that with flexible hardware, fewer hard take-them-or-leave-them constraints are placed on the system, meaning that good engineering decisions will allow convergence to line rate much more quickly.
And because only hardware that is needed will be instantiated, cost can be controlled and traded off against performance explicitly by the designer. With more give and take, the dance between hardware and software becomes more accommodating, with hardware calling some tunes and software calling others: no control issues, no codependency, no lingering resentments, and the two can settle into a happy marriage for a long and useful life.
Bryon Moyer is VP of Product Marketing for Teja Technologies, Inc.