
Designing custom embedded multicore processors

There are “multi” paths a designer can take to get the needed performance.

As embedded applications have proliferated, increasing performance demands have outstripped the ability of conventional single processors to provide effective solutions. The high clock speeds needed to achieve the necessary performance require increasingly expensive semiconductor process technologies, precision board layout and manufacturing, and sophisticated heat removal to handle the increased power demands of such devices. Embedded designers have turned, instead, to multiprocessors, either combining several conventional processors or augmenting a processor with an application-specific coprocessor or a DSP.

The core-based design approach has fueled this movement to multiprocessor designs. The use of cores has simplified the creation of custom ASICs as well as application-specific standard products (ASSPs) that integrate combinations of processors, coprocessors, and DSPs into one device, and such products have proliferated. At the same time, programmable logic has adopted the core-based approach, offering development teams substantial libraries of soft-core processors and other sophisticated functions for rapid development of integrated multiprocessor designs. The diversity of choice has given developers a range of cost and performance options for system development.

Multiprocessor design for embedded systems, whether implemented in discrete devices or as a multicore approach, uses many architectures. Perhaps surprisingly, the original approach of symmetric multiprocessing (SMP), where a collection of identical processors uses sophisticated software partitioning and scheduling to address various software tasks, hasn't been widely used in embedded systems. SMP proved useful in desktop computing and workstation applications where the system must be prepared to handle a wide range of applications software. Embedded systems, however, operate under much tighter power, cost, and size constraints, for which SMP is generally unsuitable; they are better served by other architectures.

Architectures for embedded multiprocessing
One common approach to multiprocessing in embedded control applications is the use of independent processors, each dedicated to performing a single function. A typical system, like the one shown in Figure 1, would have a main processor handling the application code while secondary processors handle system functions. Thus, the main processor might concentrate on receiving and processing data while the secondary processors monitor equipment status and control the power and cooling functions. This architecture best suits situations where the various tasks need little, if any, coordination, and where removing these tasks from the main processor frees up enough capacity for it to handle the main application single-handedly.
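To make the division of labor concrete, a rough sketch of the firmware on one such dedicated secondary processor appears below; the register addresses, thresholds, and status-mailbox layout are hypothetical, chosen only to illustrate how small such a monitor-and-report loop can be.

#include <stdint.h>

/* Hypothetical memory-mapped peripherals on the secondary processor. */
#define TEMP_SENSOR_REG  (*(volatile uint32_t *)0x40000000u)  /* degrees C x 10     */
#define FAN_PWM_REG      (*(volatile uint32_t *)0x40000010u)  /* 0..255 duty cycle  */
#define STATUS_MAILBOX   (*(volatile uint32_t *)0x40001000u)  /* polled by main CPU */

#define TEMP_WARN 650u   /* 65.0 C */
#define TEMP_SHUT 850u   /* 85.0 C */

int main(void)
{
    for (;;) {
        uint32_t temp = TEMP_SENSOR_REG;

        /* Simple proportional fan control. */
        FAN_PWM_REG = (temp > TEMP_WARN) ? 255u : (temp * 255u) / TEMP_WARN;

        /* Publish one status word; the main processor polls only this word
           and never runs any of the thermal-management code itself. */
        STATUS_MAILBOX = (temp >= TEMP_SHUT) ? 2u   /* overtemp */
                       : (temp >= TEMP_WARN) ? 1u   /* warning  */
                       : 0u;                        /* nominal  */
    }
}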


The consumer product example shown in Figure 2 uses a variation of the multiple distributed processor approach to handle a complex application with tasks that exchange substantial amounts of data but are otherwise independent. The approach allows the assignment of individual processors to major tasks that would otherwise be running on one embedded processor. Instead of using a single high-performance processor, the approach uses a collection of processors, each matched in performance to the task's requirements. The benefits derived include lower power consumption, better design reuse, reduced software complexity, better software maintainability, and simpler software debug.


Channelization, as shown in Figure 3, is another embedded multiprocessing design approach, one designed to achieve high data throughput for specific operations. Multiple processors in a single chip, each dedicated to handling a portion of the overall channel throughput, may run the exact same code or change algorithms on the fly to adapt to system requirements. A master processor handles general housekeeping chores such as system initialization, statistics gathering, and error handling. This approach offers scalability to extremely high performance levels by increasing the number of channels.
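A minimal sketch of the channel-processor side of such a design is shown below; the descriptor layout and queue functions are assumptions rather than any particular vendor's API. Every channel processor runs the same loop, parameterized only by the range of channels it owns, which is why scaling up throughput is largely a matter of instantiating more copies.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical work descriptor handed out by the master processor. */
struct channel_pkt {
    uint16_t channel;          /* logical channel this packet belongs to */
    uint16_t len;
    uint8_t  payload[256];
};

/* Assumed queue primitives; in a real design these would map onto
   hardware FIFOs or shared-memory rings set up by the master CPU. */
extern struct channel_pkt *rx_queue_pop(uint16_t first_ch, uint16_t last_ch);
extern void tx_queue_push(struct channel_pkt *pkt);
extern void process_packet(struct channel_pkt *pkt);  /* the per-channel algorithm */

/* Identical code runs on every channel processor; only the range differs. */
void channel_processor_main(uint16_t first_ch, uint16_t last_ch)
{
    for (;;) {
        struct channel_pkt *pkt = rx_queue_pop(first_ch, last_ch);
        if (pkt != NULL) {
            process_packet(pkt);
            tx_queue_push(pkt);
        }
    }
}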

The link-controller approach uses multiple processors to simplify the operation of a system data network by serving as link controllers between the node and the network. In this architecture, the processors regulate access to network resources by other units. The link controllers don't manipulate the data; they simply serve as the conduit between a data processing unit and system memory. They thus offload the network interface functions from the data processing units, simplifying the design of these other system elements.

One common multiprocessing architecture used in embedded systems design, however, employs a coprocessor to offload a critical task from the system's main application processor. The offloaded task may be compute-intensive, such as a discrete cosine transform (DCT) for image processing, or I/O intensive, such as multichannel data acquisition. Unlike the independent processor approach, however, the coprocessor approach requires substantial data exchange among the processors. This may involve shared memory resources, a high-bandwidth data link between processors, or some combination of the two.
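Using the DCT example, a shared-memory handoff between the main processor and the coprocessor might look roughly like the sketch below; the buffer layout, doorbell register, and addresses are illustrative assumptions, not a particular product's interface.

#include <stdint.h>
#include <string.h>

/* Hypothetical shared-memory region visible to both processors. */
struct dct_job {
    volatile uint32_t ready;   /* 1 = input valid, set by main CPU     */
    volatile uint32_t done;    /* 1 = output valid, set by coprocessor */
    int16_t input[64];         /* one 8x8 block of pixels              */
    int16_t output[64];        /* DCT coefficients                     */
};

#define SHARED_JOB      ((struct dct_job *)0x20080000u)      /* assumed address   */
#define COPRO_DOORBELL  (*(volatile uint32_t *)0x40002000u)  /* wakes coprocessor */

/* Main-processor side: hand the coprocessor one block and collect the result. */
void dct_offload(const int16_t block[64], int16_t coeffs[64])
{
    struct dct_job *job = SHARED_JOB;

    memcpy(job->input, block, sizeof(job->input));
    job->done  = 0;
    job->ready = 1;
    COPRO_DOORBELL = 1;        /* interrupt or wake the coprocessor */

    while (!job->done)         /* in practice, block on an interrupt instead */
        ;

    memcpy(coeffs, job->output, sizeof(job->output));
}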

Coprocessor approaches
There are many ways to implement the coprocessor approach. One is to use an ordinary CPU as the additional processor. This CPU can be a fixed device or implemented as a soft core on an FPGA. Developers program the device to handle whatever tasks need to be off-loaded from the main processor.

A second approach is to use application-specific logic as the coprocessor. Common examples of this approach are the use of a graphics coprocessor to drive high-performance displays or a DSP to handle audio or image processing. As with the ordinary CPU, developers must program the coprocessor to handle the off-loaded task.

A third approach uses hard-wired logic for high-speed execution of a specific operation. The logic can be fixed in silicon or programmed into an FPGA, but it doesn't need any software. While the logic serves the function of a coprocessor, it's more in the nature of algorithmic intellectual property (IP) than a CPU.

The algorithmic IP approach, sometimes called “hardware acceleration,” is applicable to a wide variety of applications. As shown in Figure 4, the “coprocessor” can be as simple as a Viterbi decoder used in high-speed communications. In more complex systems, multiple coprocessors may be used as a graphics engine composed of several different, application-specific acceleration blocks. A well-known example of the algorithmic IP approach is the Freescale QUICCEngine communications coprocessor. This device provides hard-wired logic for implementing a number of different communications protocols to relieve an applications processor from the details of data transfers.
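From the application processor's point of view, driving an acceleration block of this kind usually reduces to a handful of register accesses. The register map below is purely illustrative; it is not the interface of the QUICCEngine or of any particular decoder core.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical register map of a memory-mapped Viterbi decoder block. */
#define VIT_BASE      0x40003000u
#define VIT_SRC_ADDR  (*(volatile uint32_t *)(VIT_BASE + 0x00u))
#define VIT_DST_ADDR  (*(volatile uint32_t *)(VIT_BASE + 0x04u))
#define VIT_LENGTH    (*(volatile uint32_t *)(VIT_BASE + 0x08u))
#define VIT_CTRL      (*(volatile uint32_t *)(VIT_BASE + 0x0Cu))  /* bit 0: start */
#define VIT_STATUS    (*(volatile uint32_t *)(VIT_BASE + 0x10u))  /* bit 0: done  */

/* Decode 'len' soft symbols; the heavy lifting happens entirely in logic,
   so the CPU can do other work (or, as here, simply wait for completion). */
void viterbi_decode(const uint8_t *soft_in, uint8_t *bits_out, size_t len)
{
    VIT_SRC_ADDR = (uint32_t)(uintptr_t)soft_in;
    VIT_DST_ADDR = (uint32_t)(uintptr_t)bits_out;
    VIT_LENGTH   = (uint32_t)len;
    VIT_CTRL     = 1u;                 /* start the block */

    while ((VIT_STATUS & 1u) == 0u)    /* poll; an interrupt is the usual choice */
        ;
}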


Compared with other multicore implementations, the algorithmic IP approach lets programmable logic shine. Custom silicon is increasingly expensive to develop while ASSPs are inflexible when designers want to differentiate their product or respond to changing market requirements. In addition, both approaches require significant development time that can delay market entry while waiting for silicon to become available. Implementation in an FPGA has neither drawback. It allows rapid implementation of custom designs that let developers leverage their specific expertise.

Typically, for a given process technology, a CPU implemented in an FPGA won't operate at as high a clock speed as a dedicated device. The algorithmic approach can effectively bypass this perceived limitation, however. By analyzing the application code and looking for performance bottlenecks, developers can off-load processing-intensive tasks into FPGA logic rather than keeping them as software running on the CPU.

Tools are available to quickly convert time-critical ANSI C code into custom hardware accelerators, letting the developer rapidly explore hardware-software tradeoffs. Such off-loading captures not only the obvious algorithmic tasks that would go to a coprocessor but also unexpected bottlenecks that wouldn't be cost-effective to implement in an ASIC or ASSP yet are readily incorporated into FPGA-based designs. This approach yields a multicore design exactly tuned to the application software. The result, even with a modest clock speed, can be an increase in overall system performance compared with a conventional processor and coprocessor.
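The code such tools target is typically a tight, self-contained inner loop like the FIR filter below; it's an illustrative candidate, not taken from any particular application. Profiling flags the function as a bottleneck, the tool generates an equivalent hardware accelerator, and the software call is replaced with an invocation of that block.

#include <stdint.h>
#include <stddef.h>

#define NUM_TAPS 64

/* Plain ANSI C inner loop, a typical candidate for conversion into a
   hardware accelerator: fixed trip count, simple arrays, no pointer
   chasing through complicated data structures. */
int32_t fir_sample(const int16_t coeff[NUM_TAPS],
                   const int16_t history[NUM_TAPS])
{
    int32_t acc = 0;
    for (size_t i = 0; i < NUM_TAPS; i++)
        acc += (int32_t)coeff[i] * (int32_t)history[i];
    return acc >> 15;    /* Q15 scaling */
}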

An FPGA's flexibility in implementing algorithmic IP can be of particular benefit when the application requires the ability to run different algorithms depending on operating mode. Developers can analyze each mode and prepare an FPGA configuration that optimizes the system for its algorithm. Because the FPGA is reprogrammable, the system can select among and implement the various configurations as needed. There's no compromise.
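At the system-software level, selecting among those configurations can be as simple as the sketch below; fpga_load_config() stands in for whatever configuration mechanism the target device actually provides and is a hypothetical helper, not a specific vendor call.

/* Hypothetical configuration images, one per operating mode, each built
   with the FPGA logic optimized for that mode's algorithm. */
extern const unsigned char cfg_audio_mode[];
extern const unsigned char cfg_video_mode[];
extern int fpga_load_config(const unsigned char *image);   /* assumed helper */

enum op_mode { MODE_AUDIO, MODE_VIDEO };

int select_mode(enum op_mode mode)
{
    /* Reprogram the fabric so the hardware accelerators match the mode. */
    return fpga_load_config(mode == MODE_AUDIO ? cfg_audio_mode
                                               : cfg_video_mode);
}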

Hardware-software inversion
One reason the algorithmic IP approach to multicore design is growing in popularity is that increasing system complexity threatens to invert the traditional hardware/software partitioning. Software is becoming the major component of embedded system design, and it's getting larger and more difficult to create, as shown in Figure 5. In addition, analysis of large programs shows that error rates climb as code size grows, so software costs rise disproportionately with code size. Further, teams are becoming reluctant to alter system software once it's debugged and running; the risk of introducing crippling errors is too high.


Hardware, on the other hand, is becoming easier to create. With the availability of tested and debugged cores, as well as the growing capacity of programmable logic, what was once cast in concrete at the beginning of a project is now amenable to change late in the design. In many cases, it has become easier to add or alter hardware to handle system modifications than to change the programming. Hardware has become “softer” while software has become more inflexible.

As a result of this inversion, developers are choosing design approaches that help reduce the size of system software. The traditional multicore approach of multiple processors helps by partitioning the code into smaller blocks, but runs the risk of increasing code size to handle interprocess communications between the cores. It also complicates code debugging because of the need to provide simultaneous coordinated control of multiple processors.

The algorithmic IP approach, however, reduces code size by implementing functions in hardware rather than software while adding no communications overhead. The processor simply hands the hardware the data and picks up the results. In many cases, the hardware can be the last stage of processing, so even the pickup step is eliminated. Because the coprocessor functions are in hardware, there are no additional software streams to coordinate, and debugging is easier.

Standards needed to ease design
Implementing multiprocessor designs as multicore devices, whether incorporating multiple CPUs or using algorithmic IP, still presents some challenges. One of these is a relative lack of standards for interfacing the various processors or cores. The industry has made some efforts in this arena by developing common interfaces, and connecting cores has become easier, but the “standards” still need work. In practice, the interfaces are similar yet still require effort to make the cores work well together. Until true standards emerge, multicore systems will continue to only “mostly fit” together.

One way to improve the existing situation is to use development tools that ease the pain of adapting interfaces to accommodate minor differences. System-level tools can automate the task of wiring multiple processors together. This helps eliminate errors and lets hardware designers focus on their unique additions to the project.

Even with tools available to ease the situation, these interface standards merit refinement, and the payoff from doing so can be significant. Well-defined and well-known standards help decouple engineering efforts. Design teams can develop system blocks independently with reasonable assurance that the pieces will fit together when complete. In today's industry of engineering teams scattered worldwide, this decoupling can pay handsome dividends by reducing travel and collaboration costs.

One area that still needs standardization is processor-to-processor interaction. Operating systems offer methods for scheduling tasks, sharing data among tasks, and protecting tasks from one another. Having equivalent, predefined hardware structures can greatly improve design efficiency. Standard “components” could simplify processor-to-processor communications and processor synchronization. They could also help ease software debug by providing a mechanism for starting, stopping, and stepping processors in an orderly fashion.
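As one example of what such a predefined structure might look like, a hardware mutex can be exposed to every processor as a single memory-mapped register; the address and read/write semantics below are assumptions made for illustration.

#include <stdint.h>

/* Hypothetical hardware mutex: a read returns 0 only when the lock was free
   and has now been granted to the reading CPU; writing 0 releases it. The
   test-and-set happens in logic, so no software protocol between the
   processors is required. */
#define HW_MUTEX  (*(volatile uint32_t *)0x40004000u)

static void hw_mutex_lock(void)
{
    while (HW_MUTEX != 0u)
        ;                        /* spin until the hardware grants the lock */
}

static void hw_mutex_unlock(void)
{
    HW_MUTEX = 0u;
}

/* Any processor can then update shared state safely. */
void update_shared_counter(volatile uint32_t *counter)
{
    hw_mutex_lock();
    (*counter)++;
    hw_mutex_unlock();
}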

Software interfaces can also use some standardization. Replacing a large, multitasking program running on one processor with multiple processors running a subset of the tasks requires partitioning of the software and developing communications mechanisms that allow tasks to exchange information. Having standard interfaces for intertask data exchanges and processor messaging could pay large dividends in developer productivity.
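Such a standard interface need not be elaborate. Something on the order of the hypothetical API sketched below, implemented once over shared memory or hardware FIFOs, would let tasks be repartitioned across processors without rewriting their communication code; the names and signatures are illustrative, not an existing standard.

#include <stddef.h>
#include <stdint.h>

/* Sketch of a minimal, transport-neutral messaging interface. */
typedef struct msg_endpoint msg_endpoint_t;

msg_endpoint_t *msg_open(uint32_t node_id, uint32_t port);
int  msg_send(msg_endpoint_t *ep, const void *buf, size_t len);   /* blocking */
int  msg_recv(msg_endpoint_t *ep, void *buf, size_t max_len);     /* blocking */
void msg_close(msg_endpoint_t *ep);

/* A task written against this interface doesn't care whether its peer runs
   on the same processor, on another core, or across a backplane. */
int report_status(uint32_t status_word)
{
    msg_endpoint_t *ep = msg_open(1u, 7u);     /* node 1, port 7 (arbitrary) */
    if (ep == NULL)
        return -1;
    int rc = msg_send(ep, &status_word, sizeof(status_word));
    msg_close(ep);
    return rc;
}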

The embedding of standard debugging facilities can also support productivity. When looking for bottlenecks that impair system performance, teams must be able to measure such things as interface bandwidth, arbitration effectiveness, cache misses, and the like. Structures to capture that kind of detail work best when built into the hardware, with links out to external tools for analysis and control. For tool vendors to create such external tools, however, standard functions and interfaces must be implemented.

Standards for coordinating and exchanging information across a range of tools would also be useful. As shown in Figure 6, various tools are needed to debug a multicore design. A standardized means of exchanging information among them can significantly enhance designer productivity. Developers using FPGAs have coordinated toolsets available, but traditional core-based design tools need improvement.


As a senior product manager, Bob Garrett is responsible for Altera's IP-based embedded processor solutions, including the Nios II microprocessors, embedded software tools, and the C-to-hardware acceleration compiler.
