Down & dirty with HW/SW co-design: Part 1 - Reviewing the fundamentals. -

Down & dirty with HW/SW co-design: Part 1 – Reviewing the fundamentals.

Embedded computing systems must meet tight cost, power consumption, and performance constraints. If one design requirement dominated, life would be much easier for embedded system designers-they could use fairly standard architectures with easy programming models.

But because the three constraints must be met simultaneously, embedded system designers have to mold hardware and software architectures to fit the needs of applications. Specialized hardware helps to meet performance requirements for lower energy consumption and at less cost than would be possible from a general-purpose system.

As we have seen, embedded computing systems are often heterogeneous multiprocessors with multiple CPUs and hardwired processing elements (PEs). In co-design, the hardwired PEs are generally called accelerators. In contrast, a co-processor is controlled by the execution unit of a CPU.

Hardware/software co-design is a collection of techniques that designers use to help them create efficient application-specific systems.

If you don't know anything about the characteristics of the application, then it is difficult to know how to tune the system's design. But if you do know the application, as the designer, not only can you add features to the hardware and software that make it run faster using less power.

But you also can remove hardware and software elements that do not help with the application at hand. Removing excess components is often as important as adding new features.

As the name implies, hardware/software co-design means jointly designing hardware and software architectures to meet performance, cost, and energy goals. Co-design is a radically different methodology than the layered abstractions used in general-purpose computing.

Because co-design tries to optimize many different parts of the system at the same time, it makes extensive use of tools for both design analysis and optimization.

Increasingly, hardware/software co-design is being used to design nonembedded systems as well. For example, servers can be improved with specialized implementations of some of the functions on their software stack. Co-design can be applied to Web hosting just as easily as it can be applied to multimedia.

In this series we will first take a brief look at some hardware platforms that can be used as targets for hardware/software co-design, followed by examination of performance analysis, hard ware/software co-synthesis and finally, hardware/software cosimulation.

Design Platforms
Hardware/software co-design can be used either to design systems from scratch or to create systems to be implemented on an existing platform. The CPU+ accelerator architecture is one common co-design platform. A variety of different CPUs can be used to host the accelerator.

The accelerator can implement many different functions; furthermore, it can be implemented using any of several logic technologies. These choices influence design time, power consumption, and other important characteristics of the system.

The co-design platform could be implemented in any of several very different design technologies.

1) A PC-based system with the accelerator housed on a board plugged into the PC bus. The plug-in board can use a custom chip or a field programmable gate array (FPGA) to implement the accelerator. This sort of system is relatively bulky and is most often used for development or very low-volume applications.

2) A custom-printed circuit board , using either an FPGA or a custom integrated circuit for the accelerator. The custom board requires more design work than a PC-based system but results in a lower-cost, lower-power system.

3) A platform FPGA that includes a CPU and an FPGA fabric on a single chip. These chips are more expensive than custom chips but provide a single-chip implementation with one or more CPUs and a great deal of custom logic.

4) A custom integrated circuit, for which the accelerator implements a function in less area and with lower power consumption. Many embedded systems-on-chips (SoC) make use of accelerators for particular functions.

The combination of a CPU plus one or more accelerators is the simplest form of heterogeneous platform; hardware/software partitioning targets such platforms. The CPU is often called the host. The CPU talks to the accelerator through data and control registers in it. These registers allow the CPU to monitor the accelerator's operation and to give it commands.

The CPU and accelerator can also communicate via shared memory. If the accelerator needs to operate on a large volume of data, it is usually more efficient to leave the data in memory and to have the accelerator read and write memory directly rather than to have the CPU shuttle data from memory to accelerator registers and back. The CPU and accelerator synchronize their actions.

More general platforms are also possible. We can use several CPUs, in contrast to the single processor attached to the accelerators. We can generalize the system interconnect from a bus to more general structures.

Plus, we can create a more complex memory system that provides different types of access to different parts of the system. Co-designing such types of systems is more difficult, particularly when we do not make assumptions about the structure of the platform.

Three examples below describe several different co-design platforms that use FPGAs in different ways.

Example 7-1. The Xiilnx Virtex-4 FX Platform FPGA Family
The Xilinx Virtex-4 family [Xi105] is a platform FPGA that comes in several different configurations. The higher-end FX family includes one or two PowerPC processors, multiple Ethernet MACS, block RAM, and large arrays of reconfigurable logic.

The PowerPC is a high-performance 32-bit RISC machine with a five-stage pipeline, 32 general-purpose registers, and separate instruction and data caches. The FPGA fabric is built on configurable logic blocks (CLBs) that use lookup tables and a variety of other logic.

The largest Virtex-4 provides up to 200,000 logic cells. The CLBs can be used to implement high-speed adders. A separate set of blocks includes 18 x 18 multipliers, an adder, and a 48-bit accumulator for DSP operations. The chip includes a number of RAM blocks that can be configured to a variety of depth and width configurations.

The PowerPC and FPGA fabric can be tightly integrated. The FPGA fabric can be used to build bus-based devices. An auxiliary processor unit allows custom PowerPC instructions to be implemented in the FPGA fabric.

In addition, processor cores can also be implemented in the FPGA fabric to build heterogeneous multiprocessors. Xilinx provides the MicroBlaze processor core and other cores can be used as well.

Example 7-2. The ARM Integrator Logic Module
The ARM Integrator is a series of evaluation boards for the ARM processor. The Integrator Logic Module [ARM00] is an FPGA accelerator board that plugs into the ARM Integrator motherboard.

The logic module provides a Xilinx FPGA for reconfigurable logic. The FPGA interfaces to the ARM AMBA bus. The logic module board does not contain its own SRAM-the FPGA can use the AMBA bus to connect to the SRAM and I/O devices contained on other boards.

Example 7-3. The Annapolis Micro Systems WILDSTAR II Pro
The WILDSTAR II Pro is a PCI bus card that provides FPGA logic on the bus of a PC or other PCI device. The card hosts one or two Virtex II Pro FPGAs that can connect directly to the PCI bus.

It also hosts up to 96 MB of SRAM and 256 MB of SDRAM. The card is organized as shown in the figure. A development environment simplifies system design for the board.

During co-synthesis, we need to evaluate the hardware being designed. High-level synthesis was developed to raise the level of abstraction for hardware designers, but it is also useful as an estimation tool for co-synthesis of hardware and software.

This section surveys some high-level synthesis algorithms to provide a better understanding of the costs of hardware and fast algorithms for area and performance estimation. We also look at techniques specifically developed to estimate the costs of accelerators in heterogeneous multiprocessors.

High-Level Synthesis. High-level synthesis starts from a behavioral description of hardware and creates a register-transfer design. High-level synthesis schedules and allocates the operations in the behavior as well as maps those operations into component libraries.

Figure 7-1 An example of behavior specification and register-transfer implementation.

Figure 7-1 above shows a simple example of a high-level specification and one possible register-transfer implementation. The data dependency edges carry variable values from one operation or from the primary inputs to another.

The register-transfer implementation shows which steps need to be taken to turn the high-level specification, whether it is given as text or a data flow graph, into a register-transfer implementation:

Operations have been scheduled to occur on a particular clock cycle. Variables have been assigned to registers. Operations have been assigned to function units. Some connections have been multiplexed to save wires.

In this case, it has been assumed that we can execute only one operation per clock cycle. A clock cycle is often called a control step or time step in high level synthesis.

We use a coarser model of time for high-level synthesis than is used in logic synthesis. Because we are farther from the implementation, delays cannot be predicted as accurately, so detailed timing models are not of much use in high-level synthesis. Abstracting time to the clock period or some fraction of it makes the combinatorics of scheduling tractable.

Allocating variables to registers must be done with care. Two variables can share a register if their values are not required at the same time-for example, if one input value is used only early in a sequence of calculations and another variable is defined only late in the sequence.

But if the two variables are needed simultaneously they must be allocated to separate registers. Sharing either registers or function units requires adding multiplexers to the design.

For example, if two additions in the data flow graph are allocated to the same adder unit in the implementation, we use multiplexers to feed the proper operands to the adder.

The multiplexers are controlled by a control FSM, which supplies the select signals to the muxes. (In most cases, we don't need de-multiplexers at the outputs of shared units because the hardware is generally designed to ignore values that aren't used on any given clock cycle. ) Multiplexers add three types of cost to the implementation:

1. Delay, which may stretch the system clock cycle.
2. Logic, which consumes area on the chip.
3 . Wiring, which is required to reach the multiplexer, also requires area.

Sharing hardware isn't always a win. For example, in some technologies, adders are sufficiently small that you gain in both area and delay by never sharing an adder. Some of the information required to make good implementation decisions must come from a technology library, which gives the area and delay costs of some components.

Other information, such as wiring cost estimates, can be made algorithmically. The ability of a program to accurately measure implementation costs for a large number of candidate implementations is one of the strengths of high-level synthesis algorithms.

When searching for a good schedule, the as-soon-as-possible (ASAP) and as-late-as-possible (ALAP) ones are useful bounds on schedule length.

A very simple heuristic that can handle constraints is first-come-first-served (FCFS) scheduling. FCFS walks through the data flow graph from its sources to its sinks.

As soon as it encounters a new node, it tries to schedule that operation in the current clock schedule; if all the resources are occupied, it starts another control step and schedules the operation there.

FCFS schedules generally handle the nodes from source to sink, but nodes that appear at the same depth in the graph can be scheduled in arbitrary order. The quality of the schedule, as measured by its length, can change greatly depending on exactly which order the nodes at a given depth are considered.

FCFS, because it chooses nodes at equal depth arbitrarily, may delay a critical operation. An obvious improvement is a critical-path scheduling algorithm, which schedules operations on the critical path first.

List scheduling is an effective heuristic that tries to improve on critical-path scheduling by providing a more balanced consideration of off-critical-path nodes.

Rather than treat all nodes that are off the critical path as equally unimportant, list scheduling estimates how close a node is to being critical by measuring D, the number of descendants the node has in the data flow graph. A node with few descendants is less likely to become critical than another node at the same depth that has more descendants.

List scheduling also traverses the data flow graph from sources to sinks, but when it has several nodes at the same depth vying for attention, it always chooses the node with the most descendants.

In our simple timing model, where all nodes take the same amount of time, a critical node will always have more descendants than any noncritical node. The heuristic takes its name from the list of nodes currently waiting to be scheduled.

Force-directed scheduling [Pau89] is a well-known scheduling algorithm that tries to minimize hardware cost to meet a particular performance goal by balancing the use of function units across cycles. The algorithm selects one operation to schedule using forces (see Figure 7-2 below ).

It then assigns a control step to that operation. Once an operation has been scheduled, it does not move, so the algorithm's outer loop executes once for each operation in the data flow graph.

Figure 7.2. How forces guide operator scheduling
To compute the forces on the operators, we first need to find the distributions of various operations in the data flow graph, as represented by a distribution graph.

The ASAP and ALAP schedules tells us the range of control steps at which each operation can be scheduled. We assume that each operation has a uniform probability of being assigned to any feasible control step.

A distribution graph shows the expected value of the number of operators of a given type being assigned to each control step, as shown in Figure 7-3 below.

The distribution graph gives us a probabilistic view of the number of function units of a given type (adder in this case) that will be required at each control step. In this example, there are three additions, but they cannot all occur on the same cycle.

Figure 7.3. Example distribution graph for force-directed scheduling
If we compute the ASAP and ALAP schedules, we find that +1 must occur in the first control step, +3 in the last, and +2 addition can occur in either of the first two control steps.

The distribution graph DG+ (t) shows the expected number of additions as a function of control step; the expected value at each control step is computed by assuming that each operation is equally probable at every legal control step.

We build a distribution for each type of function unit that we will allocate. The total number of function units required for the data path is the maximum number needed for any control step.So, so minimizing hardware requirements requires choosing a schedule that balances the need for a given function unit over the entire schedule length.

The distribution graphs are updated each time an operation is scheduled-when an operation is assigned to a control step, its probability at that control step becomes 1 and at any other control step 0.

As the shape of the distribution graph for a function unit changes, force-directed scheduling tries to select control steps for the remaining operations, which keeps the operator distribution balanced.

Force-directed scheduling calculates forces like those exerted by springs to help balance the utilization of operators. The spring forces are a linear function of displacement, as given by Hooke's Law:

F(x) = kx,             (EQ 7-1)  

where x is the displacement and k is the spring constant, which represents the spring's stiffness.

When computing forces on the operator we are trying to schedule, we first choose a candidate schedule time for the operator, then compute the forces to evaluate the effects of that scheduling choice on the allocation.

There are two types of forces applied by and on operators: self forces and predecessor/successor forces. Self forces are designed to equalize the utilization of function units across all control steps.

Since we are selecting a schedule for one operation at a time, we need to take into account how fixing that operation in time will affect other operations, either by pulling forward earlier operations or pushing back succeeding ones.

When we choose a candidate time for the operator being scheduled, restrictions are placed on the feasible ranges of its immediate predecessors and successors. (In fact, the effects of a scheduling choice can ripple through the whole data flow graph, but this approximation ignores effects at a distance.)

The predecessor/successor forces, Po (t) and X(t), are these imposed on predecessor/successor operations. The scheduling choice is evaluated based on the total forces in the system exerted by this scheduling choice: the self forces, the predecessor forces, and the successor forces are all added together.

That is, the predecessor and successor operators do not directly exert forces on the operator being scheduled, but the forces exerted on them by that scheduling choice help determine the quality of the allocation.

At each step, we choose the operation with the lowest total force to schedule and place it at the control step at which it feels the lowest total force.

Path-based scheduling [Cam91] is another well-known scheduling algorithm for high-level synthesis. Unlike the previous methods, path-based scheduling is designed to minimize the number of control states required in the implementation's controller, given constraints on data path resources.

The algorithm schedules each path independently, using an algorithm that guarantees the minimum number of control states on each path. The algorithm then optimally combines the path schedules into a system schedule.

The schedule for each path is found using minimum clique covering; this step is known by the name as-fast-as-possible (AFAP) scheduling.

Accelerator Estimation
Estimating the hardware costs of an accelerator in a heterogeneous multiprocessor requires balancing accuracy and efficiency. Estimates must be good enough to avoid misguiding the overall synthesis process.

However, the estimates must be generated quickly enough that co-synthesis can explore a large number of candidate designs. Estimation methods for co-synthesis generally avoid very detailed estimates of physical characteristics, such as might be generated by placement and routing. Instead, they rely on scheduling and allocation to measure execution time and hardware size.

Hermann et al. [Her94] used numerical methods to estimate accelerator costs in COSYMA. They pointed out that as additional layers of a loop nest are peeled off for implementation in the accelerator, the cost of the modified accelerator is not equal to the sum of the cost of the new and additional pieces.

For example, if a function is described in three blocks, then eight terms may be necessary to describe the cost of the accelerator that implements all three blocks. They used numerical techniques to estimate accelerator costs as new blocks were added to the accelerator.

Henkel and Ernst [Hen95; Hen01] developed an algorithm to quickly and accurately estimate accelerator performance and cost given a CDFG. They use path-based scheduling to estimate the length of the schedule and the resources required to be allocated.

To reduce the runtime of the AFAP and overlapping steps of path-based scheduling, they decompose the CDFG and schedule each subgraph independently; this technique is known as path-based estimation. Theyr use three rules to guide the selection of cut points.

First, they cut at nodes with a smaller iteration count, since any penalty incurred by cutting the graph will be multiplied by the number of tritons.

Second, they cut at nodes where many paths join, usually generated by the end of a scope in the language. Third, they try to cut the graph into roughly equal-size pieces. The total number of cuts it is possible to make can be controlled by the designer.

Their estimation algorithm, including profiling, generating, and converting the CDFG; cutting the CDFG; and superposing the schedules is shown in Figure 7-4 below.

Vahid and Gajski [Vah95] developed an algorithm for fast incremental estimation of data path and controller costs. They model hardware cost as a sum

CH = kFU SFU + kSU SSU + kM SM + kSR SSR + kC SC = kW SW (EQ 7-2)

where Sx is the size of a given type of hardware element and kx is the size-to-cost ratio of that type. In this formula, FU is function units. SU is storage units. M is multiplexers. SR is state registers. C is control logic. W is wiring.

Figure 7-4 An algorithm for path-based estimation.
They compute these from more direct parameters such as the number of states in the controller, the sources of operands for operations, and so on. Vahid and Gajski preprocess the design to compute a variety of information for each procedure in the program.

1) A set of data path inputs and a set of data path outputs.
2) The set of function and storage units. Each unit is described by its size and the number of control lines it accepts.
3) A set of functional objects is described by the number of possible control states for the object and the destinations to which the object may write. Each destination is itself described with an identifier, the number of states for which the destination is active for this object, and the sources that the object can assign to this destination.

Figure 7-5 Preprocessed information used for cost estimation.
A tabular form of this information is shown in Figure 7-5 above. This information can be used to compute some intermediate values as shown in Figure 7-6 below . These values in turn can be used to determine the parameters in (EQ 7-2) .

Vahid and Gajski use an update algorithm given a change to the hardware configuration. Given a design object, they first add the object to the design if it does not yet exist. They then update multiplexer sources and sizes and update control line active states for this destination.

Click on image to enlarge.

Figure 7-6 Tabular representation of hardware resources.

Finally, they update the number of controller states. Their update algorithm executes in constant time if the number of destinations per object is approximately constant.

To read Part 2 , go to: Hardware/software co-synthesis algorithms
To read Part 3 , go to: Co-Synthesis of multiprocessors
To read Part 4 , go to: Multi-objective optimization

This series of four articles is based on material printed with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007 from “High Performance Embedded Computing” by Wayne Wolf. For more information about this title and other similar books, please visit

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems .

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.