
Down & dirty with HW/SW co-design: Part 2 – Co-synthesis algorithms

When designing a distributed embedded system, software AND hardware developers need to deal with several design problems:

Problem #1: We must schedule operations in time, including communication on the network and computations on the processing elements. Clearly, the scheduling of operations on the PEs and the communications between them are linked.

If one PE finishes its computations too late, it may interfere with other communications on the network as it tries to send its result to the processing element that needs it. This is bad both for the processing element that needs the result and for the other PEs whose communications are disrupted.

Problem #2: we must allocate computations to the processing elements. The allocation of computations to the PEs determines which communications are required: if a value computed on one PE is needed on another, it must be transmitted over the network.

Problem #3: we must partition the functional description into computational units. Partitioning in this sense is different from hardware/software partitioning, in which we divide a function between the CPU and the accelerator. The granularity with which we describe the function to be implemented affects our choice of search algorithm and our methods for performance and energy analysis.

Problem #4: we need to map processing elements and communication links onto specific components. Some algorithms allocate functions to abstract PEs and links that do not model the behavior of a particular component in detail. This allows the synthesis algorithm to explore a larger design space using simple estimates. Mapping selects specific components that can be associated with more precise cost, performance, and power models.

In short, co-synthesis design systems need to satisfy objective functions and meet constraints. The traditional co-synthesis problem is to minimize hardware cost while satisfying deadlines. Power consumption was added to objective functions as low-power design became more important. Several groups have developed algorithms that try to satisfy multiple objective functions simultaneously.

Following on the overview in Part 1 of this series, the discussion here in Part 2 considers various design representations, including 1) methods used to specify the programs that describe the desired system behavior; 2) ways to describe the characteristics of the hardware; and 3) techniques for synthesis based on a hardware architecture template.

Part 3 looks at co-synthesis algorithms for general multiprocessors that generate arbitrary hardware topologies without the use of templates. Finally, in Part 4, we study 1) multi-objective algorithms for co-synthesis; 2) co-synthesis for control and I/O systems; 3) co-synthesis techniques designed for memory-intensive systems; and 4) co-synthesis algorithms for reconfigurable systems.

Program Representations
Co-synthesis algorithms generally expect system functionality to be specified in one of two forms: a sequential/parallel program or a task graph. Programs provide operator-level detail of a program's execution. Most programming languages are also sequential, meaning that synthesis must analyze the available parallelism in a program. Some co-synthesis systems accept fairly straightforward subsets of languages such as C.

Others add constructs that allow the designer to specify operations that can be performed in parallel or must be performed in a particular order; an example is the HardwareC language used by Vulcan [Gup93].

The system of Eles et al. [Ele96] takes VHDL programs as behavioral specifications; they use VHDL processes to capture coarse-grained concurrency and use VHDL operations to describe interprocess communication.

Task graphs, as we saw in Part 1, have been used for many years to specify concurrent software. Task graphs are generally not described at the operator level, so they provide a coarser-grained description of the functionality. They naturally describe the parallelism available in a system at that coarser level. However, task graph variants that allow conditional execution of processes have been developed.

For example, Eles et al. [Ele98] used a conditional task graph model. Some edges in their task graph model are designated as conditional; each is labeled with a condition, called a guard, that determines whether that edge is traversed.

A node that is the source of conditional edges is called a disjunction node, and a node that is the sink of conditional edges is a conjunction node. They require that, for any nodes i and j where j is not a conjunction node, there can be an edge i -> j only if the guard for i being true implies that the guard for j is true.
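
To make the model concrete, here is a minimal sketch in C of a task graph with guarded edges; the structure and names are illustrative assumptions, not the representation used by Eles et al.:

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_OUT 4

struct task;  /* forward declaration */

typedef bool (*guard_fn)(const void *state);  /* guard predicate */

struct edge {
    struct task *dst;
    guard_fn guard;   /* NULL marks an unconditional edge */
};

struct task {
    const char *name;
    int n_out;
    struct edge out[MAX_OUT];
};

/* Follow only those outgoing edges whose guards hold; a node with
 * several guarded out-edges acts as a disjunction node. */
static void fire(const struct task *t, const void *state)
{
    for (int i = 0; i < t->n_out; i++) {
        const struct edge *e = &t->out[i];
        if (e->guard == NULL || e->guard(state))
            printf("%s -> %s\n", t->name, e->dst->name);
    }
}

static bool cond_true(const void *s) { (void)s; return true; }

int main(void)
{
    struct task b = { "B", 0, { { 0 } } };
    struct task c = { "C", 0, { { 0 } } };
    struct task a = { "A", 2, { { &b, cond_true }, { &c, NULL } } };
    fire(&a, NULL);  /* A->B taken because its guard holds; A->C is unconditional */
    return 0;
}
```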

The control dependence graph (CDG) introduced in Part 1 of this series can be used to analyze the conditions under which a node will be executed. Also, Task Graphs for Free (TGFF) [Dic98c] is a Web tool that generates pseudorandom task graphs. The user can control the range of deadlines, graph connectivity parameters, and laxity.

A very different specification was used by Barros et al. [Bar94], who used the UNITY language to specify system behavior. UNITY was designed as an abstract language for parallel programming. The assignment portion of a UNITY program specifies the desired computations.

It consists of a set of assignments that may be performed either synchronously or asynchronously on a statement-by-statement basis. Execution starts from an initial condition, statements are executed under a fairness rule, and execution continues until the program reaches a fixed point at which its state no longer changes.

Platform Representations
In addition to representing the program to be implemented, we must also represent the hardware platform being designed. Because a platform is being created and is not given, our representation of the platform itself must be flexible. We need to describe components of the platform so that those components can be combined into a complete platform.

Some of the information about the platform is closely linked to the representation of the program. For example, processor performance is usually not captured abstractly; instead, we generally capture the speed at which various pieces of the program execute on different types of PEs. One data structure that is used with some variation in many systems is the technology table (see Figure 7-7 below ).

Figure 7-7 A process-to-PE technology table.
This table gives the execution time of each process for each type of processing element. (Some processes may not run on some types of PEs, particularly if the element is not programmable.)

A co-synthesis program can easily look up the execution time for a process once it knows the type of PE to which the process has been allocated.

The performance of a point-to-point communication link is generally easier to characterize since transmission time is independent of the data being sent. In these cases we can characterize the link with a data rate.

In general, a multidimensional table could have several entries for each row/column pair, including the following (a code sketch appears after the list):

1) The CPU time entry shows the computation time required for a process.

2) The communication time entry gives the time required to send data across a link; the amount of data is specified by the source-destination pair.

3) The cost entry gives the manufacturing cost of a processing element or communication link.

4) The power entry gives the power consumption of a PE or communication link; this entry can be further subdivided into static and dynamic power components.
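
A minimal sketch of such a technology table in C follows; the process names, PE types, and all numeric values are invented for illustration:

```c
#include <stdio.h>

enum pe_type    { PE_CPU, PE_DSP, PE_ASIC, N_PE_TYPES };
enum process_id { P_FILTER, P_FFT, P_CTRL, N_PROCESSES };

#define INFEASIBLE -1  /* the process cannot run on this PE type */

/* CPU-time entries: worst-case execution time (microseconds) of each
 * process on each type of PE. */
static const int exec_time_us[N_PROCESSES][N_PE_TYPES] = {
    /*             CPU    DSP    ASIC */
    /* FILTER */ {  900,   300,    90 },
    /* FFT    */ { 1500,   400,   120 },
    /* CTRL   */ {  200, INFEASIBLE, INFEASIBLE },  /* CPU only */
};

/* Cost entries: manufacturing cost of each PE type (arbitrary units). */
static const int pe_cost[N_PE_TYPES] = { 10, 8, 15 };

int main(void)
{
    /* Co-synthesis looks up execution time once an allocation is chosen. */
    printf("FFT on DSP: %d us, DSP cost: %d\n",
           exec_time_us[P_FFT][PE_DSP], pe_cost[PE_DSP]);
    return 0;
}
```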

We also need to represent the multiprocessor's structure, which in general changes during co-synthesis. A graph is often used, as shown in Figure 7-8 below , with nodes for the PEs and edges for the communication links.

Figure 7-8 A multiprocessor connectivity graph.
Complex communication structures, such as multistage networks, are generally not directly represented in the platform description. Instead, the structure of the links shows possible connections that can be implemented using different physical layers. The platform model also does not usually include the location of the memories within the communication topology.
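
The connectivity graph itself can be captured with equally small structures. The sketch below uses invented names and rates, with PEs as nodes and point-to-point links as edges characterized by a data rate:

```c
#include <stdio.h>

struct pe { const char *name; };

struct link {
    int src, dst;      /* indices into the PE array */
    double rate_mbps;  /* a data rate characterizes a point-to-point link */
};

int main(void)
{
    struct pe   pes[]   = { {"cpu0"}, {"dsp0"}, {"acc0"} };
    struct link links[] = { {0, 1, 100.0}, {0, 2, 400.0} };

    /* Transmission time is independent of the data values sent, so the
     * time to move n bytes over a link is just n / rate. */
    double bytes = 4096.0;
    double secs  = (bytes * 8.0) / (links[1].rate_mbps * 1e6);
    printf("4 KB from %s to %s: %.6f s\n",
           pes[links[1].src].name, pes[links[1].dst].name, secs);
    return 0;
}
```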

Template-Driven Synthesis Algorithms
Most early co-synthesis algorithms, and many co-synthesis algorithms in general, generate hardware architectures based on an architectural template.

The most common architectural template is a bus-based system with one or more CPUs and one or more accelerators. Co-synthesis that maps a design onto this bus-based template is generally known as hardware/software partitioning because the bus defines a boundary between the two partitions.

This makes it possible to apply traditional graph partitioning algorithms to co-synthesis. The template provided by the CPU/bus/accelerator configuration is a powerful element of the hardware/software co-synthesis methodology. The template provides important knowledge that restricts the design space and helps to decouple several operations:

1) The type of CPU is known. Once we know the CPU type, we can estimate software performance much more accurately. We can also generate a great deal of performance data in advance, since knowledge of the CPU type greatly limits the amount of data that we need to generate.

2) The number of processing elements is known. We can more easily build performance models because we know the interconnection topology.

3) Only one processing element can multi-task. Performance analysis is more complicated when several processing elements can switch between tasks.

When we move to more general architectures, co-synthesis algorithms become considerably more complex. As we will see, different algorithm designers have made different choices about simplifying assumptions and strategies for breaking the problem into manageable pieces.

Two early hardware/software partitioning systems exemplify different approaches to co-synthesis. COSYMA [Ern93] started with all functions on the CPU and moved some functions to the accelerator to improve performance.

Vulcan [Gup93] started with all functions in the accelerator and moved some functions to the CPU to reduce cost. Comparing these two systems gives us a good understanding of the major hardware/software partitioning problems.

The block diagram of the COSYMA system is shown in Figure 7-9 below . The application is described in CX, which extends C with some task constructs and timing constraints. That description is compiled into an intermediate form called an ES graph.

Figure 7-9 The COSYMA System.
The ES graph represents the control and data flow in the program in a way that can be rewritten either as a hardware description or an executable C program. (COSYMA also includes a simulator for ES graphs.)

The ES graph is composed of basic scheduling blocks (BSBs) that are the primary units of co-synthesis. COSYMA uses simulated annealing to search the design space to determine which BSBs should be implemented on the accelerator.
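
The sketch below shows the shape of such an annealing loop over BSB assignments. It is a generic simulated-annealing skeleton, not COSYMA's code, and its cost function is a toy stand-in for COSYMA's timing and area estimates:

```c
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

#define N_BSB 32

static int part[N_BSB];  /* part[i] == 1: BSB i goes to the accelerator */

/* Toy cost: more hardware shortens run time but raises area cost. A
 * real cost function would use timing and area estimates for the
 * current partition. */
static double cost(const int *p)
{
    int hw = 0;
    for (int i = 0; i < N_BSB; i++) hw += p[i];
    return 100.0 / (1.0 + hw) + 2.0 * hw;
}

static double frand(void) { return (double)rand() / RAND_MAX; }

int main(void)
{
    double temp = 100.0, cooling = 0.995, c = cost(part);
    while (temp > 0.01) {
        int i = rand() % N_BSB;  /* move: flip one BSB across the boundary */
        part[i] ^= 1;
        double c2 = cost(part);
        /* Accept improvements; accept uphill moves with probability
         * exp(-delta/temp) so the search can escape local minima. */
        if (c2 <= c || frand() < exp(-(c2 - c) / temp))
            c = c2;
        else
            part[i] ^= 1;        /* reject: undo the move */
        temp *= cooling;
    }
    printf("final cost %.2f\n", c);
    return 0;
}
```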

The Vulcan system started with its own programming language extension known as HardwareC. Vulcan was designed to synthesize multiple tasks running at different rates, so HardwareC allows the description of timing and rate constraints. Input and output operations are nondeterministic operations that do not take a predictable amount of time.

The HardwareC description is automatically broken into tasks by Vulcan, which splits the description at the nondeterministic operation points. Vulcan schedules and allocates threads using iterative improvement. The initial solution puts all threads in the accelerator, which is a high-performance but also high-cost solution.

Vulcan iteratively moves threads to the CPU, testing whether the implementation still meets its performance goals after the move. Once a thread is successfully moved, its successors become the next candidates for movement.

Vulcan estimates performance and cost on the CPU by directly analyzing the threads. The latency of each thread has been predetermined; the thread reaction rate is given. Vulcan computes the processor utilization and bus utilization for each allocation. Vulcan estimates hardware size based on the costs of the operators in the function.
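
In outline, the cost-reduction loop looks like the following sketch; the feasibility test and the linear candidate order are toy stand-ins for Vulcan's utilization analysis and successor-driven candidate selection:

```c
#include <stdbool.h>
#include <stdio.h>

#define N_THREADS 16

static bool on_cpu[N_THREADS];  /* initially all false: all in hardware */

/* Toy feasibility test: pretend the CPU saturates after ten threads.
 * Vulcan instead recomputes processor and bus utilization and checks
 * the rate constraints. */
static bool meets_constraints(const bool *c)
{
    int n = 0;
    for (int i = 0; i < N_THREADS; i++) n += c[i];
    return n <= 10;
}

int main(void)
{
    /* Vulcan picks the successors of a successfully moved thread as the
     * next candidates; a simple linear order is used here instead. */
    for (int t = 0; t < N_THREADS; t++) {
        on_cpu[t] = true;              /* tentatively move to the CPU */
        if (!meets_constraints(on_cpu))
            on_cpu[t] = false;         /* revert: performance goal missed */
    }
    int moved = 0;
    for (int t = 0; t < N_THREADS; t++) moved += on_cpu[t];
    printf("%d of %d threads moved to the CPU\n", moved, N_THREADS);
    return 0;
}
```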

Because the space of feasible designs is large, co-synthesis algorithms rely on efficient search. Vahid et al. [Vah94] developed a binary constraint search algorithm. The algorithm minimizes the size of the required hardware in a hardware/software partitioned system to meet the performance constraint.

Vahid et al. formulate their objective function so as to meet a performance goal and to have hardware area no larger than a specified bound:
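
In general form, such an objective can be written as

Objfct = kperf · Σj Vperf(Cj) + karea · Varea(W)

where kperf and karea are weighting constants; this general form is inferred from the description below, and the weights actually used by Vahid et al. may differ.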

In this objective function, Vperf(Cj) is the amount by which performance constraint j is violated, and Varea(W) is the amount by which the hardware area exceeds the prescribed hardware size bound W.

This objective function is zero when both the performance and area bounds are satisfied. If candidate solutions are arranged on a line, then all solutions to the right of a certain point have a zero objective function; these are the satisfying implementations.

All the nonsatisfying solutions will be to the left of that boundary. The goal of a search algorithm is to find the boundary between the zero and non-zero objective function regions.

The search algorithm is shown in Figure 7-10 below . It collects hardware (H) and software (S) elements of the system during the search procedure.

Figure 7-10 Hardware/software partitioning by binary search.
PartAlg() generates new solutions to probe different regions of the objective function. The hardware/software partition at the boundary between the zero and nonzero objective function regions is returned as {Hzero, Szero}.
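
The following sketch shows the binary-search skeleton; part_alg() is a toy stand-in for PartAlg(), and the numbers are invented:

```c
#include <stdio.h>

/* Returns 0 when a partition meeting all performance constraints fits
 * within area bound w. A real implementation would run a partitioning
 * heuristic; this toy model is feasible once w reaches 37. */
static int part_alg(int w) { return (w >= 37) ? 0 : 1; }

static int min_area(int lo, int hi)
{
    int best = -1;
    while (lo <= hi) {
        int w = lo + (hi - lo) / 2;
        if (part_alg(w) == 0) {
            best = w;       /* satisfying: the boundary is at or left of w */
            hi = w - 1;
        } else {
            lo = w + 1;     /* violating: the boundary is right of w */
        }
    }
    return best;            /* smallest bound with a zero objective */
}

int main(void)
{
    printf("minimum hardware size: %d\n", min_area(0, 1000));
    return 0;
}
```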

The CoWare system [Ver96] supports the design of heterogeneous embedded systems. The system behavior is described as communicating processes.

A model can be encapsulated so that existing design tools, such as compilers and simulators, can be reused. The system description is refined over several steps to create an implementation.

A process specifies the behavior of part of the system. Processes can be described in several different host languages such as C or VHDL, and they can be defined hierarchically by composing them from other processes.

A thread is a single flow of control in a process; a process can have several threads. A slave thread is associated with a slave port; a time-loop thread is not associated with a port and is executed repeatedly.

Objects communicate through ports; like processes, ports can be described hierarchically. A protocol defines the communication syntax and semantics. Protocols can also be described hierarchically.
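
The flavor of this model can be suggested with a few C types; the names and fields below are invented for illustration and are not CoWare's actual interfaces:

```c
#include <stdio.h>

struct protocol { const char *name; };  /* communication syntax/semantics */

struct port {
    const char *name;
    const struct protocol *proto;
};

enum thread_kind { SLAVE_THREAD, TIME_LOOP_THREAD };

struct thread {
    enum thread_kind kind;
    const struct port *port;  /* slave threads only; NULL for time-loop */
};

struct process {
    const char *host_language;  /* e.g., "C" or "VHDL" */
    struct thread threads[4];
    int n_threads;
};

int main(void)
{
    struct protocol hs   = { "handshake" };
    struct port     p_in = { "in0", &hs };
    struct process  proc = {
        "C",
        { { SLAVE_THREAD, &p_in },       /* runs when in0 is accessed */
          { TIME_LOOP_THREAD, NULL } },  /* executed repeatedly */
        2,
    };
    printf("%s process with %d threads\n", proc.host_language, proc.n_threads);
    return 0;
}
```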

Co-synthesis implements communicating processes, some of which execute in software on a CPU while others are implemented as hardware attached to the CPU. A library describes the CPU and its interface.

On the hardware and software sides of the interface, concurrent processes are combined using in-lining into a product process. On the software side, device drivers need to be added to the specified functionality.

Eles et al. [Ele96] compared simulated annealing and tabu search for co-synthesis. Their simulated annealing algorithm has three major parameters: initial temperature TI, temperature length TL, and cooling ratio α.

Their simple move procedure randomly selects a node to be moved from one partition to the other. Their improved move procedure randomly selects a node to be moved and also moves directly connected nodes whose movement would improve the objective function.

Tabu search uses two data structures: short-term memory and long-term memory. Short-term memory holds information on recent search moves. Long-term memory holds information about moves over a longer time period.

For each node, the long-term memory holds the number of iterations that node has spent in each partition; this information can be used to determine the priority with which the node should be moved. A sketch of both memories appears below.
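
This sketch of the two memories for a node-move tabu search uses a generic tenure value and structures, not those of Eles et al.:

```c
#include <stdio.h>

#define N_NODES 64
#define TABU_TENURE 7

/* Short-term memory: iteration until which a recently moved node may
 * not be moved again. */
static int tabu_until[N_NODES];

/* Long-term memory: iterations each node has spent in each partition
 * (0 = software, 1 = hardware); rarely moved nodes can be prioritized. */
static long residence[N_NODES][2];

static int  is_tabu(int node, int iter) { return iter < tabu_until[node]; }

static void record_move(int node, int iter)
{
    tabu_until[node] = iter + TABU_TENURE;
}

static void record_iteration(const int *part)
{
    for (int n = 0; n < N_NODES; n++)
        residence[n][part[n]]++;
}

int main(void)
{
    int part[N_NODES] = {0};
    record_move(3, 0);       /* node 3 moved at iteration 0 */
    record_iteration(part);
    printf("node 3 tabu at iteration 5? %s\n", is_tabu(3, 5) ? "yes" : "no");
    return 0;
}
```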

Eles et al. use an extended form of VHDL to describe the system to be designed; they use VHDL primitives for interprocess communication. They analyze the code to identify processes, loops, and statement blocks, and they both statically analyze the code and profile it during execution. Their co-synthesis system minimizes an objective function built from three weighted terms.


The weights in this function are Q1, Q2, and Q3. HW and SW represent the hardware and software partitions, while NH and NS represent the number of nodes in those sets, respectively.

W1E_ij denotes the total amount of data transferred between processes i and j. W2E_ij counts the total number of interactions between the two processes, with each interaction counting as one event. W1N_i is the total number of operations performed by process i. A second node weight is defined as a weighted sum of the process measures:

W2N_i = MCL·KCL_i + MU·KU_i + MP·KP_i + MSO·KSO_i

where the Ms are weights and the Ks represent measures of the processes.

KCL_i is the relative computational load of process i: the total number of operations performed by the block of statements in the process divided by the total number of operations in the complete program.

KU_i is the ratio of the number of different types of operations in the process to the total number of operations; it is a measure of the uniformity of the computation.

KP_i is the ratio of the number of operations in the process to the length of the computation path, which measures potential parallelism. KSO_i is the ratio of the number of software-style operations (floating-point operations, recursive subroutine calls, pointer operations, and so on) to the total number of operations.

Eles et al. concluded that simulated annealing and tabu search provided results of similar quality but that tabu search ran about 20 times faster. However, the tabu search algorithm took considerably longer to develop. They also compared these algorithms to Kernighan-Lin-style partitioning and found that, for equal-quality solutions, both simulated annealing and tabu search ran faster than Kernighan-Lin on large problems.

The LYCOS co-synthesis system [Mad97] works from a single design representation that can be derived from multiple languages and can feed multiple tools. LYCOS represents designs internally as CDFGs, using a format known as Quenya, and it can translate subsets of both C and VHDL into its internal representation.

Quenya's execution semantics are based on colored Petri nets. Tokens arrive at nodes along input edges; the nodes can remove tokens from their inputs and place tokens on their outputs based on firing rules. Figure 7-11 below shows the model for an if-then-else statement.

Figure 7-11. A Quenya model for an if-then-else statement. 
The right block evaluates the condition and sends a token along the b edges. If b is true, then the s1 block is executed; otherwise, the s2 block is executed.

Figure 7-12 below shows the model for a while statement. Procedures are translated into separate graphs, with call nodes denoting calls to the procedure body and an edge for each input or output parameter. Shared variables are represented by import/export nodes that sample the value of a shared variable and wait nodes for synchronization.

LYCOS uses profiling tools to estimate performance. The Quenya model is translated to C++ and executed for profiling. Hardware models are constructed in the Architectural Construction Environment (ACE), which is used to estimate area and performance.

Figure 7-12. A Quenya model for a while statement. 
LYCOS breaks the functional model into basic scheduling blocks (BSBs). BSBs for hardware and software do not execute concurrently. A sequence of BSBs ⟨Bi, ..., Bj⟩ is denoted Si,j; the read and write sets for the sequence are denoted ri,j and wi,j. The speedup gained by moving a BSB sequence to hardware is computed from profiling counts and communication costs.

In this computation, pc(B) is the number of times that a BSB is executed in a profiling run, and ac(v) is the number of times that a variable is accessed in a profiling run. ts→h represents the software-to-hardware communication time, and th→s represents the hardware-to-software communication time.

The area penalty for moving a BSB sequence to hardware is the sum of the hardware areas of the individual BSBs. LYCOS evaluates sequences of BSBs and tries to find a combination of nonoverlapping ones that give the largest possible speedup and that fit within the available area.

In a solution, some BSBs are mapped to hardware and others to software. With proper construction of a table, the selection problem can be formulated as a dynamic program. The table is organized into groups such that all members of a group have the same BSB as the last member of the sequence.

Each table entry gives the area and speedup for one sequence. Once some BSBs have been chosen for the solution, other chosen sequences cannot include those BSBs, since they have already been allocated. The table allows sequences that end at a given BSB to be found easily during the search; a sketch of the resulting dynamic program appears below.
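
The following sketch shows such a table-driven selection as a knapsack-style dynamic program; the sequence data are toy values, and a linear scan stands in for the grouped table lookup:

```c
#include <stdio.h>

#define N_BSB 4
#define MAX_AREA 10

struct seq { int first, last, area, speedup; };

/* Candidate sequences ⟨B_first..B_last⟩ with toy area/speedup values. */
static const struct seq seqs[] = {
    {0, 0, 2, 5}, {0, 1, 4, 12}, {1, 1, 3, 6}, {2, 3, 5, 14}, {3, 3, 2, 4},
};
static const int n_seqs = sizeof seqs / sizeof seqs[0];

/* best[i][a] = max speedup using nonoverlapping sequences drawn from
 * BSBs 0..i-1 with total area at most a. */
static int best[N_BSB + 1][MAX_AREA + 1];

int main(void)
{
    for (int i = 1; i <= N_BSB; i++)
        for (int a = 0; a <= MAX_AREA; a++) {
            best[i][a] = best[i - 1][a];  /* no chosen sequence ends at i-1 */
            for (int s = 0; s < n_seqs; s++) {
                const struct seq *q = &seqs[s];
                if (q->last != i - 1 || q->area > a) continue;
                /* Move sequence q to hardware; BSBs before q->first
                 * remain free for earlier sequences. */
                int cand = best[q->first][a - q->area] + q->speedup;
                if (cand > best[i][a]) best[i][a] = cand;
            }
        }
    printf("max speedup within area %d: %d\n", MAX_AREA, best[N_BSB][MAX_AREA]);
    return 0;
}
```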

Xie and Wolf used high-level synthesis to estimate the performance and area of accelerators more accurately during co-synthesis. They targeted a bus-based architecture with multiple CPUs and multiple accelerators.

After finding an initial feasible solution, they first iteratively reduced the number of accelerators and the cost of CPUs; they then reduced accelerator cost using more accurate analysis; finally, they allocated and scheduled tasks and data transfers.

When choosing whether to allocate a task to an accelerator or to a CPU, they used two heuristics. First, if the speed difference between the CPU and accelerator implementation was small, they chose the task as a candidate for software implementation. They also chose tasks that were not on the critical path as candidates for software implementation.

The accelerator cost-reduction phase looks at two metrics of the schedule. Global slack is the slack between the deadline and the completion of the tasks; local slack is the slack between the accelerator's completion time and the start time of its successor tasks.

This phase relies on the use of a high-level synthesis tool that, given the number of cycles an accelerator has to implement, can generate a register-transfer implementation that meets that performance goal and is also area efficient.

This phase first tries to replace each accelerator with the smallest possible implementation of that accelerator. If the schedule is still feasible, they keep that allocation; otherwise, they replace it with the accelerator's fastest implementation. They next calculate the global slack and use it to determine the amount by which accelerators can be slowed down.

Each accelerator that uses its fastest implementation is slowed down by the sum of the global and local slacks; other accelerators are slowed down by the local slack. High-level synthesis is used to generate the implementation that meets this time bound; from the synthesis result, the accelerator's area can be accurately estimated. For each accelerator, the minimum-cost implementation at the chosen speed is selected.
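
The slack bookkeeping can be sketched as follows; the data structure and numbers are illustrative assumptions:

```c
#include <stdio.h>

struct sched_entry {
    int finish;      /* completion time in the current schedule */
    int succ_start;  /* earliest start time among successor tasks */
};

/* Global slack: deadline minus completion of the latest-finishing task. */
static int global_slack(const struct sched_entry *t, int n, int deadline)
{
    int last = 0;
    for (int i = 0; i < n; i++)
        if (t[i].finish > last) last = t[i].finish;
    return deadline - last;
}

/* Local slack: gap between a task's finish and its successors' start. */
static int local_slack(const struct sched_entry *t)
{
    return t->succ_start - t->finish;
}

int main(void)
{
    struct sched_entry acc   = { 40, 55 };   /* an accelerator's task */
    struct sched_entry all[] = { acc, { 70, 90 } };
    int g = global_slack(all, 2, 100);
    /* An accelerator at its fastest implementation may be slowed by the
     * sum of the global and local slacks; others by local slack alone. */
    printf("allowed slowdown: %d\n", g + local_slack(&acc));
    return 0;
}
```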

The Serra system [Moo00] combines static and dynamic scheduling of tasks. Static scheduling, which is handled by a hardware executive manager, is efficient for low-level tasks.

Dynamic scheduling, which is handled by a preemptive static priority scheduler, handles variations in execution time and event arrival time. Serra targets an architecture with a single CPU and multiple accelerators.

Serra represents the task set as a forest of DAGs. Each task graph has a never set that specifies tasks that cannot execute at the same time. The never set makes scheduling NP-complete; Serra finds a schedule using a dynamic-programming-style heuristic. The scheduler works backward from the end of the schedule, finding a good schedule given the tasks already scheduled.

To read Part 1, go to “Reviewing the fundamentals.”
Next in Part 3: Co-synthesis of general multiprocessors.

This series of four articles is based on material printed with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007 from “High Performance Embedded Computing” by Wayne Wolf. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems .
