Down & dirty with HW/SW co-design: Part 3 - Cosynthesis of multiprocessors - Embedded.com

Down & dirty with HW/SW co-design: Part 3 – Cosynthesis of multiprocessors

Following on Part 1 and Part 2 in this series, we now will reconsider hardware/software co-design for more general multiprocessor architectures.

Many useful systems can be designed by hardware/software partitioning algorithms based on the CPU+ accelerator template. Hardware/software partitioning can also be used to design PEs that are part of larger multiprocessors.

But if we want to design a complete application-specific multiprocessor system, we need to use more general co-synthesis algorithms that do not rely on the CPU+ accelerator template.

In the most general case, all these tasks are related. Different partitions of functionality into processes clearly changes scheduling and allocation. Even if we chose a partitioning of functions, scheduling, allocating, and binding are closely related.

We want to choose the processing element for a process – both the general allocation and the binding to a specific type – based on the overall system schedule and when that process has to finish.

But we can't determine the schedule and the completion time of a process until we at least choose an allocation and most likely a binding. This is the Gordian knot that co-synthesis designers must face-the set of intertwined problems that must somehow be unraveled.

Kalavade and Lee [Kal97] developed the Global Criticality/Local Phase (GCLP) algorithm to synthesize complex embedded system design. Their algorithm performs the hardware/software partitioning-style task of determining whether a task should be implemented in hardware or software and the schedule of the tasks in the system.

It also allows for several different hardware and/or software implementations of a given task and selects the best implementation for the system. The application is specified as a directed acyclic graph. The target architecture includes a CPU and a custom data path.

The hardware is limited to a maximum size and the software is limited to a maximum memory size. The various hardware and software implementations of the task are precharacterized.

Each node in the task graph has a hardware implementation curve and a software implementation curve that describe hardware and software implementation costs, respectively, for various bins that represent different implementations.

Their GCLP algorithm adaptively changes the objective at each step to try to optimize the design. It looks at two factors: the global criticality and the local phase. The global criticality is a global scheduling measure.

It computes the fraction of as-yet unmapped nodes that would have to be moved from software to hardware implementations in order to meet the schedule's deadline. The global criticality is averaged over all the unmapped nodes in each step.

The local phase computation improves hardware/software mapping. Heuristics are used to determine whether a node is best implemented in hardware or software; bit manipulations have a bias toward hardware implementation and extensive memory operations have a bias toward software.

These heuristics are termed repelling forces-for example, bit manipulations repel a node away from software implementation. The algorithm uses several repeller properties, each with its own repeller value. The sum of all the repeller properties for a node gives its repeller measure.

The algorithm computes extremity measures to help optimize mappings. A software extremity node has a long runtime but has a small hardware area cost. Similarly, a hardware extremity node has a large area cost but a small software execution time.

The co-synthesis algorithm iteratively maps nodes in the task graph. After selecting a node to map, it classifies a node as a repeller, an extremity, or a normal node.

SpecSyn [Gaj98] is designed to support a specify-explore-refine methodology. As shown in Figure 7-13 below , SpecSyn transforms functional specifications into an intermediate form known as SLIF. A variety of tools can then work on the design.

A refiner generates software and hardware descriptions. SpecSyn represents designs using a program-state machine model, which uses a Statechart-like description that allows complex sequential programs at the leaf states. The SLIF representation is annotated with attributes such as area, profiling information, number of bits per transfer, and so on.

Figure 7-13 The SpecSyn system.
The allocation phase can allocate standard or custom processors, memories, or busses. Partitioning assigns operations in the functional specification to allocated hardware units.

During partitioning, performance is estimated using parameters such as bus frequency, bit width of transformations, and profiling values. Hardware size is estimated using estimates for the processing elements, busses, and so on, while software size is estimated by compiling code into generic three-operand instructions, then estimating code size for a specific processor by multiplying generic code size by a processor-code size coefficient.

A refined design adds detail but is simulatable and synthesizable, so its components can be used for further verification and synthesis. A variety of refinements can be performed, including:

Control-related refinement
This preserves the execution order of a specification when a behavior is split among several behaviors during partitioning. Signals, either unidirectional or handshaking, must be added to ensure that the two modules perform their respective operations in the proper order.

Data-related refinement. This updates the values of variables that are shared by several behaviors. For example, as shown in Figure 7-14 below , a bus protocol can be used to send a value from one processor to another. Architecture-related refinements-These add behaviors that resolve conflicts and allow data transfer to happen smoothly.

Figure 7-14 Data refinement in SpecSyn
For example, a bus arbiter can be inserted when several behaviors may use the bus at the same time. Bus interfaces are inserted when message-passing is used for data communication. Wolf [Wo197b] developed a co-synthesis algorithm that performed scheduling, allocating, and binding for architectures with arbitrary topologies.

The application is described as a task graph with multiple tasks. The set of processing elements that can be used in the design is described in a technology table that gives, for each PE and each process, the execution time of that process on each PE as well as the total cost of the processing elements. No initial template for the architecture is given. Synthesis proceeds through several steps:

Step #1. An initial processor allocation and binding is constructed by selecting for each process the fastest PE for that process in the technology table.

If the system cannot be scheduled in this allocation, then there is no feasible solution. The initial schedule tells the algorithm the rates at which processing elements must communicate with each other.

Step #2. Processes are reallocated to reduce PE cost. This step is performed iteratively by first pairwise merging processes to try to eliminate PEs, then moving processes to balance the load. These operations are repeated until the total cost of the hardware stops going down. At the end of this step, we have a good approximation of the number and types of PEs required.

Steip #3. Processes are reallocated again to minimize inter-PE communication. This step allows us to minimize communication bandwidth, which is an important aspect of design cost. At the end of this step, we have completely determined the number and types of PEs in the architecture.

Step #4. We then allocate communication channels as required to support the interPE communication.

Step #5. Finally, we allocate input/output devices, as required, to support the I/O operations required by the processes.

Xie and Wolf coupled hardware/software co-synthesis to a high level synthesis engine that used integer programming, rather than heuristic graph algorithms, to schedule and allocate hardware.

This engine was fast enough to be used as a subroutine during co-synthesis. They developed a performance analysis tool based on the high-level synthesis. At the start of synthesis, they constructed an initial solution with every task on the fastest PE, then iteratively improved the design.

The first phase performed tried to move tasks from the accelerator to CPUs and also to reduce the cost of CPUs; these procedures analyze slack to find good candidates for movement. The second phase reduces the cost of the accelerator.

The system calculates slacks in the system schedule and assigns them to various accelerators: it uses the performance analysis tool to determine whether each accelerator can be slowed down by the prescribed amount and the area of the modified accelerator.

Large sets of tasks present additional problems for co-synthesis. Dave et al. [Dav99a]developed the COSYN system to synthesize heterogeneous distributed embedded systems from large task graphs. COSYN allows task graphs to be formulated using repeated tasks: a prototype task is defined and then copied many times.

Communication systems often have many copies of identical tasks-each task represents a call. This technique is also useful in other types of large realtime systems.

COSYN task graphs are directed sets of processes. Each task graph has an earliest start time, period, and deadline. The implementation is built from PEs and communication links. Several tables and vectors define the tasks and their underlying hardware. A technology table defines the execution time of each process on each feasible PE.

A communication vector is defined for each edge in the task graph; the value of the ith entry defines the time required to transmit the data sent from source to sink over the ith type of communication link.

A preference vector is indexed by the processing element number; the ith value is 0 if the process itself cannot be mapped onto the ith PE and 1 if it can be.

An exclusion vector is indexed by process number; the ith value is 0 if the two processes cannot be mapped onto the same processing element, ith PE, and 1 if they can co-exist on the same processor.

An average power vector for each process defines the average power consumption of the task on each type of PE. A memory vector for each process consists of three scalar values: program storage, data storage, and stack storage. There is one memory vector for each task.

The preemption overhead time is specified for each processor. The set of processes allocated to the same PE is called a cluster. COSYN will adjust the periods of tasks by a small amount (3% by default) to reduce the length of the hyper period.

The number of executions of the task in the hyper period causes problems: each execution must be checked. By adjusting the period of a task that has many copies in the hyper period, COSYN can reduce the number of times it must consider that task in the hyper period. COSYN uses this value to rank tasks for period adjustment. Threshold reduction stops when the number of total executions of a task in the hyper period fall below a threshold.

COSYN also uses an association graph to manage the representation of multiple executions of a task in the hyper period. If all tasks have deadlines shorter than their periods, then the association graph need only be one-dimensional; its entries include the PE to which the task is allocated and its priority, deadline, best-case projected finish time, and worst-case projected finish time.

If a task's deadline can extend beyond its period, then the association graph must be two-dimensional because there may be more than one instance of the task executing at a time. The second dimension indexes over the simultaneously executing copies of the task.

The COSYN synthesis algorithm is summarized in Figure 7-15 below. COSYN performs three major phases: (1) It clusters tasks to reduce the search space for allocation; (2) it allocates tasks to processing elements; and (3) the tasks and processes are scheduled.

Figure 7-15 The COSYN synthesis algorithm.
COSYN forms clusters to reduce the combined communication costs of a set of processes that is on the critical path. This is done by putting the set of processes on the same PE.

It first assigns a preliminary priority to each task based on its deadline. It then uses a greedy algorithm to cluster processes: It starts with the highest-priority task, puts it in a cluster, and walks through the fan-in of the task to find further tasks for the cluster. To avoid overly unbalanced loads on the PEs, a threshold limits the maximum size of a cluster.

After clusters are formed, they are allocated, with allocation decisions driven by the dollar cost of the allocation. Clusters are allocated in order of priority, starting with the highest-priority clusters. COSYN checks for the compatibility of connections between PEs during allocation.

After allocation, the processes are scheduled using priorities. COSYN concentrates on scheduling the first copy of each task. The association array is used to update the start and finish times of the remaining copies of the task; other copies need to be explicitly scheduled only when an execution slot is taken by a higher-priority task.

COSYN allows the supply voltages of parts to be mixed so that some noncritical sections of the design can be run at lower power levels. It checks the compatibility of connections between components at different supply voltages.

COSYN evaluates the feasible allocations by comparing the sum of their worst-case finish times. It chooses the solution with the largest total worst-case finish times for the processes, since the longest finish times are usually associated with the lowest-cost hardware.

COSYN tries to allocate concurrent instances of tasks to allow pipelining. As shown in Figure 7-16 below , several instances of a task may execute concurrently; the instances of each process in the task are identified by a superscript.

Figure 7-16 Allocating concurrent task instances for pipelining.
If we allocate t1 to PE A, t2 to PE B, and t3 to PE C, then we pass data from process to process in the task at low cost and pipeline the execution of the task instances.

Dave and Jha [Dav98] also developed methods for hierarchical co-synthesis. As shown in Figure 7-17 below , their COHRA system both accepts hierarchical task graphs and produces hierarchically organized hardware architectures.

A hierarchical task is a node in the task graph that contains its own task graph. A hierarchical hardware architecture is built from several layers of PEs that are composed in an approximate tree structure, though some non-tree edges are possible.

Figure 7-17 Hierarchical specifications and architectures.
COHRA's synthesis algorithm resembles COSYN's algorithm: clustering, then allocation, then scheduling. COHRA also uses association tables to keep track of multiple copies of tasks.

Dave and Jha [Dav99b] developed the COFTA system to co-synthesize fault tolerant systems. They used two types of checks at the task level, specifically:.

Assertion tasks. They compute assertion operations, such as checking an address against a valid range, and issue an error when the asserted condition is false.

Compare tasks. They compare the results of duplicate copies of tasks and issue an error when the results do not agree.

The system designer specifies the assertions to be used; duplicate copies of tasks and the associated compare task are generated for tasks that do not have any assertions.

The authors argue that assertion tasks are much more efficient than running duplicate copies of all tasks. An assertion task can catch many errors in a task without requiring that the entire computation of the task be repeated.

Figure 7-18 Simple 1-by-n protection in a failure group.
COFTA uses the clustering phase to enforce a fault-tolerant task allocation. It assigns an assertion overhead and fault tolerant level to each task.

The assertion overhead of a task is the computation and communication times for all the processes in the transitive fanin of the process with an assertion. The fault tolerance level of a task is the assertion overhead of the task plus the maximum fault tolerance level of all processes in its fanout.

These levels must be recomputed as the clustering changes. During clustering, the fault tolerance level is used to select new tasks for the cluster-the fanout task with the highest fault tolerance level is chosen as the next addition to the cluster.

COFTA tries to share assertion tasks whenever possible. If several tasks use the same assertion but at nonoverlapping times, then one assertion task can be used to compute the assertion for all those tasks, saving hardware resources.

COFTA also generates a fault-tolerant hardware architecture. Figure 7-18 above shows a simple 1-by-n failure group that consists of n service modules that perform the actual work of the system and one protection module. The protection module checks the service modules and substitutes results for bad service modules. In general, a failure group may use m-by-n protection with m protection modules. COFTA uses a restricted form of subgraph isomorphism to build failure groups into the architecture.

Next in Part 4: Multi-objective optimization.
To read Part 1 , go to “Reviewing the fundamentals .”
To read Part 2 , go to “Co-synthesis algorithms.

This series of four articles is based on material printed with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007 from “High Performance Embedded Computing” by Wayne Wolf. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems .

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.