
Down & dirty with HW/SW co-design: Part 4 – Multi-objective optimization

Embedded system designs must meet several different design criteria. The traditional operations research approach of defining a single objective function and perhaps some minor objective functions, along with design constraints, may not adequately describe the system requirements.

Economist Vilfredo Pareto  developed a theory for multi-objective analysis known as Pareto optimality. An important concept provided by this theory is the way in which optimal solutions are judged: an optimal solution cannot be improved without making some other part of the solution worse.
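To make the dominance test concrete, here is a minimal sketch (not from the source) that keeps only the non-dominated designs, assuming every objective is a number to be minimized:

```python
# Minimal sketch of Pareto optimality for design-space exploration.
# Assumption: each candidate design is scored on a tuple of objectives,
# all of which are to be minimized (e.g., cost, power, latency).

def dominates(a, b):
    """True if design a is at least as good as b on every objective
    and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs):
    """Return the designs that no other design dominates."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other is not d)]

if __name__ == "__main__":
    # (cost, power) pairs for four hypothetical candidate architectures
    candidates = [(10, 5.0), (12, 3.0), (9, 6.0), (11, 5.5)]
    print(pareto_front(candidates))   # (11, 5.5) is dominated by (10, 5.0)
```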

D'Ambrosio and Hu [D'Am94] developed the GOPS system to co-synthesize real-time embedded systems; they called this level of design configuration synthesis because it does not use detailed models of the processing elements.

GOPS uses a feasibility factor to estimate the schedulability of a set of tasks. Feasibility is determined relative to the speed of the PE and is defined in terms of several throughput factors. The upper-bound throughput TRU is an upper bound on the throughput required for the PE to finish the task. If rate-monotonic scheduling is used, then TRU for a set of N tasks is

where ci is the computation time, di is the deadline, and ai is the activation time of task i. If earliest-deadline-first scheduling is used, then

 

However, because these bounds are loose, D'Ambrosio and Hu provided some tighter bounds. They first showed that a set of n tasks arranged in ascending order of deadlines cannot be feasibly scheduled if

where

They also showed that

and

Based on these results, they showed that TRL can be computed as

The actual throughput required to finish all tasks on time is TRp, which lies between TRL and TRU. They define the feasibility factor in terms of these throughput estimates:

During optimization, the feasibility factor can be used to prune the search space. Candidate architectures with negative feasibility factors can be eliminated as infeasible. The feasibility factor is also used as an optimization objective. The two criteria of GOPS are cost and feasibility factor.
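The following sketch illustrates that pruning step under assumed data structures; `feasibility_factor` here is a stand-in for the throughput-based calculation above, not the GOPS implementation:

```python
# Hypothetical sketch of pruning by feasibility factor, in the spirit of GOPS.
# feasibility_factor() is a placeholder argument; D'Ambrosio and Hu derive the
# real value from the throughput bounds (TRU, TRL) discussed above.

def prune_candidates(candidates, feasibility_factor):
    """Keep architectures whose feasibility factor is non-negative; a negative
    value means the task set cannot be scheduled on that candidate."""
    kept = []
    for arch in candidates:
        ff = feasibility_factor(arch)
        if ff >= 0:
            kept.append((arch, ff))  # carry ff along as an optimization objective
    # GOPS judges candidates on two criteria: cost and feasibility factor.
    return sorted(kept, key=lambda item: (item[0]["cost"], -item[1]))

if __name__ == "__main__":
    candidates = [{"cost": 8, "speed": 1.0}, {"cost": 5, "speed": 0.5}]
    demo_ff = lambda arch: arch["speed"] - 0.75   # toy stand-in for the real bound
    print(prune_candidates(candidates, demo_ff))  # the slow, cheap PE is pruned
```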

Genetic algorithms have also been used for co-synthesis. This class of algorithms is modeled on genes and mutations. A solution to the problem is represented by a string of symbols. Strings can be modified and combined in various ways to create new designs. There are three basic types of moves in genetic algorithms:

1. A reproduction step makes a copy of a string.
2. A mutation step randomly changes a string.
3. A crossover step interchanges parts of two strings.

The new designs are then evaluated for quality and possibly reused to create still further mutations. One advantage of genetic algorithms is that they can easily handle complex objective functions.
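A minimal sketch of the three moves, assuming a design is encoded as a list of symbols (the actual encoding used by any given co-synthesis tool will differ):

```python
# Minimal sketch of the three genetic-algorithm moves on symbol strings.
import random

def reproduce(s):
    """Reproduction: copy a string unchanged."""
    return list(s)

def mutate(s, symbols, rate=0.1):
    """Mutation: randomly replace some symbols."""
    return [random.choice(symbols) if random.random() < rate else c for c in s]

def crossover(a, b):
    """Crossover: interchange the tails of two strings at a random cut point."""
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

if __name__ == "__main__":
    random.seed(0)
    parent1, parent2 = list("AABBA"), list("BBABA")
    child1, child2 = crossover(parent1, parent2)
    print(mutate(child1, symbols="AB"), reproduce(child2))
```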

Dick and Jha [Dic97; Dic98] developed a genetic algorithm for hardware/software co-synthesis. Their algorithm can generate architectures with multiple processors and communication links; the architecture is optimized for both performance and power consumption. The synthesis procedure characterizes PEs as either independent or grouped. A processor can perform only one task at a time; an IC can have multiple cores on one chip so that a chip can simultaneously run several tasks.

The tasks are characterized by technology tables that give the worst-case execution time and power consumption of each task on each feasible processor. The synthesis procedure also keeps arrays that record the worst-case performance, average power consumption, and peak power consumption of each task on each feasible core.

A communication link has several attributes: packet size, average power consumption per packet, worst-case communication time per packet, cost, number of contacts (the number of points connected by the link), number of pins, and idle power consumption. Busses are modeled as communication links with more than two contacts.
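A sketch of how this characterization data might be organized; the field names and numbers are illustrative assumptions, not MOGAC's internal format:

```python
# Sketch of the characterization data described above: a technology table
# indexed by (task, PE type) and a record of communication-link attributes.
from dataclasses import dataclass

# (task, PE type) -> (worst-case execution time, average power, peak power)
technology_table = {
    ("fft", "cpu0"): (4.0e-3, 120e-3, 200e-3),
    ("fft", "dsp1"): (1.5e-3, 250e-3, 400e-3),
}

@dataclass
class Link:
    packet_size: int             # bits per packet
    avg_power_per_packet: float  # watts
    time_per_packet: float       # worst-case seconds
    cost: float                  # dollars
    contacts: int                # points connected; more than two models a bus
    pins: int
    idle_power: float            # watts

bus = Link(packet_size=32, avg_power_per_packet=5e-3, time_per_packet=1e-6,
           cost=0.5, contacts=4, pins=36, idle_power=1e-3)
print(technology_table[("fft", "dsp1")], bus.contacts > 2)  # bus-style link
```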

MOGAC keeps track of multiple objective functions and ranks designs on each objective function. The designer also specifies hard constraints on design characteristics. The genetic model for the design has several elements (a data-structure sketch follows the list):

1) The processing element allocation string lists all the PEs and their types that are allocated in the architecture. The grouped processing element allocation string records the allocation of grouped PEs.

2) The task allocation string shows which processing element has been assigned to each task.

3) The link allocation string shows how communication within the tasks is mapped to communication links. The link connectivity string records how chips and independent PEs are connected to communication links.

4) The IC allocation string shows how PEs are allocated to chips; in general, the architecture can include multiple chips with different numbers of PEs on each chip.
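A hypothetical data-structure sketch of the genome elements listed above; the field names and example values are mine rather than MOGAC's internal representation:

```python
# Hypothetical encoding of the genome elements listed above.
from dataclasses import dataclass

@dataclass
class Genome:
    pe_alloc: list           # PEs (and their types) allocated in the architecture
    grouped_pe_alloc: list   # allocation of grouped PEs (cores sharing a chip)
    task_alloc: dict         # task -> PE assigned to run it
    link_alloc: dict         # inter-task communication -> communication link
    link_connectivity: dict  # chip or independent PE -> links it attaches to
    ic_alloc: dict           # PE -> chip it is placed on

genome = Genome(
    pe_alloc=["cpu0", "dsp1"],
    grouped_pe_alloc=[("chip0", ["cpu0", "dsp1"])],
    task_alloc={"fft": "dsp1", "ctrl": "cpu0"},
    link_alloc={("ctrl", "fft"): "bus0"},
    link_connectivity={"chip0": ["bus0"]},
    ic_alloc={"cpu0": "chip0", "dsp1": "chip0"},
)
print(genome.task_alloc["fft"])  # which PE runs the fft task
```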

The MOGAC optimization procedure is summarized in Figure 7-19 below. MOGAC constructs an initial solution and then repeatedly performs the evolve/evaluate cycle.

Figure 7-19 The MOGAC optimization procedure
The evaluation phase, consisting of evaluation and ranking, determines which solutions are noninferior, that is, as good as any other solution for some design objective. Noninferior solutions are more likely to be selected for evolution, but some inferior solutions will be selected as well.

If the optimization procedure has not yet finished, some low-rank solutions are terminated while high-rank solutions are replicated. This pool of solutions is then modified using crossover and mutation.

To reduce the time required to search the design space, MOGAC organizes possible solutions into clusters. Each member of a cluster has the same allocation string but different link allocations and assignments. Some operations are applied to the cluster and affect every member of the cluster. Other operations are applied only to a single solution within the cluster.
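The control flow of such an evolve/evaluate loop might look like the following sketch; the helper callables are assumptions standing in for MOGAC's evaluation, ranking, and cluster-level moves:

```python
# Schematic of the evolve/evaluate cycle; the helper callables (evaluate,
# pareto_rank, mutate_cluster, crossover_clusters) are assumed to be supplied
# elsewhere. This shows the control flow only, not MOGAC itself.
import random

def optimize(clusters, generations, evaluate, pareto_rank,
             mutate_cluster, crossover_clusters):
    for _ in range(generations):
        scores = {id(c): evaluate(c) for c in clusters}   # evaluation
        ranks = pareto_rank(clusters, scores)             # ranking
        # Greedy cut: keep the top-ranked half. (MOGAC's selection is
        # probabilistic, so inferior solutions also get some chance.)
        clusters.sort(key=lambda c: ranks[id(c)], reverse=True)
        survivors = clusters[: max(2, len(clusters) // 2)]
        while len(survivors) < len(clusters):             # refill by evolution
            a, b = random.sample(survivors, 2)
            survivors.append(mutate_cluster(crossover_clusters(a, b)))
        clusters = survivors
    return clusters

if __name__ == "__main__":
    random.seed(1)
    demo = [[0], [1], [2], [3]]                           # four trivial "clusters"
    result = optimize(
        demo, generations=3,
        evaluate=lambda c: sum(c),                        # lower is better
        pareto_rank=lambda cs, s: {id(c): -s[id(c)] for c in cs},
        mutate_cluster=lambda c: c + [random.randint(0, 9)],
        crossover_clusters=lambda a, b: a[: len(a) // 2] + b[len(b) // 2:],
    )
    print(result)
```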

Real-time performance constraints are treated as hard constraints. The system's real-time constraint violation is the sum of the violations at all nodes in the system's task graphs. Price and average power consumption are treated as soft constraints. Solutions are ranked by their Pareto-rank, which is the number of other current solutions that do not dominate them in the multi-objective space. The rank of a cluster is defined by

where nis(x) is the set of non-inferior solutions in x and dom(a,b) is 1 if a is not dominated by b and 0 if a is dominated by b.
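A small sketch of Pareto-ranking as defined above, assuming objectives are minimized; it simply counts the other solutions that do not dominate a given one:

```python
# Sketch of Pareto-ranking: a solution's rank is the number of other current
# solutions that do not dominate it, so a higher rank means a better standing.

def dominates(a, b):
    """a dominates b: no worse on every objective, strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_rank(solution, population):
    return sum(1 for other in population
               if other is not solution and not dominates(other, solution))

population = [(10, 5.0), (12, 3.0), (9, 6.0), (11, 5.5)]  # (cost, power) pairs
print([pareto_rank(s, population) for s in population])   # [3, 3, 3, 2]
```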

A variable called solution selection elitism is used to weight the probability that a solution is selected for reproduction; its value monotonically increases during the optimization procedure. Near the end of its run, MOGAC uses greedy procedures to help it converge on local minima. The number of crossovers and mutations per generation is specified by the user.

Yang et al. [Yan01] developed an energy-aware task scheduling algorithm for multiprocessors. Their method combines design-time and runtime scheduling methods. At design time, a scheduler evaluates different scheduling and allocation choices for the set of threads to run on the multiprocessor.

The cost function is described as a Pareto curve over the performance-energy space. They use a genetic algorithm to find a schedule and assignment for the tasks. Yang et al. generate a table that is used at runtime to select the best schedule. They use heuristic algorithms to select which of the available scheduling/allocation patterns should be used.

Dick and Jha [Dic04] developed the COWLS system to co-synthesize client-server systems in which low-power clients communicate over wireless links with servers. Allocation of wireless links and scheduling of data transfers over those links is an important part of this problem, since wireless links are both relatively expensive and consume large amounts of energy.

COWLS uses a parallel recombinative simulated annealing approach similar to that used in MOGAC. Clusters of candidate solutions are analyzed, mutated, and recombined to search a large solution space.

It “Pareto-ranks” candidate solutions by finding whether one cluster of candidates is better than or equal in all solution dimensions to all the members of another cluster.

It uses three costs to guide mutation of solutions: communication time, computation time, and utilization. COWLS ranks candidate PEs during mutation by these costs; the candidate processing elements are sorted by their ranks and mutation favors highly ranked PEs.

Scheduling helps determine both the power consumption and timing of the solution. COWLS uses slack to assign task priorities, with tasks that have little slack assigned high priorities.
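A minimal sketch of slack-driven prioritization, with assumed task fields; it is only meant to show the ordering rule, not COWLS's scheduler:

```python
# Minimal sketch of slack-driven prioritization: tasks with little slack first.

def slack(task, now):
    """Slack = time remaining to the deadline minus remaining worst-case work."""
    return (task["deadline"] - now) - task["wcet_remaining"]

def by_priority(tasks, now=0.0):
    return sorted(tasks, key=lambda t: slack(t, now))   # smallest slack first

tasks = [
    {"name": "sensor", "deadline": 10.0, "wcet_remaining": 2.0},  # slack 8
    {"name": "radio",  "deadline":  5.0, "wcet_remaining": 4.0},  # slack 1
]
print([t["name"] for t in by_priority(tasks)])   # ['radio', 'sensor']
```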

Inter-task communication on the same PE is considered to be free, while communication between PEs uses the wireless links. The scheduler allocates communications to wireless links. The scheduler models bus contention.

Control and I/O Synthesis
Control is often central to communication between processes. Control operations also dominate most I/O routines. Control-dominated systems pose different challenges for co-synthesis.

The control finite-state machine (CFSM) model [Chi94] was developed to model control-dominated systems. In contrast to traditional FSMs, which operate synchronously, CFSMs have finite, non-zero, unbounded reaction times. The inputs and outputs of CFSMs are events. CFSMs are used as an intermediate form for languages such as Esterel, hardware description languages, and so on. This results in a network of communicating CFSMs.

Design partitioning assigns each component machine to either hardware or software implementation. A hardware implementation of a CFSM uses combinational logic guarded by latches to implement the next-state and output functions.

A software implementation is created by generating another intermediate form known as an s-graph, then translating that model into C code. An s-graph is a DAG that is a reduced form of a control-flow graph.

Chou et al. [Cho98] developed a modal process model as a framework for the description and implementation of distributed control. A modal process can be in one of several modes; its I/O behavior depends on its current mode as well as the input events presented to it.

Abstract control types (ACTs) define control operations with known properties. Examples of abstract control types include unify, which keeps the modes in several processes the same; mutual exclusion; and preemption. The mode manager interprets the constraints on process modes and implements ACT call requests. A mode manager can be implemented in either centralized or distributed form.

Chou et al. [Cho95b] developed a methodology to synthesize interfaces for embedded controllers. The I/O behavior is represented as control flow graphs. The designer manually selects which input/output tasks are to be implemented in hardware or software.

They first generate hardware or software implementations of the I/O tasks. Next, they allocate I/O ports to the processes, which may require adding multiplexers to share ports.

The algorithm can split an operation into several steps if the device data is wider than the bit width of the I/O port. If not enough I/O ports are available, they implement some devices using memory-mapped input/output. They then generate an I/O sequencer that ensures devices meet their response time and rate requirements.

Daveau et al. [Dav97] took an allocation-oriented approach to communication synthesis. A library describes the properties of the available communication units: the cost of a component, the protocol it implements, the maximum bus rate at which data can be transferred, a set of services it provides, and the maximum number of parallel communication operations it can perform simultaneously.

These authors model the system behavior as a process graph. Each abstract channel has several constraints: the protocol it wants to use, the services it provides to the processes, and its average and peak transfer rates.

A communication unit can implement an abstract channel if it provides the required services, uses the right protocol, and has a large enough maximum data rate to satisfy at least the channel's average communication rate.
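The compatibility test can be sketched as follows; attribute names are assumptions rather than Daveau et al.'s data model:

```python
# Sketch of the compatibility test described above: a communication unit can
# host an abstract channel if it offers the needed services, speaks the same
# protocol, and its maximum rate covers at least the channel's average rate.

def can_implement(unit, channel):
    return (channel["protocol"] == unit["protocol"]
            and set(channel["services"]) <= set(unit["services"])
            and unit["max_rate"] >= channel["avg_rate"])

unit = {"protocol": "handshake", "services": {"send", "recv", "broadcast"},
        "max_rate": 10e6, "cost": 3}
chan = {"protocol": "handshake", "services": {"send", "recv"},
        "avg_rate": 2e6, "peak_rate": 8e6}
print(can_implement(unit, chan))  # True
```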

Synthesis builds a tree of all possible implementations, then performs a depth-first search of the tree. A single allocation operation may bind several abstract channels to the same communication unit.

Memory Systems
Memory accesses dominate many embedded applications. Several co-synthesis techniques have been developed to optimize the memory system.

Li and Wolf [Li99] developed a co-synthesis algorithm that determines the proper cache size to tune the performance of the memory system. Their target architecture was a bus-based multiprocessor.

They used a simple model for processes in the instruction cache. A process occupied a contiguous range of addresses in the cache; this assumes that the program has a small kernel that is responsible for most of the execution cycles.

They used a binary variable xi to represent the presence or absence of a process in the cache: xi = 1 if process i is in the cache and 0 otherwise.

Their algorithm uses simulation data to estimate the execution times of programs. Co-synthesis would be infeasible if a new simulation had to be run for every new cache configuration.

However, when direct-mapped caches change in size by powers of two, the new placement of programs in the cache is easy to determine. The example in Figure 7-20 below shows several processes originally placed in a 1 KB direct-mapped cache.

Figure 7-20 How code placement changes with cache size.
If we double the cache size to 2 KB, then some overlaps disappear but new overlaps cannot be created. As a result, we can easily predict cache conflicts for larger caches based on the simulation results from a smaller cache.
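A small sketch of that observation for a direct-mapped cache, using made-up process placements; conflicts found at 1 KB can only disappear, never appear, when the cache grows to 2 KB:

```python
# Sketch for a direct-mapped cache: a contiguous code region maps to lines
# (address mod cache size), so growing the cache can only remove conflicts.

def cache_lines(start, length, cache_size, line=32):
    """Cache line indices occupied by a contiguous code region."""
    return {(a % cache_size) // line for a in range(start, start + length, line)}

def conflicts(procs, cache_size):
    names = list(procs)
    return {(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if cache_lines(*procs[a], cache_size) & cache_lines(*procs[b], cache_size)}

procs = {"p1": (0x0000, 512), "p2": (0x0400, 512), "p3": (0x0800, 256)}  # (start, size)
print(conflicts(procs, 1024))   # all three pairs conflict in a 1 KB cache...
print(conflicts(procs, 2048))   # ...at 2 KB only the p1/p3 overlap remains
```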

During co-synthesis, they compute the total system cost as

where C(x) is the cost of component x.

The authors use a hierarchical scheduling algorithm that builds the full hyperperiod schedule using tasks, then individually moves processes to refine the schedule.

Co-synthesis first finds an allocation that results in a feasible schedule, then reduces the cost of the hardware. To reduce system cost, it tries to move processes from lightly loaded PEs to other processing elements; once all the processes have been removed from a PE, that processing element can be eliminated from the system.

It also tries to reduce the cost of a PE by reducing its cache size. However, when processes are moved onto a PE, the size of that processing element's cache may have to grow to maintain the schedule's feasibility.

Because the execution time of a process is not constant, we must find a measure other than simple CPU time to guide allocation decisions. Dynamic urgency describes how likely a process is to reuse the cache state to reduce misses:

In this formula, SU is the static urgency of a task, or the difference between the execution time and its deadline; the worst-case execution times are measured relative to the current cache configuration.

Wuytack et al. [Wuy99] developed a methodology for the design of memory management for applications, such as networking, that require dynamic memory management. Their methodology refines the memory system design through the following steps.

Step #1. The application is defined in terms of abstract data types (ADTs).
Step #2. The ADTs are refined into concrete data structures. The proper data structures are chosen based on size, power consumption, and so on.
Step #3. The virtual memory is divided among one or more virtual memory managers. Data structures can be grouped or separated based on usage. For example, some data structures that are linked to each other may be put together.
Step #4. The virtual memory segments are split into basic groups. The groups are organized to allow parallel access to data structures that require high memory performance.
Step #5. Background memory accesses are ordered to optimize memory bandwidth; this step looks at scheduling conflicts.
Step #6. The physical memories are allocated. Multiport memories can be used to improve memory bandwidth.

Co-synthesis for Reconfigurable Systems
FPGAs are widely used as implementation vehicles for digital logic. One use of SRAM-based FPGAs is reconfigurable systems: machines whose logic is reconfigured on the fly during execution.

As shown in Figure 7-21 below, an FPGA may be able to hold several accelerators; the logic for these accelerators is embedded in the two-dimensional FPGA fabric.

Figure 7-21 Successive configurations of an FPGA.
The configuration can be changed during execution to remove some accelerators and add others. Reconfiguration during execution imposes new costs on the system.

1) It takes time to reconfigure the FPGA. Reconfiguration times may be in the millisecond range for commercial FPGAs. Some experimental FPGA architectures can be reconfigured in a small number of clock cycles.

2) Reconfiguration consumes energy.

3) Not all combinations of accelerators can be accommodated in the FPGA simultaneously. Scheduling is influenced by which combinations of accelerators can fit at any given time.

Determining the feasibility of a schedule is much more difficult for a reconfigurable system than for a traditional digital system. If we want to schedule a task at a given time, we have to first determine whether its accelerator is already resident in the FPGA.

If not, we have to determine what combination of accelerators must be removed to make room for the new one (a task known in IC design as floorplanning [Wol02]). Reconfiguration time must be added to the schedule's execution time and reconfiguration energy must be added to the cost of the schedule.
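A rough sketch of the bookkeeping this implies, with a naive eviction policy standing in for a real floorplanner; all names, areas, and times are illustrative:

```python
# Check whether the accelerator is resident; if not, evict accelerators until
# it fits (a crude stand-in for floorplanning) and charge reconfiguration time.

def schedule_on_fpga(task, resident, capacity, area, reconfig_time):
    """Return (extra delay, new resident set) for starting task's accelerator."""
    if task in resident:
        return 0.0, resident                      # already configured: no penalty
    new_resident = set(resident)
    while new_resident and sum(area[a] for a in new_resident) + area[task] > capacity:
        new_resident.remove(next(iter(new_resident)))   # arbitrary victim choice
    new_resident.add(task)
    return reconfig_time, new_resident            # delay and energy must be charged

area = {"fir": 30, "fft": 50, "aes": 40}
delay, resident = schedule_on_fpga("aes", {"fir", "fft"}, capacity=100,
                                   area=area, reconfig_time=2e-3)
print(delay, resident)
```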

CORDS [Dic98b] uses an evolutionary algorithm to synthesize programs onto reconfigurable platforms. Its basic algorithms are similar to those used by MOGAC. CORDS adds reconfiguration delay to the costs evaluated during optimization; the reconfiguration delay must be adjusted as the state of the schedule changes.

The dynamic priority of a task is equal to the negation of the sum of its slack and its reconfiguration delay. CORDS increases the dynamic priority of tasks with low reconfiguration times to encourage several similar tasks to be scheduled together, reducing total reconfiguration time.
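A one-function sketch of that priority rule with made-up numbers:

```python
# Dynamic priority is the negation of (slack + reconfiguration delay).

def dynamic_priority(slack, reconfig_delay):
    return -(slack + reconfig_delay)

ready = {"fft": (4.0, 2.0), "fir": (4.0, 0.5), "aes": (1.0, 2.0)}  # (slack, reconfig delay)
order = sorted(ready, key=lambda t: dynamic_priority(*ready[t]), reverse=True)
print(order)   # ['aes', 'fir', 'fft']: tight slack first, then cheap reconfiguration
```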

The Nimble system [Li00] performs fine-grained partitioning that exploits instruction-level parallelism to map an algorithm onto a reconfigurable platform, which consists of an embedded CPU coupled to a reconfigurable data path.

The details of the platform are configurable and are described in an architecture description language. The program to be implemented is represented as a control flow graph. Loops in the program may contain multiple kernels; profiling information is attached to the basic blocks and loop kernels.

The execution time of a loop that is implemented in whole or in part in hardware depends on several factors: execution time of the hardware loop, execution time of any software portion of the loop, communication time between hardware and software, and the time required to configure the hardware unit on the FPGA.

Configuration time depends on program state: if the hardware unit has been previously instantiated and not removed by subsequent activity, then no configuration time is required.

Synthesis concentrates on interesting loops that consume most of the execution time. Given an interesting loop, the portions of that loop selected as hardware candidates are determined by the execution and communication time, but not the configuration time, since the order of hardware loop body invocations is not yet known.
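A sketch of that intra-loop selection criterion under assumed timing fields; configuration time is deliberately left out, as described above:

```python
# A loop body is a hardware candidate when hardware time plus communication
# time beats the software time; configuration time is ignored at this stage.

def hw_candidates(loop_bodies):
    return [name for name, t in loop_bodies.items()
            if t["hw_time"] + t["comm_time"] < t["sw_time"]]

loop_bodies = {
    "dct_inner":   {"sw_time": 8.0e-3, "hw_time": 1.0e-3, "comm_time": 0.5e-3},
    "quant_inner": {"sw_time": 1.0e-3, "hw_time": 0.4e-3, "comm_time": 0.9e-3},
}
print(hw_candidates(loop_bodies))   # only dct_inner pays off in hardware
```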

Inter-loop selection determines the overall hardware/software allocation based on global costs. The synthesis algorithm walks through a graph of the loops and procedures; this graph is similar to the control dependence graph of Ferrante et al. [Fer87] . An example is shown in Figure 7-22 below.


Figure 7-22 An example loop-procedure hierarchy graph.
This graph helps identify loops that compete. Two loops that have different first-level predecessors do not conflict in the trace and could be put into different clusters. Loops that share a common predecessor may compete with each other and are preferentially placed in the same cluster.

Loops are clustered based on a bottom-up walk of the loop-procedure hierarchy graph, with the size of each cluster limited by a parameter based on the size of the FPGA.

Hardware/Software Co-simulation
Even after we verify the hardware and software components of a system independently, we need to make sure that the components work together properly. Because hardware/software systems are both large and operate at many different time scales, powerful verification tools are needed to identify improper behavior and help the designer determine the origin of a bug.

Designers should run the hardware against the software to find bugs on both sides of the interface. Hardware/software co-simulation gives the designer traditional debugging tools; co-simulators also run fast enough to give satisfactory turnaround times for design experiments.

Co-simulators provide mechanisms that allow different types of simulators to communicate. A brute force approach to simulating software that talks to hardware would be to simulate a register-transfer implementation of the CPU along with the custom hardware, setting the software bits as state in the CPU simulation and running the software by exercising the RTL model.

Even if we have a register-transfer model of the processor, which is not always the case, this approach would be unreasonably slow. Because we have cycle-accurate simulators that run considerably faster than RTL models of processors, we can use them to execute software and simulate only the custom hardware using traditional event-driven hardware simulation.

As shown in Figure 7-23 below, a simulation backplane is the mechanism that allows different simulators to communicate and synchronize. Each simulator uses a bus interface module to connect to the backplane. The backplane uses concurrent programming techniques to pass data between the simulators.

Figure 7-23 Hardware/software co-simulation using a simulation backplane.
The backplane must also ensure that the simulators receive the data at the proper time. The bus interface includes controls that allow the backplane to pause a simulator. The backplane controls the temporal progress of the simulators so that they see data arriving at the correct times.
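A conceptual sketch of a backplane's synchronization role, reduced to a sequential event queue; real backplanes coordinate concurrently executing simulators, so this only illustrates timestamped delivery:

```python
# Conceptual sketch of a simulation backplane: simulators exchange timestamped
# events so each sees data arriving at the correct simulated time.
import heapq

class Backplane:
    def __init__(self):
        self.queue = []                      # (delivery time, destination, payload)

    def post(self, time, dest, payload):
        heapq.heappush(self.queue, (time, dest, payload))

    def deliver_until(self, now, simulators):
        """Hand each simulator the events that are due by its current time."""
        while self.queue and self.queue[0][0] <= now:
            t, dest, payload = heapq.heappop(self.queue)
            simulators[dest].receive(t, payload)

class StubSimulator:
    def __init__(self, name):
        self.name = name
    def receive(self, time, payload):
        print(f"{self.name} receives {payload!r} at t={time}")

bp = Backplane()
sims = {"cpu": StubSimulator("cpu"), "hw": StubSimulator("hw")}
bp.post(5, "hw", "bus write 0xABCD")      # software side drives the bus model
bp.deliver_until(now=10, simulators=sims)
```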

Becker et al. [Bec92] built an early co-simulator to simulate a large network system. They used the programming language interface (PLI) of the Cadence Verilog-XL simulator to add C code that could communicate with software simulation modules.

They used UNIX networking operations to connect the hardware simulator to the other elements of the system being simulated, the firmware, and a monitor program. They did not simulate a processor at the cycle level.

Ghosh et al. [Gho95] later built a more general simulation environment that also used the Verilog-XL PLI to coordinate the hardware and software simulators over a backplane. Their simulator included a CPU simulator. Zivojnovic and Meyr [Ziv96] compiled software instructions to be simulated into native instructions on the host simulation processor; they attached these models to a hardware simulator for co-simulation.

Seamless Co-verification
The Mentor Graphics Seamless system simulates heterogeneous hardware/software systems. Hardware modules can be described using standard hardware description languages; software can be loaded into the simulator as binary code or as C language models. A bus interface model connects the designer's hardware modules with the processor's instruction set simulator. A coherent memory server allows some parts of the memory to be shared between hardware and software models while isolating other parts of memory; only shared memory segments need to be mediated by the co-simulation framework. A graphic profiler allows the designer to visualize system behavior.

Conclusion

Hardware/software co-design's goal is to search a very large design space to find reasonable system architectures for application-specific systems. To plausibly formulate this design space, let alone efficiently search it, we typically make some assumptions about the design.

We may work from a template, use a library of predesigned components, or use particular objective functions to drive a search. Even a few assumptions make the design problem much more tractable.

Hardware/software co-design is not a push-button solution to embedded system design. We still need the whole range of hardware and software tools to implement many of the components of the embedded system identified by co-synthesis. Co-synthesis, however, can help organize the search for reasonable architectures.

This series of articles is based on material printed with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2007, from “High Performance Embedded Computing” by Wayne Wolf. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems .
