The four Rs of efficient system design - Embedded.com

New design languages and new chips and systems mean a whole new set of design gotchas for today's developers. Once-simple tasks become difficult and, thankfully, once-difficult tasks become easy. This article for senior designers looks at newer high-level design techniques and how they can improve logic and system design.

New FPGA chips are approaching ASIC-like density and performance, with their inherent cost advantages and reprogrammability clearly in their favor. With embedded DSP and CPU cores now available for use in FPGAs, these programmable logic devices are a real alternative for many embedded systems designers.

Many designers from the ASIC world are turning to FPGAs for new designs. According to research firm Gartner Dataquest in its Market Trends report “ASIC and FPGA Suppliers Answer the Call,” more than 74,000 “design starts” used FPGAs in 2004, but only around 4,000 used ASICs. It's not easy to switch from an ASIC to an FPGA design flow, however. True, complex FPGA design shares some features with ASIC design, but under the hood, many of the steps are fundamentally different. The prebuilt nature of FPGAs encourages a “use it or lose it” mentality regarding features and capabilities. FPGA design, more often than ASIC design, must therefore match the functional requirements to the chip itself.

As high-end FPGAs encroach on ASIC performance, ASIC design techniques are being adapted for FPGA design. Two such techniques, physical synthesis for high-performance timing closure and C++ synthesis for C-based design, can illustrate the subtleties involved. The algorithms for C++ synthesis are the same for ASICs and FPGAs. By leveraging “technology-aware” synthesis (through technology-specific libraries), the same design can be implemented in either or both types of silicon fabric. The fact that C++ specifications aren't tied to the specific hardware is considered a primary advantage. To make full use of physical synthesis, however, you need a tool that understands the FPGA's internal hardware structure.

We'll first introduce today's conventional hardware-design flow and examine its associated problems. We'll explain alternative approaches to hardware design using C/C++, comparing the pros and cons of timed design languages with those of untimed, or algorithmic, methods. Toward the end, we'll explain why you must consider interconnect delay and physical effects in the design process to achieve optimal performance.

Algorithmic C synthesis
The conventional flow for high-end electronic designs involves handcrafting Verilog or VHDL representations. These manual methods were effective in the past but many of today's new designs are so complex that traditional design practices are now inadequate. Creating register transfer level (RTL) implementations for high-end FPGAs has become as time-consuming as ASIC design.

A way around this is to design, simulate, and synthesize C representations. By using pure untimed C++ to describe functional intent, engineers can move up an abstraction level for designing hardware, reducing design time, creating a more repeatable design flow, and preserving the option of implementing the design in either ASIC or FPGA. An added benefit is that by exploring multiple microarchitectural solutions, engineers can often produce better designs than those created through traditional RTL methods.

Traditional hardware design
Many high-end designs in the communications or video/image processing industries rely on extremely complex algorithms. The first step in a conventional design flow involves modeling and proving the design functions at the algorithmic level of abstraction, using tools such as MATLAB or plain C/C++ modeling.

MATLAB works well for validating and proving the initial algorithm, although many design teams also develop C/C++ models to verify that the whole system meets functional and performance specifications. For subsequent discussion, we'll use the term untimed algorithm to represent those algorithms written either in MATLAB or pure ANSI C/C++.

Based on project requirements, system architects then partition the design into blocks of hardware or software. Each hardware block's function is initially represented by a floating-point algorithm. Either the system designer or the hardware designer then quantizes the floating-point algorithm into an integer or fixed-point representation. These fixed-point algorithms are represented in MATLAB using Simulink or in untimed C++ using bit-accurate types.
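To make the quantization step concrete, here is a minimal C++ sketch of converting a floating-point value to a signed fixed-point code. The function names, the round-to-nearest policy, and the saturation behavior are assumptions chosen for illustration; a real project would pick word lengths and overflow modes to suit the algorithm.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: quantize a real value to a signed fixed-point code
// that is `total_bits` wide with `frac_bits` fractional bits.
// Assumed policy: round to nearest, saturate on overflow.
inline std::int32_t quantize(double x, int total_bits, int frac_bits) {
    assert(total_bits >= 2 && total_bits <= 31);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const double scale = std::ldexp(1.0, frac_bits);  // 2^frac_bits
    const std::int32_t max_code = (std::int32_t{1} << (total_bits - 1)) - 1;
    const std::int32_t min_code = -max_code - 1;
    const double scaled = std::round(x * scale);
    if (scaled > max_code) return max_code;  // saturate high
    if (scaled < min_code) return min_code;  // saturate low
    return static_cast<std::int32_t>(scaled);
}

// Convert a fixed-point code back to a real value, for error checking.
inline double dequantize(std::int32_t code, int frac_bits) {
    return std::ldexp(static_cast<double>(code), -frac_bits);
}
```

With 8 total bits and 6 fractional bits, for example, 0.5 maps to the code 32, and values beyond the representable range clamp to 127 or -128. Comparing `dequantize(quantize(x, 8, 6), 6)` against `x` gives the quantization error that the designer has to validate.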

After validating the fixed-point algorithm, the hardware designer starts the manual process of creating Verilog or VHDL for the RTL abstraction. We can subdivide this process into three distinct phases:

  • Defining microarchitecture. The system architects decide on the structure of the data path, control, and interfaces. They typically do this on paper or perhaps an Excel spreadsheet. The resulting microarchitecture has a significant effect on the overall speed and area of the hardware. Based on the decisions made in this step, designs can easily vary by 10 times in area or performance.
  • Writing the RTL design. The hardware engineers then manually write the RTL to represent the defined microarchitecture.
  • Optimizing RTL area/timing. The hardware engineers iterate through RTL synthesis and RTL code modifications to meet design goals.

The hardware engineers manually translate the floating-point untimed algorithm into bit-accurate RTL, either Verilog or VHDL. This RTL is subsequently synthesized into a gate-level netlist using traditional RTL-synthesis tools. The main problems associated with this traditional flow are:

  • Communicating functional intent. There is a significant conceptual and representational divide between the system architects working with untimed algorithms and the hardware designers working with the timed RTL in VHDL/Verilog. As a result, the original design intent specified by the system architect can easily be misinterpreted, causing functional errors in the end product. And while it's pretty easy to implement and evaluate specification changes in the untimed algorithm, it's painful and time-consuming to subsequently fold these changes into the RTL. This is a serious consideration in communications applications, because broadcast standards and protocols frequently evolve and change.
  • Meeting design goals. Predicting design performance (area, delay, power) is difficult until the RTL is done. Therefore, system-level partitioning and the resulting block-level design goals are inaccurate at best. Many system-level timing-closure problems are directly related to poor macroarchitectural choices and unrealistic goals unwittingly placed on the engineers designing the hardware blocks.
  • Dealing with design complexity. Because the untimed algorithmic domain and RTL domain are dissimilar, the manual translation from untimed algorithms to RTL is prolonged and error-prone. In addition, RTL uses technology-dependent coding styles that “hard-code” the microarchitecture. Evaluating alternative implementations is impractical because modifying and re-verifying RTL for “what-if” analyses of alternate microarchitectures takes too long. Such evaluations may include performing certain operations in parallel versus sequentially; pipelining portions of the design versus leaving them unpipelined; or sharing common resources. Because of the time involved, design teams are limited in the number of evaluations they can perform, which can result in a suboptimal implementation. The complexity of high-end, compute-intensive applications exemplifies the difficulties associated with traditional hand-coded RTL.
  • Reusing RTL. Using the same RTL for an ASIC and an FPGA means the ASIC implementation is likely suboptimal because of the FPGA's performance limitations. Conversely, users can meet their performance goals in an FPGA through massive parallelism that might not be necessary for an ASIC. This makes it difficult, if not impossible, to retarget a complex RTL design for an optimal implementation in both technologies.
  • Verifying functionality. Using traditional logic simulation to verify a large design represented in RTL is compute-intensive and slow.

The most important challenge facing the design team is that all of the implementation “intelligence” associated with the design is hard-coded into the RTL, which therefore becomes rigid and implementation-specific.

C flow—the next generation
As we've seen, the shortcomings of the typical RTL design flow (shown in Figure 1) are the inability to explore the design space and the time it takes to write, verify, and synthesize the RTL.


Figure 1: The ideal design flow depicted on the right is based on algorithmic synthesis of pure, untimed C/C++, which addresses the problems associated with the traditional flow (shown on the left) where the untimed algorithm is hand-translated into RTL

The ideal flow should be based on industry-standard ANSI C/C++, which has been the language of choice for software and system-level modeling for many years. The pure, untimed C/C++ written by system designers is an excellent source for creating hardware because it's devoid of implementation details. After verification, the hardware engineer uses a C synthesis tool to automatically generate optimized RTL from the C/C++ representation. The RTL output is then used to drive existing RTL-synthesis tools as shown in Figure 1. With this flow, you can synthesize the untimed C/C++ directly into a gate-level netlist. This maximizes flexibility and provides a source that is “malleable,” that is, capable of targeting ASICs, FPGAs, highly compact small solutions, and highly parallel fast solutions. The translation from MATLAB to C/C++ is still manual, but because these domains are conceptually very close, the translation is relatively quick and easy.
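As a rough sketch of what “pure, untimed” source might look like, consider a small FIR filter written as an ordinary C++ function. The filter length, the names, and the 32-bit types are hypothetical choices for this example. It contains no clocks, ports, or pipeline registers; those decisions would be left to the synthesis tool and its constraints.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr std::size_t kTaps = 4;  // hypothetical filter length

// Untimed FIR filter: one call consumes one sample and produces one output.
// `history` is the delay line, newest sample first.
inline std::int32_t fir(std::int32_t sample,
                        std::array<std::int32_t, kTaps>& history,
                        const std::array<std::int32_t, kTaps>& coeffs) {
    // Shift the delay line by one position and insert the new sample.
    for (std::size_t i = kTaps - 1; i > 0; --i) {
        history[i] = history[i - 1];
    }
    history[0] = sample;

    // Multiply-accumulate. A synthesis tool could unroll this loop into
    // parallel multipliers or keep it sequential on one shared multiplier.
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += static_cast<std::int64_t>(coeffs[i]) * history[i];
    }
    // Software-model guard: the result must fit the declared output width.
    assert(acc >= std::numeric_limits<std::int32_t>::min() &&
           acc <= std::numeric_limits<std::int32_t>::max());
    return static_cast<std::int32_t>(acc);
}
```

Feeding a single 1 followed by zeros into a cleared delay line returns the coefficients in order, which is a quick functional check that runs at native software speed.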

Using untimed C/C++ adds a lot of value by providing much faster simulation than the MATLAB Simulink environment, and is therefore ideally suited for system-level validation. Moreover, generating the intermediate RTL provides a timed “comfort zone” for existing flows by allowing you to validate the implementation decisions made by the C synthesis tool.

Furthermore, RTL is a useful point to stitch the various functional blocks together. Large portions of today's designs exist in the form of IP blocks delivered as RTL. This means RTL is a useful point in the design flow for integrating and verifying the entire hardware system. Design teams can take full advantage of existing RTL-design tools for test insertion or power analysis, for example. The ideal flow based on algorithmic synthesis of pure, untimed C/C++ addresses all of the traditional bottlenecks:

  • Communicating functional intent. Almost no conceptual gap exists between the system architects and the hardware designers because both use the same untimed C/C++ source. Their worlds are connected for the first time. Moreover, it eliminates any chance of misinterpretation by the hardware designer, reducing errors and improving reliability. The new flow also accommodates changes to the design specification.
  • Meeting requirements. Algorithmic C synthesis provides accurate estimates up front that can be used to make system-level macroarchitecture decisions, the better to meet system-performance goals. This avoids lengthy RTL synthesis iterations since the algorithmic C tool can generate RTL code and constraints together.
  • Dealing with design complexity. You can address design complexity by moving up in abstraction. Algorithmic C is quick and efficient to create and verify, which benefits system-level validation and integration. Whereas RTL uses technology-dependent coding styles and hard-codes the microarchitecture, C can be modified and re-verified quickly, so you can perform a series of “what-if” evaluations of alternative algorithms and implementations. Thus, your design teams aren't limited in the number of evaluations they can perform.
  • Reusing RTL. A key feature of this flow is that the C representation is completely independent from the final implementation. Therefore, instead of embedding implementation “intelligence” into the C representation, you can use such intelligence to drive the C-to-RTL implementation through a series of “soft” constraints. In turn, this means that you can easily re-target the same C representation for different microarchitectures and ASIC or FPGA implementations.
  • Verifying functionality. Verifying C is fast and efficient. A pure untimed C representation will simulate as much as 10,000 times faster than an equivalent RTL representation (the larger the design, the faster C is compared with its RTL counterpart).

SystemC
The SystemC language provides a comprehensive verification environment that allows C++ designs to be simulated at mixed levels of abstraction. It uses C++ class libraries to model hardware structures such as modules, ports, interfaces, and concurrency.

Designers use SystemC for system-level verification, but the complexity of the language creates barriers for system and hardware designers alike. Using SystemC at the RTL provides little (if any) value over VHDL or Verilog as shown in Figure 2. The value comes in at higher levels of abstraction and is useful for system-level verification, but as yet there is no consensus on what should and should not be synthesizable in the SystemC language. It's clear that coding interface definitions in SystemC removes the ability to easily make interface tradeoffs since this requires complex changes to the C++ source (for example, a dual-port memory interface is substantially different from a CPU interface). Editing SystemC models is not an effective way to explore architectural alternatives.


Figure 2: To make a behavioral or RTL SystemC representation suitable for RTL generation or direct C synthesis, you would need to write it at nearly the same level of abstraction as hand-translated RTL

SystemC synthesis can be accomplished by using SystemC data types with pure, untimed C++. This “algorithmic SystemC” source is the highest abstraction of SystemC and provides the greatest value to the end user (technology independent, interface independent, microarchitecture independent). Adding the ability to generate a cycle-accurate SystemC model enables an algorithmic C synthesis tool to benefit from the SystemC verification environment, yet avoid issues of hard coding technology intent in SystemC descriptions.

Handel-C
Handel-C is typical of the home-grown C-based simulation and synthesis languages developed by universities and EDA companies. It preserves traditional C syntax and control structures, making it easy for C programmers and hardware designers to understand. In addition to hardware-centric datatypes, Handel-C also includes special keywords/extensions that facilitate dataflow representations and support parallel programming. This flow involves manually translating the untimed algorithm into Handel-C. Following verification via simulation (which requires Celoxica's compiler), the Handel-C representation is directly synthesized into a gate-level netlist as shown in Figure 3.


Figure 3: You may end up taking as much time creating adequate Handel-C as you would hand-creating the RTL, thereby nullifying the advantage of C-based design flows

Using a proprietary language means users cannot use alternative simulation or synthesis tools. As a result, many engineers prefer standards-based alternatives. Theoretically, the manual translation of MATLAB to Handel-C should be relatively painless because the Handel-C representation is close to pure C. In practice, coercing Handel-C to adequately capture the design in a form suitable for the synthesis engine requires intensive work by an expert user.

Here again, the pseudo-timing constructs required for the synthesis and simulation of Handel-C representations are foreign to both system-level and hardware designers. All of the implementation “intelligence” associated with the design has to be hard-coded into the Handel-C, which therefore becomes implementation-specific. Furthermore, users have minimal control over the Handel-C synthesis engine, which is something of a “black box” to work with and which doesn't take advantage of the target technology (for example, the engine takes no account of elements like multipliers and RAM blocks in an FPGA). This implies some nonintuitive manipulation of the C code to achieve speed and size requirements. In short, design teams may end up taking as much time creating adequate Handel-C as they would hand-creating the RTL, thereby nullifying the advantage of C-based design flows.

Higher synthesis abstraction
As we noted previously, the most significant problem with existing C-based design flows is that the implementation “intelligence” associated with the design has to be hard-coded into the C representation, which then becomes implementation-specific. Ideally, the C code should be virtually identical to what a system designer would write to model functional behavior without any preconceived hardware-implementation or target-device architecture in mind.

Instead of adding intelligence to the source code (thereby locking it into a target implementation), all of the intelligence should be provided by controlling the synthesis engine itself with user-defined constraints. New tools are available that use C++ source code augmented with SystemC data types, which allow specific bit-widths to be associated with variables and constants. An advantage is that many companies already create an untimed C/C++ representation of their designs for algorithmic validation. They do this because a pure C representation is easy and compact to write and simulates 100 to 10,000 times faster than an equivalent RTL representation.
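To illustrate what associating a bit-width with a variable means, the sketch below defines a small stand-in for a bit-accurate unsigned type. The class `UIntW` is hypothetical and written only for this example; in a real flow you would use the SystemC `sc_uint<W>` type or a similar library type rather than a home-made class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a bit-accurate unsigned type: every result is
// truncated to W bits, as W-bit hardware would do.
template <int W>
class UIntW {
    static_assert(W >= 1 && W <= 32, "width must be 1..32 bits");

public:
    constexpr UIntW(std::uint64_t v = 0) : value_(v & kMask) {}

    constexpr std::uint64_t value() const { return value_; }

    // Read a single bit; the index must lie inside the declared width.
    bool bit(int i) const {
        assert(i >= 0 && i < W);
        return ((value_ >> i) & 1u) != 0;
    }

    friend constexpr UIntW operator+(UIntW a, UIntW b) {
        return UIntW(a.value_ + b.value_);  // wraps modulo 2^W
    }
    friend constexpr UIntW operator*(UIntW a, UIntW b) {
        return UIntW(a.value_ * b.value_);  // keeps the low W bits
    }

private:
    static constexpr std::uint64_t kMask = (std::uint64_t{1} << W) - 1;
    std::uint64_t value_;
};
```

With this type, `UIntW<8>(200) + UIntW<8>(100)` wraps to 44 instead of producing 300, so the C++ simulation matches what an 8-bit adder would compute. The algorithm itself stays untimed; only the numeric behavior is pinned down.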

The only modification typically required in newer C-based design tools is to add a single pragma to the source code to indicate the top of the functional portion of the design; anything conceptually above this point is considered part of the test bench. Once the tool has read the source code, the designers can immediately perform microarchitecture tradeoffs and evaluate their effects in terms of size and speed. Ideally, each of these evaluations should complete within a few seconds or minutes, depending on design size. The tool should report total size/area along with latency in terms of clock cycles or input-to-output delays (or, in the case of pipelined designs, throughput time/cycles). A C synthesis tool should also be able to name, save, and reuse any of these “what-if” scenarios. Conventional, iterative, hand-coded RTL flows would make it almost impossible to perform these tradeoffs in a timely manner.
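The kind of size-versus-speed tradeoff described here can be sketched with a toy cost model. Everything below is an assumption made for illustration (one cycle per multiplication, independent operations, area counted in multiplier instances); it is not how any particular synthesis tool estimates results.

```cpp
#include <cassert>

// Toy "what-if" result: an area proxy and a latency figure.
struct Estimate {
    int multipliers;     // number of multiplier instances used
    int latency_cycles;  // cycles needed to finish all multiplications
};

// Schedule `ops` independent multiplications onto at most `multipliers`
// shared units, assuming each unit completes one operation per cycle.
inline Estimate schedule(int ops, int multipliers) {
    assert(ops > 0 && multipliers > 0);
    const int used = multipliers < ops ? multipliers : ops;
    return Estimate{used, (ops + used - 1) / used};  // ceiling division
}
```

For 16 multiplications, `schedule(16, 1)` gives one multiplier and 16 cycles, `schedule(16, 4)` gives four multipliers and 4 cycles, and `schedule(16, 16)` gives 16 multipliers and a single cycle. The source algorithm is the same in all three cases; only the constraint changes.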

More importantly, the fact that the C source code isn't required to contain any implementation “intelligence”—all such intelligence is supplied by constraints to the synthesis engine itself—means that design teams can easily retarget the same source code to alternative microarchitectures and different implementation technologies. The fundamental difference between the various C-based design flows is the level of synthesis abstraction they support as shown in Figure 4.


Figure 4: C-based synthesis design flows that support a higher level of synthesis abstraction accelerate implementation time and increase design flexibility when compared with other C-based flows

Physical synthesis for FPGAs
Achieving timing closure in the fewest design cycles is a huge FPGA design challenge. Timing closure solutions using standalone logical synthesis and place-and-route (P&R) are iterative and nondeterministic by nature. Many alternatives have been proposed. Physical synthesis is one technique that helps designers close timing more quickly than other methods such as floorplanning, random modifications to constraints, or repeated place-and-route iterations. Without physical synthesis, designers typically write and rewrite RTL code, provide guidance to the P&R tools by grouping cells, and possibly attempt some floorplanning. An alternative is simply to make numerous P&R runs. Usually, the RTL code/constraints are modified with only some heuristic notion that these changes will improve timing. Designers must iterate through P&R, the most time-consuming step in FPGA design, before learning whether the changes were a step in the right direction or only served to worsen the problem. This unpredictability reduces the cost benefits and time-to-market advantages of using programmable logic in the first place.

Synthesis routines and decisions that are driven by knowledge of the physical layout of the target device tend to achieve a much better result than those that only perform logical synthesis. To reduce design iterations and improve accuracy, the design team must consider interconnect delay and physical effects up front. The following sections outline some of the ASIC-strength algorithms that are effectively used in optimizing complex FPGA designs.

Beyond logic synthesis and floorplanning
In the ASIC or FPGA world, the word “synthesis” instinctively means RTL logic synthesis. This is the approach of a traditional FPGA synthesis tool: first synthesize, then perform technology mapping.

Cell delay (propagation delay) was the dominant delay factor in older FPGAs and the traditional approach to logic synthesis was good enough to meet the timing requirements in FPGA designs. Reducing the number of logic levels or reducing a function's area reduced cell delay and so timing was met. Unfortunately, this formula doesn't extend to the new FPGAs. As we validate Moore's Law and process technology shrinks accordingly, transistors get smaller and more of them fit into a single chip. The timing bottleneck in an FPGA design has shifted from exclusively cell delays to include interconnect delays. In fact, in the newer generations of FPGA designs, net delays regularly exceed 70% of the total delay.

Synthesis routines with the goal of producing the most efficient logic don't guarantee better performance after the design has been placed and routed. This is because traditional synthesis methods estimate the route delay based on wire-load models (WLMs). A WLM is a number typically calculated from a statistical estimate of a net's delay, using factors such as parasitics and fanout. Optimization decisions are based on identifying the critical (usually longest) logic path. A wire-load estimate will in many cases identify a different critical path than the one that really exists after technology mapping (in other words, after place and route). This means a significant amount of performance is still left on the table.
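A small numeric sketch shows how a fanout-based estimate can pick the wrong critical path. The structure, the linear delay-per-fanout model, and all delay values are hypothetical; real wire-load models and routed delays come from the vendor's tools.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One timing path: summed cell delay, total fanout along its nets, and the
// net delay measured after place and route. All values are hypothetical.
struct Path {
    double cell_delay_ns;
    int total_fanout;
    double routed_net_delay_ns;
};

// Wire-load style estimate: net delay grows linearly with fanout.
inline double estimated_delay(const Path& p, double ns_per_fanout) {
    return p.cell_delay_ns + ns_per_fanout * p.total_fanout;
}

// Post-layout delay: cell delay plus the net delay actually routed.
inline double routed_delay(const Path& p) {
    return p.cell_delay_ns + p.routed_net_delay_ns;
}

// Index of the slowest path under the chosen delay view.
inline std::size_t critical_path(const std::vector<Path>& paths,
                                 bool use_routed, double ns_per_fanout) {
    assert(!paths.empty());
    std::size_t worst = 0;
    double worst_delay = -1.0;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        const double d = use_routed ? routed_delay(paths[i])
                                    : estimated_delay(paths[i], ns_per_fanout);
        if (d > worst_delay) {
            worst_delay = d;
            worst = i;
        }
    }
    return worst;
}
```

Take two paths, `{3.0, 10, 1.5}` and `{2.0, 6, 4.5}`, with 0.2 ns per fanout. The estimate rates the first path as critical (5.0 ns against 3.2 ns), but after routing the second path is the slow one (6.5 ns against 4.5 ns). Optimizing the first path would have been wasted effort.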

Floorplanning is a proven and successful approach for ASIC design, but has its limitations when used in FPGAs. In ASICs, routing is not predetermined, so designers can do intelligent preplacement floorplanning to reduce wire lengths and minimize delay. For FPGAs, the prebuilt routing fabric creates specific structural limitations. Fanout-based delay estimates in FPGAs don't model even a simplified version of this physical reality, so calling them timing “estimates” is optimistic.

As an example, consider a highly utilized FPGA in which a large number of paths miss timing, and these paths span several blocks of the design. The traditional method of solving such timing problems is pre-place-and-route floorplanning. Without any timing analysis capability, this process is painfully iterative. Since each place-and-route run takes a few hours, the quickest turnaround to view the results of a floorplanning change is most of a day. As such, the process can take weeks or months to meet timing goals. Even then, the physical constraints created by the floorplanning process cannot be transferred to later revisions of the same design.

Physical synthesis tools are commonly mistaken for floorplanners. Floorplanning is a process aimed exclusively at efficient handling of large multimillion-gate designs. While both tools have a unique identity, you can achieve clear benefits when using their capabilities in tandem. A user could be allowed to floorplan sections of a design and run physical synthesis on the individual modules until the desired performance is achieved. This “divide and conquer” approach coupled with the predictability of performance saves a lot of time over using the traditional FPGA flow.

The Four Rs
Tying in the RTL-synthesis tool to its physical-synthesis counterpart can produce a sizable jump in both performance and productivity. Unlike with a standalone physical-synthesis tool, the designer's visibility doesn't stop at the post-synthesis technology level. Instead, the designer is able to cross-probe all the way up to the RTL source code. For instance, designers challenged with larger, high-end FPGA designs today should be able to perform timing analysis on a complex design and extend this functionality by effectively cross-probing between the timing report and the physical view of the design. This way, a design bottleneck can be solved either at the RTL or the physical level (by recoding if necessary). This gives the user a significant amount of control when analyzing a design, be it at the RTL, constraint, technology, or physical level.

Many physical-synthesis techniques that are used for ASICs (resizing drivers and buffering signals, for example) don't work well for FPGAs. FPGA optimizations must take advantage of the device's internal structure. Physical-synthesis algorithms fall into four categories: retiming, replication, re-placement, and resynthesis. Let's look at each of them.

Register retiming
Register retiming is one of the strongest algorithms for improving timing—when it can be done. Reductions of up to 15% in overall critical path timing are not uncommon. Though retiming is also done during logic synthesis, the retiming performed during physical optimization is more effective because it uses more accurate timing information.

Even in a circuit with very critical timing, there are many paths that easily meet their timing goals. Excess time available for data propagation, or slack, is unevenly distributed, with some circuit paths having negative slack while some have positive slack. Register retiming finds situations where slack on one side of a register is positive, while slack on the other side is negative as shown in Figure 5. Under the right conditions it's possible to move the register, effectively moving some of the delay through the register, without affecting the functionality of the design at the primary output ports as illustrated in Figure 6. Ideally, the result will be positive slack on both sides of the register.
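The slack bookkeeping behind this idea can be sketched in a few lines of C++. The structure and function names are hypothetical, and the model ignores register setup and clock-to-output times for simplicity.

```cpp
#include <cassert>

// Slack = clock period minus path delay; negative slack is a violation.
inline double slack(double clock_ns, double path_delay_ns) {
    return clock_ns - path_delay_ns;
}

// Combinational delay feeding a register and the delay leaving it.
struct RegisterStage {
    double delay_in_ns;
    double delay_out_ns;
};

// Forward retiming sketch: moving the register forward across `moved_ns`
// of downstream logic shifts that delay from the output side of the
// register to its input side.
inline RegisterStage retime_forward(RegisterStage s, double moved_ns) {
    assert(moved_ns >= 0.0 && moved_ns <= s.delay_out_ns);
    return RegisterStage{s.delay_in_ns + moved_ns, s.delay_out_ns - moved_ns};
}
```

With a 10 ns clock, a stage with 6 ns of logic before the register and 12 ns after it has slacks of +4 ns and -2 ns. Moving 3 ns of logic across the register gives 9 ns on each side, so both slacks become +1 ns and the total path delay is unchanged.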


Figure 5: A simple circuit before retiming


Figure 6: The circuit in Figure 5 after retiming

Retiming can only occur if it's possible to perform the transformation without modifying the function of the circuit at the boundary pins. An important consideration here is to maintain the initial state (reset state) of the register. Additionally, it's important that design latency not change—the same number of register stages must exist before and after retiming. For retiming to work, the following must be true:

  • Registers must have positive slack on one side and negative slack on the other
  • All logic cones into a candidate LUT (lookup table) must be registered for a move to be possible (this maintains latency)
  • All registers to be moved must have consistent clock signals
  • Registers with both set and reset inputs cannot be retimed
  • Registers with clock enables will be converted to MUX-feedback logic to allow retiming

While control signals must stay consistent to allow retiming, control signals may change based on the function of the combinatorial logic. Figure 7 illustrates an example of retiming a register through inverting logic. Note that the tool must implement the original reset signal as set logic to maintain the same initial state at the outputs.
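The reset-to-set conversion can be checked with a small cycle-by-cycle simulation. This is a behavioral sketch with hypothetical function names, not a description of how a synthesis tool performs the transformation internally.

```cpp
#include <cassert>
#include <vector>

// Original circuit: a register that reset clears to 0, followed by an
// inverter. Returns the output observed in each clock cycle.
inline std::vector<bool> register_then_invert(const std::vector<bool>& d) {
    bool q = false;  // reset clears the register
    std::vector<bool> out;
    for (bool bit : d) {
        out.push_back(!q);  // inverter output during this cycle
        q = bit;            // clock edge captures the input
    }
    return out;
}

// Retimed circuit: the inverter now precedes the register, whose initial
// state is chosen by the caller (true models a set/preset connection).
inline std::vector<bool> invert_then_register(const std::vector<bool>& d,
                                              bool initial_q) {
    bool q = initial_q;
    std::vector<bool> out;
    for (bool bit : d) {
        out.push_back(q);  // register output during this cycle
        q = !bit;          // clock edge captures the inverted input
    }
    return out;
}
```

For any input sequence, `invert_then_register(d, true)` produces the same outputs as `register_then_invert(d)`. Starting the moved register at 0 instead would make the very first output differ, which is why the reset has to become a set.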


Figure 7: Register retiming changes the reset connection to preset to preserve initial states

Moving registers forward during retiming is better than moving registers backwards. Backward retiming is more expensive since it nearly always means adding extra registers to the design. But some timing problems can only be attacked with backward retiming.

Pipeline stage insertion is similar to register retiming with some differences. Inserting pipeline stages or registers into a design is a common way to manually fix timing problems with RTL code changes. Here, the designer no longer needs to restructure the logic to make it appropriate for pipelining, since the register retiming algorithm allows the tool user to infer extra registers at the start or end of the path, and automatically distributes the registers throughout the logic to maximize performance.

To distribute pipeline stages, the retiming algorithm must work differently. Figure 8 shows a typical example. The optimum circuit would result from placing the two registers on the right at Point A and Point B. But once the first register is placed at Point B, the ordinary retiming rules allow no further move because moving it in either direction would worsen circuit slack (that is, increase the negative slack). Pipeline retiming must move the register the maximum distance, even if slack temporarily worsens. Physical synthesis automatically switches to pipeline retiming rules when it finds serial registers in the design.
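The goal of distributing pipeline registers can be illustrated with a brute-force search over register positions in a chain of logic delays. This sketch shows only the objective (minimizing the slowest stage); it is not the retiming algorithm a physical-synthesis tool uses, and the names and delay values are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Worst stage delay when registers sit after elements `cut_a` and `cut_b`
// (0-based, cut_a < cut_b) of a chain of combinational delays.
inline double worst_stage(const std::vector<double>& chain,
                          std::size_t cut_a, std::size_t cut_b) {
    assert(cut_a < cut_b && cut_b + 1 < chain.size());
    auto sum = [&chain](std::size_t first, std::size_t last) {  // [first, last)
        return std::accumulate(
            chain.begin() + static_cast<std::ptrdiff_t>(first),
            chain.begin() + static_cast<std::ptrdiff_t>(last), 0.0);
    };
    return std::max({sum(0, cut_a + 1), sum(cut_a + 1, cut_b + 1),
                     sum(cut_b + 1, chain.size())});
}

struct Cuts {
    std::size_t a;
    std::size_t b;
    double worst_ns;
};

// Try every pair of register positions and keep the best-balanced one.
inline Cuts best_cuts(const std::vector<double>& chain) {
    assert(chain.size() >= 3);
    Cuts best{0, 1, worst_stage(chain, 0, 1)};
    for (std::size_t a = 0; a + 2 < chain.size(); ++a) {
        for (std::size_t b = a + 1; b + 1 < chain.size(); ++b) {
            const double w = worst_stage(chain, a, b);
            if (w < best.worst_ns) best = Cuts{a, b, w};
        }
    }
    return best;
}
```

For six 2 ns elements, registers clustered near the end (after elements 3 and 4) leave an 8 ns stage, while the balanced placement (after elements 1 and 3) yields three 4 ns stages. Reaching the balanced result requires moving both registers a long way from where they started.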


Figure 8: Pipeline retiming must move the register the maximum distance, even if slack is temporarily increased

Register replication
In the world of register-rich programmable logic it's common to hear designers say, “registers are free in FPGAs.” For that reason register replication is a common technique used by all FPGA optimization tools. Logic-synthesis tools have typically used register replication to control signal fanout. But with the ability to control pl
