Transitioning from C/C++ to SystemC in high-level design

In high-level design, high-level code passes through a series of steps on its way to becoming register transfer level (RTL) code. The first step, algorithm design, is usually done in C or C++ and produces the high-level code that describes how the system will function. To be implemented in hardware, this code must be converted to RTL code by a synthesis tool. However, running high-level synthesis directly on the output of the algorithm design phase almost never produces a desirable RTL implementation. An architecture design phase must precede high-level synthesis in order to produce RTL code with the desired characteristics.

Making a translation to SystemC for this step has become the preferred high-level design method. In this article, I'll give some examples of steps taken in the architecture design phase that can help you achieve good RTL code.

High-level design has many advantages over the more commonplace design flow that begins with RTL code. Among the most compelling is the improved verification efficiency that a higher level of abstraction offers. It is close to self-evident that source code written at a higher level of abstraction contains fewer errors than source code written at a lower level. However, a process is still required to verify the transformations that are applied to the design description as it proceeds through the design flow from creation to final realization.

Figure 1 shows the initial steps in the high-level design (HLD) flow. Figure 2 shows the flow with the verification steps and the design loops added. A verification step after each design step can result in a loop back to fix a design error. Another loop after each of the two synthesis steps can return you to the architecture design step, where you'll improve whatever relevant design criteria have not been met. This loop back to the architecture design step is a key part of the high-level design process.

Architecture design
The architecture design step in Figure 1 is the fundamental “hardware design” step in a high-level design flow. High-level design, particularly when done using “Plain old C,” is often presented as a process where the designer simply takes a C algorithm, runs it through a high-level synthesis tool, and gets high-quality RTL. This high-quality result, however, almost never happens.

Hardware design is done in the architecture design step for a fundamental reason: algorithms developed in C are software representations, and the costs and benefits of a hardware implementation are very different from those of a software implementation.

Some of the fundamental differences between software and hardware implementations are:

• Hardware is hierarchical, and the communication between subunits (subroutines in software, modules in hardware) is different. While only a few calling conventions implement software interfaces, there are many more possibilities for hardware interfaces. Plain old C (PoC) provides call by value, which, along with the ability to pass pointers, suffices for subroutine calls. However, it has no facility for representing the myriad variations of hardware interfaces, such as ready-valid, trigger-done, AMBA bus read/write, or a custom bus. There are often good reasons for a hardware implementation to use a custom interface.

• The cost of memory access in hardware is much higher than in software. Because memory is cheap in software, PoC algorithms often simply store a lot of data in memory, then access it in whatever fashion is convenient for the algorithm. By contrast, the same algorithm implemented in hardware can often work on a much smaller window of the data at a time, with results either written once or passed on to another processing phase. Image processing algorithms are usually done this way: the algorithm in software simply iterates over the whole image, while the hardware implementation works on a smaller window that moves over the image, as Example 1 demonstrates.

• Conditional expressions are cheap in software but can be expensive in hardware. One illustration is accessing array elements with a variable index when the array is mapped onto dedicated hardware registers, as shown in Example 2.

Design optimization
It's convenient to do the architecture design step in something other than Plain old C, because C cannot represent a few key hardware-related features, notably hierarchy, data-path widths, and concurrency. SystemC was developed for just this reason. SystemC is not a separate language, but simply a C++ class library that provides the missing features. It's far easier to do architecture design in SystemC than it is to do it in C.

Example 1 shows a C algorithm as it would be written for software implementation and how it would be modified for hardware implementation. Making the change to SystemC from the original C is a small step compared with the iterative process that takes place while refining the hardware implementation.
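The contrast that Example 1 describes can be sketched roughly as follows. This is a minimal illustration, not the article's original example: the 3×3 box sum, the names `box3x3_frame` and `Box3x3Stream`, and the raster-order pixel stream are all assumptions chosen for the sketch, and it is written in plain C++ rather than SystemC so that it compiles without the SystemC library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software style (hypothetical example): the whole frame sits in memory
// and is indexed in whatever order is convenient.
std::vector<int> box3x3_frame(const std::vector<int>& img,
                              std::size_t w, std::size_t h) {
    assert(w >= 3 && h >= 3 && img.size() == w * h);
    std::vector<int> out;
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x) {
            int sum = 0;
            for (std::size_t dy = 0; dy < 3; ++dy)
                for (std::size_t dx = 0; dx < 3; ++dx)
                    sum += img[(y - 1 + dy) * w + (x - 1 + dx)];
            out.push_back(sum);
        }
    return out;
}

// Hardware style (hypothetical example): pixels arrive one at a time in
// raster order. Only two rows and a 3x3 window are stored.
class Box3x3Stream {
public:
    explicit Box3x3Stream(std::size_t w) : w_(w), row2_(w, 0), row1_(w, 0) {
        assert(w >= 3);
    }

    // Accepts one pixel. Returns true and writes `out` once the window
    // holds three full columns from three consecutive rows.
    bool push(int px, int& out) {
        // Column of three vertically adjacent pixels at the current x.
        const int col[3] = {row2_[x_], row1_[x_], px};
        row2_[x_] = row1_[x_];  // row y-1 becomes row y-2 for the next row
        row1_[x_] = px;         // current row becomes row y-1 for the next row
        for (int r = 0; r < 3; ++r) {  // shift the window left by one column
            win_[r][0] = win_[r][1];
            win_[r][1] = win_[r][2];
            win_[r][2] = col[r];
        }
        const bool valid = (y_ >= 2 && x_ >= 2);
        if (valid) {
            out = 0;
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c)
                    out += win_[r][c];
        }
        if (++x_ == w_) { x_ = 0; ++y_; }
        return valid;
    }

private:
    std::size_t w_;
    std::vector<int> row2_, row1_;  // line buffers for rows y-2 and y-1
    int win_[3][3] = {};            // the moving 3x3 window
    std::size_t x_ = 0, y_ = 0;     // position of the incoming pixel
};
```

Feeding every pixel of a frame to `push` in raster order yields the same sums, in the same order, as `box3x3_frame`, but the streaming version keeps only two rows and nine window values rather than the whole frame.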

Example 2 shows a SystemC fragment that adds two elements from an array, where each operand is half the array apart. The straightforward original code results in two big muxes when implemented in hardware. The less obvious modified code results in a much smaller hardware implementation.
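One possible form of such a rewrite is sketched below. It is an assumption for illustration, not the code from Example 2: the array size `N`, the function names, and the masking trick are hypothetical, and the block is plain C++ rather than SystemC. The idea is that the original indexes leave the tool to select among all `N` registers for each operand, while the modified indexes make the top index bit a constant, so each operand can in principle be selected from only half of the registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 16;        // hypothetical register-array size
constexpr std::size_t HALF = N / 2;  // distance between the two operands
static_assert((HALF & (HALF - 1)) == 0, "sketch assumes HALF is a power of two");

// Straightforward form: both reads use a general variable index, and the
// second index needs an addition.
int add_pair_original(const std::array<int, N>& a, std::size_t i) {
    assert(i < HALF);
    return a[i] + a[i + HALF];
}

// Modified form: the top index bit is forced to 0 for the first read and
// to 1 for the second, so each read can only reach one half of the array.
int add_pair_modified(const std::array<int, N>& a, std::size_t i) {
    assert(i < HALF);
    const std::size_t lo = i & (HALF - 1);  // index into the lower half
    const std::size_t hi = lo | HALF;       // same offset in the upper half
    return a[lo] + a[hi];
}
```

Both functions return the same value for every valid `i`. Whether a given synthesis tool actually produces smaller multiplexers from the second form depends on the tool.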

The examples show transformations made to source code to produce a better hardware result out of high-level synthesis. These are examples of optimization where the modified code will result in a smaller, more efficient hardware implementation when high-level synthesis is applied. In many cases, the design process is one of successive optimizations in order to get a hardware implementation that is as small, fast, and/or power efficient as possible. This is a manual process, where the hardware designer uses his knowledge and experience to make the appropriate code transformations.

One can ask why such a manual process is necessary, since the promise of high-level synthesis is to remove the drudgery of mapping higher-level operations to hardware. The answer is that there are many ways to implement an algorithm in hardware, and no single choice is the best under all design constraints. Some transformations, it could be argued, should always be done, and could thus be implemented in the high-level synthesis tool, as in Example 2. However, there are a great many cases where the tool could in principle recognize the original code pattern and apply the transformation itself, but in practice the designer has to rewrite the code. No high-level synthesis tool will recognize them all (at least within our lifetimes), which means the hardware designer will always have a job.

Design exploration
The architecture design step is also the step where design exploration is done. Exploration has been recognized as one of the great benefits of doing high-level design, since it's relatively easy to control the high-level synthesis tool to produce implementations that have different properties, typically area, latency, or power consumption. We usually think of this process as adding either global or local directives to the synthesis tool. Examples of global directives are “unroll all the loops” and “flatten all the arrays.” An example of a local directive would be “set the latency of this block to be three cycles.” It's easy to see that design exploration can be accomplished by changing these directives, for example, “unroll loop A but not loop B,” “map arrays to memories,” or “set the latency to five cycles.”

However, there are occasions where changing the source code can be useful for design exploration. Example 3, a 4×4 matrix multiply, shows such a case.

This is a 4×4 matrix multiplication, where the data type of the matrix elements is “Mtype.” Mtype could be changed from integer to fixed point to floating point, and this code would not change. The conditional code (“#if FASTER … #else … #endif”) shows two different implementations of the inner-most loop. In the first version of elementloop, a multicycle pipelined part can be created that will do four multiply operations and three additions, producing a resulting matrix element every cycle, after an initial latency. This code required hard-coding the three addition operations, so if DIM were a value other than four, that line of code would have to be changed. The second version of elementloop results in a much smaller implementation, using a single multiply and addition that will produce a resulting matrix element every four cycles. The synthesis directives to unroll the loops, flatten the arrays, pipeline the inner loops, and create custom datapath functional units are not shown.

Example 3 shows two alternatives to writing the inner loop of the matrix multiply, and it shows the difference between writing code as software and writing it for hardware implementation. Executed on a standard processor, both of these versions would have the same performance, since there is only one multiplier and one adder available. However, when using high-level synthesis, the synthesis tool is able to use more than one multiplier and more than one adder in the first case. In the second case, the synthesis tool would use just one multiplier and one adder (or a combined multiply-accumulate functional unit).
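The two alternatives can be sketched along the following lines. This is a reconstruction for illustration only, not the article's listing: `DIM` and `Mtype` follow the description above, but the function names and the use of `std::array` are assumptions, the two variants are written as separate functions instead of being selected with `#if FASTER`, and synthesis directives are omitted. The block is plain C++ rather than SystemC.

```cpp
#include <array>
#include <cassert>

constexpr int DIM = 4;
using Mtype = int;  // could be swapped for a fixed-point or floating-point type
using Matrix = std::array<std::array<Mtype, DIM>, DIM>;

// "Faster" variant: each result element is one expression with four
// multiplies and three hard-coded additions, which a synthesis tool can
// map onto several multipliers and adders working in parallel.
Matrix matmul_faster(const Matrix& a, const Matrix& b) {
    static_assert(DIM == 4, "the sum below is hard-coded for DIM == 4");
    Matrix c{};
    for (int row = 0; row < DIM; ++row)
        for (int col = 0; col < DIM; ++col)
            c[row][col] = a[row][0] * b[0][col] + a[row][1] * b[1][col]
                        + a[row][2] * b[2][col] + a[row][3] * b[3][col];
    return c;
}

// "Smaller" variant: one multiply and one add per iteration, which maps
// naturally onto a single multiply-accumulate unit reused DIM times.
Matrix matmul_smaller(const Matrix& a, const Matrix& b) {
    Matrix c{};
    for (int row = 0; row < DIM; ++row)
        for (int col = 0; col < DIM; ++col) {
            Mtype acc = 0;
            for (int k = 0; k < DIM; ++k)
                acc += a[row][k] * b[k][col];
            c[row][col] = acc;
        }
    return c;
}
```

Both functions compute the same product. The difference lies only in how much parallelism the source exposes to the synthesis tool.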

A new step toward quality
The architecture design step is where hardware design is done. The designer uses this step to experiment with different RTL architectures, and then uses it to optimize the implementation after the architecture has been decided. Both architecture exploration and optimization can be done by varying the synthesis directives or by changing the high-level source code.

John Sanguinetti is chief technology officer at Forte Design Systems and has been active in computer architecture, performance analysis, and design verification for 20 years. After working for DEC, Amdahl, ELXSI, Ardent, and NeXT computer manufacturers, he founded Chronologic Simulation in 1991 and was president until 1995. He was the principal architect of VCS, the Verilog Compiled Simulator and was a major contributor to the resurgence in the use of Verilog in the design community. Dr. Sanguinetti served on the Open Verilog International Board of Directors from 1992 to 1995 and was a major contributor to the working group which drafted the specification for the IEEE 1364 Verilog standard. He was a cofounder of Forte Design Systems. He has 15 publications and one patent. He has a Ph.D. in computer and communication sciences from the University of Michigan.
