
HW/SW co-verification basics: Part 3 – Hardware-centric methods

As we have seen in Part 1 and Part 2, there are benefits and drawbacks of using software models of microprocessors and other hardware. This section discusses techniques that avoid model creation issues by using a representation of the microprocessor that doesn't depend on an engineer coding a model of its behavior.

As the world of SoC design has evolved, the design flows used for microprocessor and DSP IP have changed. In the beginning, most IP for critical blocks such as the embedded microprocessor was delivered as hard IP. The company creating the IP wanted to make sure the user realized the maximum benefit in terms of optimized performance and area. A hard macro also allows the IP to be used without revealing all of the source code of the design. As an example, most ARM7TDMI designs use a hard macro licensed from ARM.

RTL Model of CPU with Software Debugging
Today, most SoC designs don't use hard macros but instead use a soft macro in the form of synthesizable Verilog or VHDL. Soft macros offer better flexibility and eliminate portability issues in the physical design and fabrication process.

Now that the RTL code for the CPU is available and can easily be run in a logic simulator or emulation system, everybody wants to know the best way to perform co-verification. Is a separate model like the instruction set simulator really needed? It does not seem natural to most engineers (especially hardware engineers) to replace the golden RTL of the CPU, the representation of the design that will be implemented in silicon, with something else. The reality is that the RTL code can be used for co-verification and has been used successfully by project teams.

The drawback of using the RTL code is that it can only execute as fast as the hardware execution engine it is running on. Since it runs entirely inside the hardware execution engine, there is no chance to take the simulation shortcuts that are possible (or automatic) with host-code execution or instruction set simulation.

Historically, logic simulation has always been too slow to make this technique interesting to investigate. After all, a simulation environment for a large SoC typically runs at less than 100 cycles/sec, and at this speed it is not possible to use a software debugger to perform interactive debugging.

The primary area where this technique has seen success is with simulation acceleration and emulation systems that are capable of running at much higher speeds. With a hardware execution engine that runs at a few hundred kHz up to 1 MHz, it is possible to interactively debug software running on the RTL model of the CPU.

To perform co-verification with an RTL model of the microprocessor, a software debugger must be able to communicate with the CPU RTL. To debug software programs, a software debugger requires only a few primitive operations to control execution of a microprocessor. This can best be seen in a summary of the GNU debugger (gdb) remote protocol requirements. To communicate with a target CPU gdb requires the target to perform the following functions:

1) Read and write registers
2) Read and write memory
3) Continue execution
4) Single step
5) Retrieve the current status of the program (stopped, exited, and so forth)

In fact, gdb provides an interface and specification, called the remote protocol, that implements a communication channel between the debugger and the target CPU and provides the functionality gdb needs to debug a program.
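Every message in this protocol travels in a packet of the form $&lt;payload&gt;#&lt;checksum&gt;, where the checksum is the modulo-256 sum of the payload bytes written as two hex digits. The following minimal sketch (in C) shows how a reply could be framed; the put_byte() transport routine is a placeholder for whatever channel — serial port, Ethernet, or a simulation link — actually carries the data:

```c
#include <stdio.h>

/* Transport placeholder: sends one byte to gdb over whatever channel is in
 * use (serial port, Ethernet socket, or a simulation-side bridge). */
extern void put_byte(char c);

/* Frame and send one remote protocol packet: $<payload>#<checksum>.
 * The checksum is the modulo-256 sum of the payload characters, appended
 * as two lowercase hex digits. */
void send_packet(const char *payload)
{
    unsigned char sum = 0;
    char trailer[4];

    put_byte('$');
    for (const char *p = payload; *p != '\0'; p++) {
        sum = (unsigned char)(sum + *p);
        put_byte(*p);
    }
    snprintf(trailer, sizeof(trailer), "#%02x", sum);
    for (const char *p = trailer; *p != '\0'; p++)
        put_byte(*p);
}
```

A register read, for example, comes back as one long hex string framed this way, and gdb acknowledges each packet it receives with a single '+' character.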

On a silicon target, where the chip is placed on a board, the only way to send and receive the protocol information is to add special code to the user's software running on the embedded processor. This code communicates with gdb, supplying information such as register and memory contents.

The piece of code added to the software is called a gdb stub. The stub (running on the target) communicates with gdb running on a different machine (the host) over a serial port or an Ethernet connection. While this may seem complicated, it is the easiest way to debug without requiring dedicated debug provisions in the silicon.
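To make the idea concrete, here is a rough sketch of the heart of such a stub: a loop, entered from the target's debug or breakpoint exception handler, that reads packets from gdb and maps the single-character commands onto the five operations listed earlier ('g'/'G' for registers, 'm'/'M' for memory, 'c' to continue, 's' to single-step, '?' for status). The helper functions are hypothetical placeholders for the target-specific pieces; get_packet() and send_packet() would use the framing shown above over the serial or Ethernet link.

```c
/* Hypothetical target-specific helpers assumed to exist elsewhere. */
extern int  get_packet(char *buf, int len);      /* receive one $...# packet from gdb */
extern void send_packet(const char *payload);    /* frame and send a reply            */
extern void reply_registers(void);               /* dump the register file as hex     */
extern void write_registers(const char *hex);
extern void reply_memory(const char *args);      /* "addr,length" -> hex data bytes   */
extern void write_memory(const char *args);      /* "addr,length:data"                */
extern void resume(int single_step);             /* return control to the application */

/* Core of a gdb stub; runs whenever the CPU traps into the debug handler. */
void gdb_stub_loop(void)
{
    char pkt[512];

    send_packet("S05");                      /* report SIGTRAP: target is stopped */
    for (;;) {
        if (get_packet(pkt, sizeof(pkt)) <= 0)
            continue;
        switch (pkt[0]) {
        case 'g': reply_registers();                           break;
        case 'G': write_registers(pkt + 1); send_packet("OK"); break;
        case 'm': reply_memory(pkt + 1);                       break;
        case 'M': write_memory(pkt + 1);    send_packet("OK"); break;
        case '?': send_packet("S05");                          break;
        case 'c': resume(0); return;         /* continue execution       */
        case 's': resume(1); return;         /* execute one instruction  */
        default:  send_packet("");           break;  /* command not supported */
        }
    }
}
```

Because register and memory data are exchanged as plain hex strings, the stub itself stays small enough to live comfortably alongside the application in the target's memory.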

The good news is that for simulation acceleration and emulation applications there is much greater flexibility since it is really a simulation of the CPU RTL code and not a piece of silicon. The difference is visibility. In silicon there is no visibility.

There is no way to see the values of the registers inside without the aid of software to export the values or special purpose hardware to scan out the values. Simulation, on the other hand, has very good visibility. In a simulation acceleration or emulation platform, all of the values of the registers and wires are visible at all times.

This visibility makes the use of the gdb remote protocol even better than its original intent, since the user no longer needs a special stub in the embedded system code. The solution is completely transparent to the user: gdb uses the remote protocol specification to talk to the simulation, both of which are programs running on a PC or workstation.

This technique requires no changes to gdb, and the work to implement it is contained in the simulation environment to bridge the gap between gdb and the data it is requesting from the simulation. The architecture of using the gdb remote protocol with simulation acceleration and emulation is shown in Figure 6.18 below.

Figure 6.18: gdb Connected to the RTL Code of the Microprocessor
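To make the bridging idea concrete, the sketch below shows one way such a bridge might look, assuming a simple TCP server that gdb attaches to with "target remote localhost:1234" and a hypothetical sim_read_register() hook that peeks a register of the simulated CPU (in practice this would go through whatever PLI/VPI or acceleration API the simulator or emulator provides). Only the register-read command is handled, the protocol's '+' acknowledgements are omitted, and send_gdb_packet() is assumed to apply the $...#checksum framing shown earlier.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical hooks into the simulation environment. */
extern unsigned int sim_read_register(int regnum);          /* peek a simulated CPU register */
extern void send_gdb_packet(int fd, const char *payload);   /* $...#checksum framing         */

#define GDB_PORT 1234   /* arbitrary; gdb attaches with "target remote localhost:1234" */
#define NUM_REGS 16     /* register count depends on the CPU being simulated           */

/* Answer gdb's 'g' (read all registers) request with values taken directly
 * from the simulation; each 32-bit register is sent LSB-first as hex. */
static void reply_all_registers(int fd)
{
    char payload[NUM_REGS * 8 + 1];
    char *p = payload;
    for (int r = 0; r < NUM_REGS; r++) {
        unsigned int v = sim_read_register(r);
        for (int b = 0; b < 4; b++)
            p += sprintf(p, "%02x", (v >> (8 * b)) & 0xff);
    }
    send_gdb_packet(fd, payload);
}

/* Accept one gdb connection and service remote protocol requests. Only the
 * register-read command is sketched; memory, step, and continue follow the
 * same pattern. */
int gdb_bridge_run(void)
{
    struct sockaddr_in addr = { 0 };
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(GDB_PORT);
    if (listener < 0 || bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    listen(listener, 1);

    int fd = accept(listener, NULL, NULL);
    char buf[512];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
        buf[n] = '\0';
        char *cmd = strchr(buf, '$');        /* start of a remote protocol packet */
        if (cmd && cmd[1] == 'g')
            reply_all_registers(fd);
        /* other commands ('m', 'c', 's', '?') would be dispatched here */
    }
    close(fd);
    close(listener);
    return 0;
}
```

From gdb's perspective nothing has changed: it issues the same "target remote" command it would use for a real board, which is why no modifications to gdb are needed.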

Hardware Model with Logic Simulation
Another way to eliminate the issues associated with microprocessor models is to use the concept of a “hardware model.” A hardware model uses the silicon of the microprocessor as a model for Verilog and VHDL simulation. A custom socket holds the silicon, captures the outputs from the silicon and sends them to a logic simulator, and applies the inputs from the simulator to the input pins of the silicon.

The communication mechanism between the hardware modeler and the simulator must involve software to talk to the simulator, so a network connection is most natural. The concept is much like that of a tester, where the stimulus and response are provided by a logic simulator. The architecture of using the hardware model for co-verification is shown in Figure 6.19 below.


Figure 6.19: Hardware Model of the Microprocessor

Software debugging with the hardware model can be accomplished in multiple ways. In the previous section, the gdb stub was presented. This is a technique that can be used on the hardware model to debug software. Unlike the RTL model in a simulation environment, the hardware model cannot provide visibility of the internal registers, so the user must integrate the stub with the other software running on the microprocessor.

The other technique for debugging software is a JTAG connection for those microprocessors that support this type of debugging by providing dedicated silicon to connect to the JTAG probe and debugger. In both cases, performance of the environment can limit the utility of the hardware model for software debugging.

The hardware model can also provide local memory in the hardware to service some memory requests that are not required to be simulated. For pure software development, software engineers are interested in high performance and less interested in simulation detail.

By servicing some of the memory requests locally on the hardware modeler and avoiding simulation, the software can run at a much higher speed. Hardware modelers can run at speeds of up to 100 kHz when running independently of the logic simulator. Of course, in lock-step mode they will only run as fast as the logic simulator and exchange pin information every cycle.
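Conceptually, the split amounts to an address-map decision made before the simulator is ever involved. The ranges below are purely illustrative (code and data served from RAM on the modeler, peripheral registers passed to simulation); a real hardware modeler exposes this as tool configuration rather than user code.

```c
#include <stdint.h>

typedef enum { SERVE_LOCALLY, SEND_TO_SIMULATOR } route_t;

/* Illustrative address map: instruction and data memory are served from RAM
 * on the hardware modeler at full speed; only accesses that actually touch
 * the design under verification are passed to the logic simulator. */
static route_t route_access(uint32_t addr)
{
    if (addr < 0x00100000)                        /* code and data: local RAM        */
        return SERVE_LOCALLY;
    if (addr >= 0x40000000 && addr < 0x50000000)  /* peripheral registers in the DUT */
        return SEND_TO_SIMULATOR;
    return SERVE_LOCALLY;                         /* everything else stays local     */
}
```

The more traffic that falls on the local side of this map, the closer the modeler gets to its standalone speed.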

With the hardware model, co-verification is no longer completely virtual, since a real sample of the microprocessor is used, but for those engineers who have had negative experiences with poor simulation models in the past, the concept is very easy to understand and very appealing. What could be a better model than the chip itself?

Clocking limitations are one of the main drawbacks of the hardware model. To do interactive software debugging, the CPU must be capable of running slowly and maintaining its state. Early hardware modeling products were developed at a time when many microprocessor chips started using phase-locked loops and could not be slowed down because the PLLs don't work at slow speeds.

To get around this problem, the hardware modeler would reset the device and replay the previous n vectors to get to vector n + 1. This allowed the device to be clocked at speeds high enough to support PLL operation, but made software debugging impossible, except by using waveforms from the logic simulator.

As we have seen, today's microprocessors come in two flavors: the high-performance variety with PLLs and those more focused on low power. The high-performance variety usually has mechanisms to bypass the PLL to enable static operation, and the low-power variety is meant for static design and is very flexible in terms of slow clocking and even stopping the clock.

Unfortunately, experiments with such processors have revealed that when the PLL is bypassed, device behavior is no longer 100% identical to behavior with the PLL. For low-power cores like ARM, irregular clocking can also be trouble, since the clock input must be treated more like a data input: it has to be sampled in simulation and is not required to be regular.

With the RTL core becoming more common, there are now products that provide an FPGA for the synthesizable CPU and link to the logic simulator in the same way as the more traditional hardware modeler. Using the CPU in an FPGA gives some benefit by allowing JTAG debugging products to be used, but performance is still likely to be a concern. If the JTAG clock can run independently of the logic simulator, high performance can be obtained for good JTAG debugging.

Evaluation Board with Logic Simulation
The microprocessor evaluation board is a popular way for software engineers to test code before hardware is available. These boards are readily available at a reasonable cost. To extend the use of the evaluation board to co-verification, the board can serve a purpose similar to that of the instruction set simulator.

Since most boards have networking support, a socket connection between the board and the logic simulator can be developed. A bus functional model residing in the logic simulator can interface the board to the rest of the hardware design. The architecture of using the evaluation board for co-verification is shown in Figure 6.20 below.


Figure 6.20: Microprocessor Evaluation Board with Logic Simulation

This combination of a CPU board connected to logic simulation via a socket connection and BFM is most appealing to software engineers since the performance of the board is very good. Since each is running independently, there is no synchronization or correlation between the two time domains of the board and the logic simulator.

The drawback to this type of environment is the need to add custom software to the code running on the CPU board to handle the socket connection to the logic simulator. Some commercial co-verification vendors provide such a library that may be suitable, but it must always be modified, since each board is different and the software operating environment differs from one real-time operating system to another. Although the solution requires a lot of customization, it has been used successfully on projects.
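A sketch of what the board-side piece of such a library might look like is shown below. Everything here is an assumption for illustration: the wire format, the function names, and the idea that the simulator side runs a small server in front of the BFM; a vendor library would differ in the details but perform the same job of forwarding bus transactions over a socket.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Wire format for one bus transaction sent to the BFM in the logic
 * simulator. The layout is an assumption for illustration only. */
struct bus_txn {
    uint32_t is_write;
    uint32_t addr;
    uint32_t data;
};

static int sim_fd = -1;   /* socket to the workstation running the simulator */

/* Connect to the simulator-side server (hypothetical host and port). */
int sim_link_open(const char *ip, uint16_t port)
{
    struct sockaddr_in addr = { 0 };
    sim_fd = socket(AF_INET, SOCK_STREAM, 0);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    return connect(sim_fd, (struct sockaddr *)&addr, sizeof(addr));
}

/* Forward a write to the simulated hardware; the BFM replays it on the bus. */
void sim_write32(uint32_t addr, uint32_t data)
{
    struct bus_txn t = { .is_write = 1, .addr = addr, .data = data };
    write(sim_fd, &t, sizeof(t));
}

/* Forward a read and block until the BFM returns the data from simulation. */
uint32_t sim_read32(uint32_t addr)
{
    struct bus_txn t = { .is_write = 0, .addr = addr, .data = 0 };
    uint32_t value = 0;
    write(sim_fd, &t, sizeof(t));
    read(sim_fd, &value, sizeof(value));
    return value;
}
```

Driver or application code under test then calls sim_read32()/sim_write32() wherever it would normally dereference a pointer to the device's registers, which is exactly the kind of modification to the board software described above.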

In-Circuit Emulation
In-circuit emulation involves using external hardware connected to an emulation system that runs at much higher speeds than a logic simulator. Emulation is an attractive platform to do co-verification since the higher speed enables software to run faster. This section discusses three different ways to perform co-verification with an emulation system.

The first method is useful for microprocessor cores that are available in RTL form. As we have seen, there is a trend for the IP vendors to provide RTL code to the user for the purposes of simulation and synthesis. If this is available, the microprocessor can be mapped directly into the emulation system.

Most cores used in SoC design today support some kind of JTAG interface for software debugging. To perform co-verification a software engineer can connect a JTAG probe to the I/O pins of the emulator and communicate with the CPU that is mapped inside the emulator. The architecture of using a JTAG connection to an emulator for co-verification is shown in Figure 6.21 below.


Figure 6.21: JTAG Connection to an Emulation System

In this mode of operation, the CPU runs at the speed of the emulation system, in lock-step with the rest of the design. The main issues in performing co-verification are the overall speed of the emulator and its ability to maintain the JTAG connection reliably at speeds that are lower than most hardware boards.

A second way to perform co-verification with an emulation system is to use a board with a microprocessor test chip and connect the pins of the chip to the I/O pins of the emulator. This technique is useful for hard macro microprocessor IP such as the ARM7TDMI that cannot be mapped into the emulation system. JTAG debugging can also be done by connecting to the JTAG port on the chip. The architecture of using a test chip and JTAG connection with an emulator for co-verification is shown in Figure 6.22 below.


Figure 6.22: JTAG Connection with Test Chip and Emulation System

Like the previous method, the CPU core will run at the speed of the emulation system. Signal values will be updated on each clock cycle. The result is a cycle-accurate simulation of the connection between the test chip and the rest of the design.

This cycle-accurate, lock-step simulation is desired by hardware engineers who want to model the system exactly while using emulation technology to run long software tests and regression tests faster.

In both of the previous techniques, the user must confirm that the JTAG software and hardware being used for debugging can tolerate slow clock speeds. Most emulation systems run in the 250 kHz to 1 MHz range, depending on the emulation technology and the design being run on the emulator.

While this is much faster than a logic simulator, it is much slower than what the developers of the JTAG tools probably expected. Most JTAG tools have built-in time-outs, either in the hardware or in the software debugger (or both), for situations when the design is not responding. It is crucial to verify that these time-outs can be turned off.

Emulation, like simulation, allows the user to stop the test by pressing Ctrl+c, waiting for some unspecified amount of time, and then restarting operation. If time-outs exist in the JTAG solution, this will certainly cause a disconnect and result in the loss of software debugging.

The best way to provide a stable JTAG connection is to use a feedback clock to the JTAG hardware to help it adapt its speed based on the speed of the emulation system.

The third co-verification method commonly used with emulation is to use a speed bridge between hardware containing a microprocessor device and the emulation system.

The classic case for this application is verification of a chip that connects to the PCI bus. A common setup involves software engineers developing device drivers for operating systems such as Windows or Linux, where the board they are writing the driver for sits on the PCI bus.

Since the PCI board is not yet available, they can use a PC to test the software, and the emulation system provides a PCI board that plugs into the PC and bridges the speed difference between the real speed of the PCI bus in the PC (33 or 66 MHz) and the slower speed of the emulator.

The PC will run at full speed until the device driver makes a memory or I/O access to the slot with the hardware being developed. When this occurs, the bridge to the emulator will detect the PCI transaction and send it over to the emulator.

While the emulator is executing the PCI transaction, the bridge card will continuously respond with a retry response to stall the PC until the emulator is ready. Eventually, the emulator will complete the PCI transaction and the bridge card will complete the transaction on the PC. This method is shown in Figure 6.23 below.

Similar environments are common for embedded systems where a board containing a microprocessor can run an RTOS such as VxWorks and communicate with the emulator through a speed bridge for a bus such as PCI or AHB.

Figure 6.23: Speed Bridge and Emulation System
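The retry mechanism can be summarized with a small behavioral sketch of the decision the bridge card makes on every host access. This is an illustration of the protocol behavior only, not how the bridge hardware is implemented, and the emulator_* functions are invented names standing in for the bridge's internal state machine.

```c
#include <stdint.h>

/* Possible responses of a PCI target to a host access. */
typedef enum { PCI_RETRY, PCI_COMPLETE } pci_status_t;

/* Invented names for queries into the speed bridge's state machine. */
extern void     emulator_start_transaction(uint32_t addr);  /* forward the access */
extern int      emulator_has_result(uint32_t addr);         /* emulator finished? */
extern uint32_t emulator_result(uint32_t addr);             /* completion data    */

/* Behavioral model of the bridge card's response to one read from the PC.
 * The PC keeps re-issuing the access while it receives Retry, so from the
 * device driver's point of view the access simply takes a long time. */
pci_status_t bridge_target_read(uint32_t addr, uint32_t *data_out)
{
    static int in_flight = 0;

    if (!in_flight) {                  /* first attempt: hand it to the emulator  */
        emulator_start_transaction(addr);
        in_flight = 1;
    }
    if (!emulator_has_result(addr))
        return PCI_RETRY;              /* stall the PC until the emulator is done */

    *data_out = emulator_result(addr); /* emulator finished the PCI transaction   */
    in_flight = 0;
    return PCI_COMPLETE;
}
```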

FPGA Prototyping
I always get a laugh when the FPGA prototype is discussed as a co-verification technique. Prototyping is really just building the system out of programmable logic and using the debugger as if the final hardware had been constructed.

The only difference may be that FPGAs are substituted for ASICs, and as a result the performance is lower than that of the final implementation. Since hardware debugging is very difficult, prototyping barely qualifies as co-verification, but because the representation of the hardware is available before the final product, it is a useful way for software engineers to get early access to the hardware to debug software.

Recent advances in FPGA technology have caused many projects to reexamine hardware prototyping. With FPGAs from Altera and Xilinx now exceeding 250 k to 500 k ASIC gates, custom prototyping has become a possibility for hardware and software integration.

Until now, design flow issues, tool issues, and the great density differences between ASIC and FPGA have limited the use of prototyping. With the latest FPGA devices, most ASICs can now be mapped into a set of one to six FPGAs. New partitioning tools have also been introduced that work at the RT level and do not require changes to the RTL code or difficult gate-level, post-synthesis partitioning.

Although prototyping is easier than it has ever been, it is still not a trivial task. Prototyping issues fall into two categories: FPGA resource issues and ASIC/FPGA technology differences. Common resource issues include the limited number of I/O pins available on the FPGA and the number of clock domains available in an FPGA.

Technology differences can be related to differences in synthesis tools, forcing the user to modify the design to map to the FPGA technology. Another common technology issue is gated clocks, which are difficult to handle in FPGA technology.

If resource and technology issues can be overcome, prototyping can provide the highest-performance co-verification solution, one that is scalable to large numbers of software engineers. Before committing to prototyping, it is important to clearly understand the issues as well as the cost.

On the surface, prototyping appears cheap compared to the alternatives, but as with all engineering projects, cost should be measured not only in hardware but also in the engineering time needed to create a working solution.

Next in Part 4: Co-verification metrics.
To read Part 1, go to “Determining what and how to verify.”
To read Part 2, go to “Software-centric co-verification methods.”

This series of articles by Jason Andrews is from “Embedded Software: Know It All” edited by Jack Ganssle, used with permission from Newnes, a division of Elsevier. Copyright 2008. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Jason Andrews, author of Co-Verification of Hardware and Software for ARM SoC Design, has implemented multiple commercial co-verification tools as well as many custom co-verification solutions. His experience in the EDA and embedded marketplace includes software development and product management at Verisity, Axis Systems, Simpod, Summit Design, and Simulation Technologies. He has presented technical papers and tutorials at the Embedded Systems Conference, Communication Design Conference, and IP/SoC and written numerous articles related to HW/SW co-verification and design verification. He has a B.S. in electrical engineering from The Citadel, Charleston, S.C., and an M.S. in electrical engineering from the University of Minnesota.
