Basics of core-based FPGA design: Part 2 – System design considerations

A number of system design factors require consideration when implementing an FPGA processor, including the use of co-design, the processor architectural implementation, system implementation options, processor core and peripheral selection, and the implementation of hardware and software.

Embedded software development has the potential to consume 50% or more of an embedded processor design schedule. It is therefore important to establish and follow a cohesive hardware and software development flow on a rapid system development project. Close collaboration between the hardware and software design teams helps streamline and parallelize development.

The parallel development of hardware and software is called co-design. Effective co-design is important to implementing an efficient rapid system development effort. Co-design has the potential to impact many of the elements associated with embedded project development, supporting increased system flexibility and reduced schedule.

The system design tool chain can be critical to efficient co-design. The tool chain is the collection of hardware and software tools used for design entry, simulation, configuration and debug. An effective tool chain will provide a high level of interaction and synchronization between the hardware and software tool sets and design files. Figure 14.3 below illustrates the interactions and relationships between the two tool flows.

Figure 14.3. FPGA co-design flow
In evaluating co-design tools, two of the most important factors affecting the selection are tool maturity and ease of use. The embedded FPGA processor software tool chain should include a software development kit (SDK), which supports efficient development of low level drivers, and a range of operating system implementations. The hardware tools should support the efficient integration of IP and hardware and software debug synchronization. Some desirable co-design tool characteristics are presented in the following list.

Desirable Co-Design Tool Characteristics

1) Automated tools that hide the details but keep them accessible
– Intelligent tools must understand all details of the platform options, but provide a high level of abstraction to streamline design and synchronize hardware and software components.
– Tool sophistication matched to the complexity of the design

2) Tool functions that can accelerate development
– Wizards and generators

3) Easy to learn and use
– Intuitive, user-friendly interface

4) Supports complete control of the design
– Robustness to change/control without loss of flexibility

5) Powerful integrated debug capabilities

6) Integrated baseline control capability

Processor architecture alternatives
Since the RISC architecture is arguably the most implemented processor architecture, this book will limit discussions to the RISC architecture. When designing with a RISC-based processor, there are many architectural considerations affecting hardware and software design optimization. This section will highlight some of the RISC architectural considerations.

Achieving optimal system performance (required throughput) is a critical element of embedded processor design implementation. Optimal system performance is accomplished by informed design implementation of the hardware and software. Processor architecture is a critical factor that determines system performance. Understanding the architecture of the processor selected will assist the design team in making informed design decisions.

The RISC architecture increases processor performance by imposing single-cycle instruction execution. This point is clarified by considering Equation 14.1 below, a common equation used to derive a processor’s performance. If the number of cycles per instruction is reduced in this equation, the processor performance is increased.

However, this increase in performance comes as a consequence of an increase in the number of instructions required to implement a software program, and thus an increase in the software program size. The result of the larger software program size is an increase in the number of external memory operations, which serves to reduce system performance.

Factors that influence system performance optimization include: processor core implementation, bus implementation and architecture, use of cache, use of a memory management unit (MMU), interrupt capability, and software program flow.

Equation 14.1. Basic RISC Processor Formula
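The equation itself is not reproduced in this excerpt; a common form of the basic processor-performance relationship the surrounding text describes (execution time as the product of instruction count, cycles per instruction, and clock period) is:

```latex
t_{\text{exec}} = N_{\text{instr}} \times \text{CPI} \times t_{\text{cycle}}
```

Reducing CPI toward one, as RISC does, lowers execution time; the trade-off noted above is that the larger RISC program size increases the instruction count term.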

The processor core is responsible for the overall flow and execution of a software program. Common processor core elements include control, execution and temporary storage units. The control unit directs program flow and dispatches instructions to the execution units.

The processor core incorporates a branching unit to control execution flow of the software program. An important feature of the branching unit is branch prediction. Branch prediction is used to minimize pipeline stalls by predicting the next logical path in the execution flow.

In addition to the branching unit, the RISC processor incorporates an instruction and data pipeline to increase processor throughput. Three stages (fetch, decode, and execute) are a minimum implementation for the pipeline in RISC architectures.

A performance factor to consider is the depth of the pipeline. A deeper pipeline has the potential to increase processor throughput. A consequence of deeper pipelines is a more complex processor implementation and degraded throughput when too many branches occur.

A branch occurring during program execution will stall the pipeline. A processor core recovers from a branch by refilling the pipeline with the required instructions and data for the segment of code to be executed next. The time it takes to refill the pipeline has a direct effect on program execution latency. Pipeline stalls can significantly affect runtime software efficiency.
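The cost of branch-induced stalls can be modeled with a simple effective-CPI calculation. The sketch below uses hypothetical numbers (base CPI of 1, 20% branch frequency, 10% misprediction rate, 3-cycle refill penalty); the model and figures are illustrative, not taken from the book.

```c
#include <assert.h>

/* Effective CPI = base CPI + (fraction of instructions that are branches)
 * x (misprediction rate) x (pipeline refill penalty in cycles).
 * A simple first-order model of branch stall cost.
 */
static double effective_cpi(double base_cpi, double branch_frac,
                            double mispredict_rate, double penalty_cycles)
{
    return base_cpi + branch_frac * mispredict_rate * penalty_cycles;
}
```

With the hypothetical figures above, effective CPI rises from 1.0 to 1.06; a deeper pipeline raises the penalty term and makes the same misprediction rate more expensive, which is the trade-off the text describes.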

Execution units implement a processor core’s computational functionality. The primary execution unit is the integer unit (IU). The IU executes arithmetic and logical operations on a set of integers. To perform more complicated math functions, the RISC architecture incorporates floating-point units (FPU) and single instruction multiple data (SIMD) execution units.

The FPU provides single- or double-precision floating-point math capability. SIMD units provide vector math capability. The AltiVec unit implemented in some of Freescale’s higher-performance PowerPC™ processors is an example of a SIMD extension.

The two common RISC architectural approaches to adding parallel processing functionality are super-scalar and very long instruction word (VLIW). A super-scalar architecture adds parallelism to the processor core by dynamically scheduling instructions to multiple execution units simultaneously. A VLIW architecture also provides simultaneous execution across multiple units; however, the instruction schedule is fixed at compile time.

The bank of general-purpose working registers is also called the register file. These registers provide temporary storage during program execution. RISC-based architectures require a relatively large number of registers to optimize compiler efficiency and reduce load/store unit operations; between 32 and 128 registers is typical.

Cache memory may be used to increase the overall performance of a processor implementation by reducing the number of external memory accesses required. The use of cache in a processor design can significantly increase system performance.

The two main levels of cache commonly implemented are called L1 and L2, with the architectures being either write-through or write-back. Cache memory usage is an important factor to consider. When implementing cache in an FPGA, it is typical to use block RAM for soft or firm processor cores.

The size of the cache to be implemented is a factor that must be considered when estimating block RAM resource utilization for the FPGA design. Cache misuse can significantly impact processor throughput.

As an example, cache misuse may occur when a commonly used code segment is replaced by another commonly used code segment resulting in cache thrashing. Cache thrashing can have serious consequences including reduced system performance.
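Thrashing of the kind described above arises when two frequently used addresses map to the same cache line. The sketch below shows the index calculation for a hypothetical direct-mapped cache (32-byte lines, 256 lines, 8 KB total); the geometry is an assumption for illustration only.

```c
#include <stdint.h>

/* Hypothetical direct-mapped cache geometry. */
#define LINE_SIZE 32u   /* bytes per cache line   */
#define NUM_LINES 256u  /* lines in the cache     */

/* Which cache line a byte address maps to. Two addresses separated by
 * exactly LINE_SIZE * NUM_LINES (8 KB here) share an index, so code
 * segments placed 8 KB apart will continually evict each other.
 */
static uint32_t cache_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_LINES;
}
```

For example, addresses 0x0000 and 0x2000 both map to line 0 in this geometry, so two hot loops placed at those addresses would thrash; careful placement (or cache locking, as noted below) avoids the collision.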

Another consideration is the use of cache to lock critical code regions such as interrupt service routines. Locking code segments in cache can reduce program execution latency, and may also increase determinism and software performance.

The bus interface unit is the communication channel for the processor core to on-chip and off-chip devices. A two-bus strategy is a typical bus implementation approach. One bus will typically support high-speed devices, while the second bus supports slower-speed devices.

The high-speed bus is commonly referred to as the local bus and is typically used to interface with off-chip devices such as DDR memory. The slower bus is commonly referred to as the peripheral bus and is typically used for interfacing to on- or off-chip peripherals such as an Ethernet 10/100 media access controller (MAC). Some improvements that can be made to increase bus performance and reliability are presented in the following list.

Bus Implementation Performance Improvement Factors
1) Increased operational speed
2) Use of wider bus widths
3) Decoupling of data and address transfers
4) Use of burst sequential access
5) Write buffer implementation
6) Support for both synchronous/asynchronous interfaces
7) Implementation of endianness (TCP/IP uses big endian)
8) Use of error detection/correction to maintain bus integrity
9) Use of the direct memory access (DMA) controller
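On the endianness point in the list above: TCP/IP transmits multi-byte fields most-significant byte first ("network order"), so a little-endian host must swap bytes. The portable swap below is a minimal sketch equivalent to what `htonl()` does on little-endian machines.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit word. On a little-endian host this
 * converts between host order and the big-endian "network order" that
 * TCP/IP headers use.
 */
static uint32_t swap32(uint32_t x)
{
    return (x >> 24) |
           ((x >> 8)  & 0x0000FF00u) |
           ((x << 8)  & 0x00FF0000u) |
           (x << 24);
}
```

Applying the swap twice returns the original value, which makes the same routine usable in both directions.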

Two common architectural bus implementations are Harvard and von Neumann bus architectures. The Harvard bus architecture is a two-bus implementation, supporting instruction and data access simultaneously. A majority of modern processors implement Harvard bus architecture interfaces.

An enhanced version of the Harvard architecture, called the modified Harvard architecture, includes two data buses to increase bus bandwidth. This architectural bus implementation is commonly seen on modern digital signal processors. The von Neumann bus architecture uses a single bus to access data and instructions.

One of the benefits of this less-complex bus architecture is that it requires fewer pins. Von Neumann is the typical bus implementation for external or off-chip devices. For processor implementation within an FPGA, the trade-off between the two bus architectures depends heavily on the number of FPGA I/O pins that must be used to implement the selected bus.

A disadvantage of von Neumann architecture is that the single data path may cause bottlenecks, thus producing degraded performance when compared with a Harvard implementation. An enhanced version of the von Neumann implementation is the modified von Neumann.

This implementation achieves faster transaction times by running the bus clock faster than the processor core. However, at the clock speeds of modern processors, this approach is no longer practical.

Efficient interrupt implementation is an important factor in deterministic real-time embedded systems. The implementation of an interrupt controller provides a low latency mechanism for signaling the processor core when a device needs attention.

The interrupt controller prioritizes peripheral events for devices attached to the processor core, and will typically be provided by the processor vendor as IP. The use of shadow registers can speed context switching during interrupts. Interrupt software implementations should be fast and efficient; lengthy computational processing should be left to application code.
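The "keep ISRs short" guidance above is commonly implemented with a deferred-work pattern: the handler only latches the event, and the application code does the lengthy processing later. The sketch below is a hypothetical illustration (the event bit assignment and function names are assumptions, not vendor IP).

```c
#include <stdint.h>

/* One bit per interrupt source; written by ISRs, consumed by the
 * application's main loop.
 */
static volatile uint32_t pending_events;

#define EVT_UART_RX 3u  /* hypothetical event bit for a UART receive */

/* The ISR does the minimum possible work: record that the event
 * occurred, then return so the pipeline and other interrupts are
 * blocked for as short a time as possible.
 */
static void isr_uart_rx(void)
{
    pending_events |= 1u << EVT_UART_RX;
}

/* Called from application code: atomically-enough consume one pending
 * event bit; returns 1 if the event was pending, 0 otherwise.
 */
static int take_event(uint32_t bit)
{
    if (pending_events & (1u << bit)) {
        pending_events &= ~(1u << bit);
        return 1;
    }
    return 0;
}
```

The main loop polls `take_event()` and performs the actual data handling there, keeping interrupt latency low and behavior deterministic.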

The MMU block provides a translation mechanism between the logical program data space and the physical memory space. The MMU may be used to extend the range of accessible external memory. MMU implementation is usually accomplished by separating the data and instruction memory regions. Typically, software implementation complexity increases when an MMU is used. The implementation of an MMU within a processor may have a significant effect on the processor’s real-time performance.
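The logical-to-physical translation an MMU performs can be sketched as a page-table lookup. The example below assumes hypothetical 4 KB pages and a single-level table; real MMUs add permission bits, translation lookaside buffers, and multi-level walks.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u     /* 4 KB pages (assumption for illustration) */
#define PAGE_MASK  0xFFFu  /* offset bits within a page */

/* Translate a logical (virtual) address to a physical one: the high
 * bits index a table of physical frame numbers, the low bits (the
 * in-page offset) pass through unchanged.
 */
static uint32_t translate(const uint32_t *page_table, uint32_t vaddr)
{
    uint32_t frame = page_table[vaddr >> PAGE_SHIFT];
    return (frame << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}

/* Tiny demo mapping: logical page 0 -> physical frame 5,
 * logical page 1 -> physical frame 9.
 */
static const uint32_t demo_table[2] = { 5u, 9u };

static uint32_t demo_translate(uint32_t vaddr)
{
    return translate(demo_table, vaddr);
}
```

Each lookup adds latency to every memory access, which is one reason the text notes that an MMU can significantly affect real-time performance.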

A final architectural consideration is the data-path for the software program. A processor is based on an efficient sequential instruction flow. Instruction flow interruptions and disturbances will impact performance. Floorplanning can be used to implement an optimized processor implementation data-path.

To read Part 1 go to: FPGA core types and trade-offs
Next in Part 3:  Processor, peripheral and software options

Used with permission from Newnes, a division of Elsevier. Copyright 2006, from “Rapid System Prototyping with FPGAs” by R.C. Cofer and Ben Harding. For more information about this title and other similar books, please visit

RC Cofer has almost 25 years of embedded design experience, including real-time DSP algorithm development, high-speed hardware, ASIC and FPGA design, and project focus. His technical focus is on rapid system development of high-speed DSP- and FPGA-based designs. He holds an MSEE from the University of Florida and a BSEE from Florida Tech.

Ben Harding has a BSEE from the University of Alabama, with post-graduate studies in DSP, control theory, parallel processing and robotics. He has almost 20 years of experience in embedded systems design involving DSPs, network processors and programmable logic.
