Basics of core-based FPGA design: Part 2 – System design considerations
In addition to the branching unit, the RISC processor incorporates an instruction and data pipeline to increase processor throughput. Three stages (fetch, decode, and execute) are the minimum pipeline implementation in RISC architectures. A performance factor to consider is the depth of the pipeline. A deeper pipeline has the potential to increase processor throughput, but it also results in a more complex processor implementation and degraded throughput when too many branches occur.
A branch occurring during program execution stalls the pipeline. The processor core recovers from a branch by refilling the pipeline with the instructions and data required for the segment of code to be executed next. The time it takes to refill the pipeline has a direct effect on program execution latency, so pipeline stalls can significantly reduce runtime software efficiency.
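As a rough illustration of why data-dependent branches hurt a pipelined core, the hedged C sketch below contrasts a hot loop containing an unpredictable branch with a branchless rewrite. The function names and the branchless trick are illustrative only, not part of any particular core's toolchain, and whether the rewrite is actually faster depends on the core and the data.

/* Hypothetical illustration: a data-dependent branch inside a hot loop.
   On a pipelined RISC core, each mispredicted branch forces the pipeline
   to be refilled, adding several cycles of latency per iteration. */
long sum_positive(const int *data, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] > 0)        /* outcome depends on the data        */
            sum += data[i];     /* hard to predict -> frequent stalls */
    }
    return sum;
}

/* A branchless rewrite keeps the pipeline full at the cost of extra
   arithmetic work per element. */
long sum_positive_branchless(const int *data, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += data[i] * (data[i] > 0);   /* comparison yields 0 or 1 */
    return sum;
}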
Execution units implement a processor core’s computational functionality. The primary execution unit is the integer unit (IU). The IU executes arithmetic and logical operations on a set of integers. To perform more complicated math functions, the RISC architecture incorporates floating-point units (FPU) and single instruction multiple data (SIMD) execution units.
The FPU provides single- or double-precision floating-point math capability. SIMD units provide vector math capability. The AltiVec unit implemented in some of Freescale's higher-performance PowerPC™ processors is an example of a SIMD extension.
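The hypothetical C loop below shows the kind of computation these units accelerate: the single-precision multiply-accumulate maps to the FPU, and a vectorizing compiler targeting a SIMD unit such as AltiVec can process several elements per instruction instead of one. The function name and signature are illustrative only.

/* Hypothetical example: single-precision multiply-accumulate loop.
   The floating-point math executes on the FPU; with a SIMD unit and a
   vectorizing compiler, several elements are processed per instruction. */
void saxpy(float a, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}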
The two common RISC architectural approaches for adding parallel processing functionality are super-scalar and very long instruction word (VLIW). A super-scalar architecture adds parallel processing to the processor core by dynamically scheduling instructions to multiple execution units simultaneously. A VLIW architecture also provides simultaneous execution across multiple units; however, the instruction scheduling is fixed at compile time.
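The hypothetical C fragment below shows the instruction-level parallelism both approaches exploit: the two accumulations are independent, so a super-scalar core can issue them to separate execution units at run time, whereas a VLIW compiler would bundle them into one wide instruction word at compile time. The function and variable names are purely illustrative.

/* Hypothetical illustration of instruction-level parallelism. */
void dual_sum(const int *a, const int *b, int n, int *s0, int *s1)
{
    int sum_a = 0, sum_b = 0;
    for (int i = 0; i < n; i++) {
        sum_a += a[i];   /* independent of the next statement          */
        sum_b += b[i];   /* can execute in parallel with the statement above */
    }
    *s0 = sum_a;
    *s1 = sum_b;
}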
The bank of general-purpose working registers may also be called a register file. These registers are used for temporary storage during program execution. In RISC-based architectures, a relatively large number of registers is necessary to optimize compiler efficiency and reduce load/store unit operations. The typical number of registers is between 32 and 128.
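As a hedged illustration of why a large register file matters, the C sketch below keeps eight running sums live across a loop. On a core with enough general-purpose registers they can all stay in registers; a smaller register file forces the compiler to spill some of them to the stack, adding load/store operations inside the loop. The example is illustrative only.

/* Hypothetical illustration of register pressure: eight live accumulators. */
void column_sums(const int m[][8], int rows, int out[8])
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (int r = 0; r < rows; r++) {
        s0 += m[r][0]; s1 += m[r][1]; s2 += m[r][2]; s3 += m[r][3];
        s4 += m[r][4]; s5 += m[r][5]; s6 += m[r][6]; s7 += m[r][7];
    }
    out[0] = s0; out[1] = s1; out[2] = s2; out[3] = s3;
    out[4] = s4; out[5] = s5; out[6] = s6; out[7] = s7;
}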
Cache memory may be used to increase the overall performance of a processor implementation by reducing the number of external memory accesses required; used well, it can significantly increase system performance.
The two main levels of cache commonly implemented are L1 and L2, with either write-through or write-back architectures. When implementing cache for a soft or firm processor core in an FPGA, it is typical to use block RAM, so the size of the cache must be considered when estimating block RAM resource utilization for the FPGA design. Cache misuse can also significantly impact processor throughput.
For example, cache thrashing occurs when one commonly used code segment repeatedly evicts another commonly used code segment from the cache; the resulting misses can seriously reduce system performance.
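The hypothetical C sketch below shows the analogous effect with data buffers: if the two arrays happen to map to the same cache lines (assume a small direct-mapped cache, which is an assumption made only for this example), alternating between them evicts each other's contents on every pass, so most accesses miss and go out to external memory.

/* Hypothetical sketch of cache thrashing with two large buffers. */
#define BUF_WORDS 4096          /* assumed to exceed the cache capacity */

int buf_a[BUF_WORDS];
int buf_b[BUF_WORDS];

long thrash(void)
{
    long sum = 0;
    for (int pass = 0; pass < 100; pass++) {
        for (int i = 0; i < BUF_WORDS; i++)
            sum += buf_a[i];    /* fills the cache with buf_a           */
        for (int i = 0; i < BUF_WORDS; i++)
            sum += buf_b[i];    /* evicts buf_a and fills it with buf_b */
    }
    return sum;
}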
Another consideration is locking critical code regions, such as interrupt service routines, in the cache. Locking code segments in cache can reduce program execution latency and may also improve determinism and software performance.
The bus interface unit is the processor core's communication channel to on-chip and off-chip devices. A two-bus strategy is a typical implementation approach: one bus supports high-speed devices, while the second supports slower-speed devices.
The high-speed bus is commonly referred to as the local bus and is typically used to interface with off-chip devices such as DDR memory. The slower bus is commonly referred to as the peripheral bus and is typically used for interfacing to on- or off-chip peripherals such as an Ethernet 10/100 media access controller (MAC). Some improvements that can be made to increase bus performance and reliability are presented in the following list.

