Basics of core-based FPGA design: Part 4 – Implementing a design

This last part in a four-part series presents an example project concept based on an FPGA embedded hard-core processor implementation. A complete implementation of a design this complex is beyond the scope of this series. Rather, the intent is to show a potential real-world advanced design example and discuss some of the factors that must be addressed in order to implement the system.

As an aid to the reader, application notes and reference papers are called out. These documents provide lower-level implementation detail. For a broader understanding of the technology utilized in this example, a review of the relevant datasheets and user guides is appropriate.

For the purpose of this example, the result of the architecture and processor evaluation is Xilinx’s XC4VFX20 component. This FPGA includes a 405 PowerPC processor, tri-mode Ethernet block, embedded memory and DSP slices.

Our FPGA-based projected system requirements include a PCI bus interface, a 10/100 Ethernet connection, an external DDR memory controller for access to processor memory and an external Flash memory controller for access to stored program memory.

Additionally, the system will support an I2C interface, an SPI interface, an RS-232 UART implementation and access to external switches and LEDs via GPIO signals. The system will also support a DSP function, and custom circuits. Figure 14.4 below illustrates the proposed system architecture.

Figure 14.4. Processor concept example

Xilinx’s system tool for implementing the embedded processor within the FPGA is the embedded development kit (EDK). EDK integrates the system, hardware and software tools together into one package. By following the automated flow, an evaluation board may be used as a starting point for the project. The evaluation board chosen should share as many features as possible with the final target application. Availability of the right evaluation board can help reduce design schedule and risk.

While it may not be possible to obtain an evaluation board with exactly the mix of peripherals and exact FPGA component desired, it should be possible to find a board with a similar part from the targeted FPGA device family. For this example, we will obtain a board with an XC4VFX12 component. Most evaluation boards include DDR memory, a 10/100 Ethernet PHY, DIP switches, LEDs and an RS-232 interface. The evaluation board should also support cable configuration and processor debug via a JTAG header.

Once the evaluation board has been obtained, the EDK should be used to configure the evaluation board. This process involves stepping through the automated flow. An example automated project configuration flow follows:

1) Select a new project using the automated flow
2) Select the evaluation board that was obtained
3) Select the processor (for this example, the PowerPC processor will be selected)
4) Enable the processor core features (for this example, the processor core frequency will be 200 MHz, the bus frequency will be 100 MHz, the cache will be enabled, and a JTAG interface will be selected for debugging)
5) Select the device to be used
6) Select the byte ordering (big-endian format is preferred for TCP/IP implementations)
7) Select the device peripherals, addresses and modes of operation (for this example, DDR memory, Flash, Ethernet, and RS-232 are selected and configured)
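The byte-ordering preference in step 6 follows from the fact that TCP/IP header fields are transmitted most-significant byte first (network byte order), which is the same layout a big-endian processor uses in memory. The following is a minimal sketch in C of explicit network-order packing; the function names `put_be32`, `get_be32` and `host_is_big_endian` are hypothetical helpers introduced for illustration and are not part of EDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value most-significant byte first (network byte order). */
void put_be32(uint8_t *dst, uint32_t value)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(value >> 24);
    dst[1] = (uint8_t)(value >> 16);
    dst[2] = (uint8_t)(value >> 8);
    dst[3] = (uint8_t)value;
}

/* Read a 32-bit value stored most-significant byte first. */
uint32_t get_be32(const uint8_t *src)
{
    assert(src != NULL);
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8)  |  (uint32_t)src[3];
}

/* Returns 1 when the native memory layout already matches network order. */
int host_is_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    uint8_t bytes[4];
    memcpy(bytes, &probe, sizeof bytes);
    return bytes[0] == 0x01;
}
```

On a big-endian target the packing helpers produce the same bytes as a plain word store, so no byte swapping is needed on the network path; on a little-endian host the same code performs the swap.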

After these steps have been completed, an initial project may be built and the FPGA configured. Using this project, initial development of the software can begin. This project is then stored, and the PCI, SPI, I2C, and timer blocks can be configured. The configuration of these devices includes connecting each device to the processor bus.

The selection of the processor core will heavily influence the implementation of the processor bus. The processor bus is responsible for supporting communication between the processor core and its peripherals. The bus supported by EDK for the 405 core is an implementation of IBM’s CoreConnect bus structure. The bus connected directly to the 405 is the processor local bus (PLB). A secondary bus is also implemented and is called the on-chip peripheral bus (OPB). The two buses are connected through a bridge.

The bridge imposes clock cycle latencies for accesses to peripherals connected to the OPB. The OPB is a slower bus implementation than the PLB. The PLB should be reserved for high-speed and high-priority devices, while slower and lower-priority devices may be mapped onto the OPB. Each peripheral device must have a defined mode of operation on the bus: master, slave or both. The memory range for each peripheral device must also be defined. For this example all the peripherals will be memory mapped.
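Because every peripheral is memory mapped, each one needs an address range that does not collide with any other. A minimal sketch in C of such a map and an overlap check follows. All base and high addresses, the region names, and the functions `map_is_valid` and `map_lookup` are hypothetical values chosen for illustration; in a real project the ranges come from the EDK address assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { BUS_PLB, BUS_OPB } bus_t;

typedef struct {
    const char *name;
    bus_t       bus;   /* bus the peripheral is attached to */
    uint32_t    base;  /* first address of the range (hypothetical) */
    uint32_t    high;  /* last address of the range, inclusive */
} region_t;

/* Hypothetical address assignments, for illustration only. */
const region_t example_map[] = {
    { "ddr",   BUS_PLB, 0x00000000u, 0x03FFFFFFu },
    { "uart",  BUS_OPB, 0x40600000u, 0x4060FFFFu },
    { "gpio",  BUS_OPB, 0x40000000u, 0x4000FFFFu },
    { "flash", BUS_PLB, 0xFF800000u, 0xFFFFFFFFu },
};
const size_t example_map_len = sizeof example_map / sizeof example_map[0];

/* Returns 1 if every range is well formed and no two ranges overlap. */
int map_is_valid(const region_t *map, size_t n)
{
    assert(map != NULL);
    for (size_t i = 0; i < n; i++) {
        if (map[i].base > map[i].high)
            return 0;
        for (size_t j = i + 1; j < n; j++) {
            if (map[i].base <= map[j].high && map[j].base <= map[i].high)
                return 0;
        }
    }
    return 1;
}

/* Returns the region that decodes addr, or NULL if the address is unmapped. */
const region_t *map_lookup(const region_t *map, size_t n, uint32_t addr)
{
    assert(map != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr >= map[i].base && addr <= map[i].high)
            return &map[i];
    }
    return NULL;
}
```

A check of this kind can run on the desktop as part of the software build, so an address collision is caught before the design reaches hardware.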

The FPGA device-level and board-level decisions for the peripherals are interrelated with design implementation factors such as FPGA device placement and orientation, the physical relationship to other components on the board, the I/O standards for each FPGA pin, the I/O bank architecture and any I/O assignment limitations.

The decisions regarding the implementation of the external peripheral interfaces and related internal logic placement associated with each peripheral must take into account the overall FPGA data-flow. This effort must optimize the flow of data to and from the processor to high-priority and high-speed peripherals.

Floorplanning is an important design activity that can guide the tools to achieve the desired device layout and preferred data path flow. High-bandwidth and high-speed interfaces should be given extra care. Additional information can be found in Xilinx application note, XAPP653 3.3V PCI Design Guidelines.

The assignment of peripheral devices to the OPB and PLB buses is an important design step. The PLB bus assignments include the DDR memory controller, the Flash memory controller, the PCI bus controller, and the tri-mode MAC. The OPB bus assignments include the I2C controller, the SPI controller, the UART block and the GPIO interface pins accessing external LEDs and switches.

Additional devices added to the OPB include a system timer and an interrupt controller. The assignment of these blocks to the appropriate buses has a large potential influence on the implemented processor’s efficiency. For example, connecting the PCI bus controller to the OPB bus would significantly degrade performance, limiting design functionality.

Additional information can be found in Xilinx application notes XAPP709 DDR SDRAM Controller Using Virtex-4 Devices and XAPP701 Memory Interfaces Data Capture Using Direct Clocking Technique. Additional Ethernet interface information can be found in Xilinx application note XAPP443 Ethernet Cores Hardware Demonstration Platform.

This design requires performance acceleration. Internal cache functionality will be enabled. The design also takes advantage of the 405 PowerPC core processor auxiliary processing unit (APU) interface to communicate efficiently with the DSP coprocessor functionality implemented within the FPGA.

The APU supports a high-bandwidth interface between the FPGA logic fabric and the pipeline of the 405 core. Details of an APU implementation may be found in Xilinx’s application note XAPP717 Accelerated System Performance with the APU Controller and XtremeDSP Slices. Additional information may be found in Xilinx’s PowerPC Instruction Set Extension Guide.

The design also implements an interrupt controller. The interrupt controller is used to provide additional interrupt lines. The PowerPC core natively supports two interrupt pins. These two interrupt inputs support critical and noncritical interrupts, respectively. Design details are presented in Xilinx’s application note XAPP778 Using and Creating Interrupt-Based Systems.
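The idea of fanning several interrupt sources into one processor interrupt input can be sketched in software. The following C fragment is a simplified model written for illustration, not the register interface of the Xilinx interrupt controller core; the structure `intc_model_t`, the function `intc_dispatch` and the fixed lowest-number-first priority are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INTC_NUM_SOURCES 8u

typedef void (*isr_t)(void);

/* Software model of a controller that merges several sources into one line. */
typedef struct {
    uint32_t pending;                    /* one bit per requesting source */
    uint32_t enabled;                    /* one bit per unmasked source */
    isr_t    handler[INTC_NUM_SOURCES];  /* per-source service routine */
} intc_model_t;

/* Example handler used to demonstrate the dispatch; counts its invocations. */
unsigned uart_irq_count;
void uart_isr(void) { uart_irq_count++; }

/*
 * Called from the single processor interrupt entry point. Services every
 * enabled, pending source, lowest source number first, and returns the
 * number of sources serviced.
 */
unsigned intc_dispatch(intc_model_t *intc)
{
    unsigned serviced = 0;

    assert(intc != NULL);
    for (unsigned src = 0; src < INTC_NUM_SOURCES; src++) {
        uint32_t bit = (uint32_t)1u << src;
        if ((intc->pending & intc->enabled & bit) != 0u) {
            intc->pending &= ~bit;           /* acknowledge the source */
            if (intc->handler[src] != NULL)
                intc->handler[src]();
            serviced++;
        }
    }
    return serviced;
}
```

A source that is pending but masked stays pending, so it is serviced as soon as software enables it.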

The main goal in using these processor features is to reduce the number of external memory accesses and decrease peripheral event response latency. Additionally, a DMA controller is used for the Ethernet device to increase data throughput and to off-load the processor core. Additional information on performance enhancement can be found in Xilinx’s ETP-367 paper “FPGA Embedded Processors: Revealing True System Performance.”

Many different software design approaches can be taken to implement a fixed set of functional requirements. The following paragraphs present one viable set of software design decisions and factors. These are, of course, not the only potential solutions for implementing the required functionality; however, they should serve as a high-level design approach example. Figure 14.5 below illustrates the interrelationship between the hardware and software development flows.

Figure 14.5. Co-design hardware and software tool interaction

The operating system selected for this example implementation is uClinux. It is a good choice because it provides source code access, a TCP/IP stack and is a popular OS solution. Since uClinux does not require an MMU, the MMU functionality of the 405 core is disabled. Software debugging may be streamlined by taking advantage of network file system (NFS) capability and gdbserver.

NFS allows a developer to export a working directory to a remote uClinux platform. This allows developers to compile code on their desktop development platform and then run the code remotely on the target system. The gdbserver program is the target server that provides the connection to the gdb debugger tool on the development system.

Another important design consideration is the order of code execution. As an example, it is common for a peripheral to require a specific register access order during the device’s initialization phase. It is possible for the PowerPC core to implement nonsequential instruction execution. A PowerPC instruction that can prevent out-of-sequence execution is the enforce in-order execution of I/O (EIEIO) instruction.
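A minimal sketch in C of an ordered initialization sequence follows. The device, its register layout (`dev_regs_t`), the bit definitions and the function `dev_init` are hypothetical. The `IO_BARRIER` macro assumes a GCC-compatible compiler: on a PowerPC target it emits `eieio`, and on any other host it falls back to a compiler-only barrier so the sketch can be tried on a desktop machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__powerpc__) || defined(__PPC__)
/* PowerPC target: order the preceding I/O accesses before the following ones. */
#define IO_BARRIER() __asm__ __volatile__("eieio" ::: "memory")
#else
/* Desktop build (assumption): compiler barrier only, no hardware ordering. */
#define IO_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif

/* Hypothetical peripheral register block; volatile keeps every access. */
typedef struct {
    volatile uint32_t ctrl;
    volatile uint32_t divisor;
} dev_regs_t;

#define DEV_CTRL_RESET  0x1u
#define DEV_CTRL_ENABLE 0x2u

/* The device is assumed to require: reset, then divisor, then enable. */
void dev_init(dev_regs_t *regs, uint32_t divisor)
{
    assert(regs != NULL);

    regs->ctrl = DEV_CTRL_RESET;
    IO_BARRIER();
    regs->divisor = divisor;
    IO_BARRIER();
    regs->ctrl = DEV_CTRL_ENABLE;
    IO_BARRIER();
}
```

On the target, `regs` would point at the peripheral’s base address from the memory map; in a desktop test it can point at an ordinary structure in RAM.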

The C programming language was selected to implement the PowerPC software program. A few programming considerations to keep in mind for embedded development include:

1) Use the static keyword to control variable visibility
2) Use the volatile keyword to prevent the compiler from optimizing out accesses to key variables
3) Code to reduce branching since stalls affect efficiency
4) Maintain an awareness of the state of stack usage
5) Disable interrupts when validating boot code
6) Include comments to clarify code intent, and to identify critical design factors and exceptions
7) Use null interrupt service routines (ISRs) for any unused interrupts
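Items 1, 2 and 7 in the list above can be sketched together in a few lines of C. The vector table size, the function names and the spurious-interrupt counter are hypothetical; a null ISR may equally well do nothing at all, and the counter here is only an optional debugging aid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 4u

typedef void (*isr_t)(void);

/* static: these names are visible only inside this source file. */
/* volatile: the values change inside ISRs, so every access must be kept. */
static volatile uint32_t tick_count;
static volatile uint32_t spurious_count;
static isr_t vector_table[NUM_VECTORS];

/* Null ISR: a safe landing place for any interrupt nobody expects. */
static void null_isr(void)
{
    spurious_count++;
}

static void timer_isr(void)
{
    tick_count++;
}

/* Fill every slot with the null ISR first, then install the real handlers. */
void vectors_init(void)
{
    for (size_t i = 0; i < NUM_VECTORS; i++)
        vector_table[i] = null_isr;
    vector_table[0] = timer_isr;
}

/* Stand-in for the hardware raising interrupt n (for desktop testing). */
void raise_vector(size_t n)
{
    assert(n < NUM_VECTORS);
    vector_table[n]();
}

/* Accessors expose the file-private counters without exporting the variables. */
uint32_t ticks(void)    { return tick_count; }
uint32_t spurious(void) { return spurious_count; }
```

Because every table entry is initialized, an unexpected interrupt runs a known routine instead of jumping through an uninitialized pointer.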

One of the biggest traditional design challenges involves bringing up a new hardware board for the first time. The challenges associated with this process can be significantly reduced by initially developing and verifying software on a known-good evaluation board platform.

Having access to a target evaluation board in advance of the access to the final hardware board allows progress to be made and increases confidence in the functionality of code developed before the final target board is available.

Access to a verified hardware platform can also be invaluable during board verification since it can provide a stable platform for operational comparison. The process of booting the target software within an FPGA embedded processor begins once the FPGA has been successfully configured.

In a well-designed system, the processor will be in a defined nominal state with the processor held in reset. Once the processor’s reset is released, the processor will jump to the reset vector location. The reset vector is a defined memory location, 0xFFFFFFFC in the PowerPC. The instruction at this location must be an unconditional branch to the first location of the boot code.
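To make the reset-vector branch concrete, the following C sketch computes the 32-bit encoding of a PowerPC relative branch (`b target`, I-form: primary opcode 18, a word-aligned signed displacement, AA and LK bits clear). The function name `ppc_encode_branch` is a hypothetical helper, and the boot code start address used in the usage note is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PPC_RESET_VECTOR 0xFFFFFFFCu

/*
 * Encode "b target" for an instruction located at address 'at'.
 * The displacement is target - at, must be word aligned, and must fit in
 * the signed 26-bit range the I-form branch provides (about +/-32 MB).
 */
uint32_t ppc_encode_branch(uint32_t at, uint32_t target)
{
    uint32_t disp = target - at;   /* wraps modulo 2^32 by design */

    assert((disp & 0x3u) == 0u);                          /* word aligned */
    assert(disp < 0x02000000u || disp >= 0xFE000000u);    /* in range */

    return 0x48000000u | (disp & 0x03FFFFFCu);            /* opcode 18 */
}
```

For example, if the boot code were assumed to start at 0xFFFFF000, `ppc_encode_branch(PPC_RESET_VECTOR, 0xFFFFF000u)` yields 0x4BFFF004, which is the word that would be placed at the reset vector.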

Most FPGA embedded processors have their boot code loaded in memory within the FPGA during the device configuration process. The boot code program is a non-compressed routine that contains the code for initializing the processor and then copying the application code to its runtime location within memory.

The first task of the boot code is to initialize registers to place the processor in a known state and defined memory map. This includes clocking speeds, execution mode, and other related processor-specific items requiring definition, such as the memory interface.

The PowerPC core is in big-endian mode by default after exiting reset; thus, boot code must be in big-endian format. Program execution begins after a jump to the location in memory where the boot code is located. Before jumping to the application code, the boot code must set up the C environment. Once the C environment is configured, the boot code jumps to the boot-loader, which completes the boot-up sequence by performing a self-test to rule out potential hardware failures.

The memory contents are then placed in a known state, before copying operational code and jumping to the beginning of the application code. Before the operational code copy procedure occurs, the boot-loader checks for potential updates. If an update is available, the boot-loader will erase the nonvolatile area of memory containing the application code and store the new version. It will then copy this new code version to its specified runtime location. A related application note is Xilinx XAPP642 Relocating Code and Data for Embedded Systems.
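The update-then-copy behavior described above can be modeled on a desktop machine. In the C sketch below, the image layout (`image_t`), the version comparison rule and the function `boot_load` are assumptions made for illustration; flash erase is modeled with `memset`, and a real boot-loader would use the flash device’s own erase and program commands and would normally validate the image before accepting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMAGE_MAX 64u

/* Hypothetical application image as held in nonvolatile memory. */
typedef struct {
    uint32_t version;
    uint32_t length;               /* number of valid payload bytes */
    uint8_t  payload[IMAGE_MAX];
} image_t;

/*
 * Installs 'update' over 'stored' when it is newer, places the runtime
 * memory in a known state, copies the application to its runtime location
 * and returns the version that was loaded. 'update' may be NULL.
 */
uint32_t boot_load(image_t *stored, const image_t *update,
                   uint8_t *ram, size_t ram_size)
{
    assert(stored != NULL && ram != NULL);

    if (update != NULL && update->version > stored->version &&
        update->length <= IMAGE_MAX) {
        memset(stored, 0xFF, sizeof *stored);   /* model of a flash erase */
        *stored = *update;                      /* store the new version */
    }

    assert(stored->length <= IMAGE_MAX && stored->length <= ram_size);
    memset(ram, 0, ram_size);                   /* known memory state */
    memcpy(ram, stored->payload, stored->length);
    return stored->version;
}
```

After `boot_load` returns, the real boot-loader would jump to the start of the copied application code.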

The boot code is separate from the application code to protect the system from corruption of the boot code. Corruption of the boot code will render the system incapable of booting. Code updates may occur via updates through interfaces such as Ethernet or RS-232. A generalized board bring-up process is summarized in the following list.

Board Bring-Up of the FPGA Embedded Processor 

1) FPGA initialized from external nonvolatile FPGA configuration source
2) Processor powers up in reset mode
– On release of reset, processor vectors to the reset code location
– May be either external nonvolatile memory or a volatile memory block on the FPGA loaded during FPGA configuration process
3) Initialize processor (typically written in assembly)
4) Set up higher-level language environment and jump to boot-loader section of boot code
5) Perform hardware integrity test including memory and other hardware that could affect processor operation
6) Update application code if newer version is available
7) Copy program from source to its runtime location
8) Jump to application code
9) Initialize RTOS and set up BSP
10) Kick off scheduler
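The memory portion of the hardware integrity test in step 5 is often built from simple pattern tests. The C sketch below shows two common ones, a walking-ones data bus test and a fill-and-verify region test; the function names and the 0xA5A5A5A5 mixing constant are illustrative choices, not part of any Xilinx library. Note that such a test destroys the contents of the memory it checks, so it belongs before the application copy in step 7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walking-ones test at a single location: writes each single-bit pattern
 * and reads it back. Returns 0 on success or the first failing pattern.
 */
uint32_t test_data_bus(volatile uint32_t *addr)
{
    assert(addr != NULL);
    for (uint32_t pattern = 1u; pattern != 0u; pattern <<= 1) {
        *addr = pattern;
        if (*addr != pattern)
            return pattern;
    }
    return 0u;
}

/*
 * Fills a region with values derived from each word's index, then verifies
 * them. Returns 'words' on success or the index of the first bad word.
 */
size_t test_region(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)i ^ 0xA5A5A5A5u;
    for (size_t i = 0; i < words; i++) {
        if (base[i] != ((uint32_t)i ^ 0xA5A5A5A5u))
            return i;
    }
    return words;
}
```

On the target, the pointers would refer to the DDR address range from the memory map; the `volatile` qualifier keeps the compiler from removing the read-back.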

Debugging can be accomplished by supporting access to signals and nodes internal to the FPGA. Signal test headers and signal access are discussed in the device-level and board-level design decision chapters. Since the 405 processor uses a 32-bit bus, at least 36 lines should be brought out to a test header. This supports parallel access to the processor bus and some control signals.

The test header should also include several grounded pins to support simplified test equipment connection. LEDs and switches may be included to help debug the design. Signal and internal node access may also be supported through a JTAG ChipScope Internal Logic Analyzer implementation. Implementation of a second JTAG port may allow additional 405 PowerPC debug capabilities such as trace capability. Implementation of an internal logic analyzer does require some FPGA internal resources to implement.

The implementation of processors embedded within an FPGA device can be a challenging and complex process. Careful consideration of critical system design elements can help streamline this process. Table 14.2 below provides a high-level FPGA embedded processor design checklist.

Table 14.2. Processor checklist

Ultimately, the implementation of an embedded FPGA processor design involves every aspect of system-level design. Since every aspect of the design implementation may be specified by the design team, there is a higher level of flexibility throughout the design cycle than is encountered with conventional discrete processor design.

The design team is responsible for the evaluation, selection and implementation of each functional element within the FPGA device. The design team has unprecedented freedom in the implementation of the design with the option to implement functionality within either the hardware or software domain.

Even late in the design process, the design team can repartition or reconfigure the design architecture and adjust critical design elements if the system performance benefits justify the required effort to implement the design changes. With the correct preparation, an organized and disciplined team can implement complex, customized designs efficiently.

To read Part 1, go to “Core types and tradeoffs.”
To read Part 2, go to “System design considerations.”
To read Part 3, go to “Picking the right core options.”

Used with permission from Newnes, a division of Elsevier. Copyright 2006, from “Rapid System Prototyping with FPGAs,” by R.C. Cofer and Ben Harding. For more information about this title and other similar books, please visit

RC Cofer has almost 25 years of embedded design experience, including real time DSP algorithm development, high speed hardware, ASIC and FPGA and project focus. His technical focus is on rapid system development of high speed DSP and FPGA based designs. He holds an MSEE from the University of Florida and a BSEE from Florida Tech.

Ben Harding has a BSEE from the University of Alabama, with post-graduate studies in DSP, control theory, parallel processing and robotics. He has almost 20 years of experience in embedded systems design involving DSPs, network processors and programmable logic.
