Designing with core-based high-density FPGAs

One engineer's adventures designing with microprocessor-based FPGAs.

Modern field programmable gate arrays (FPGAs) are great for a wide range of high-speed, complex signal processing but can be difficult to interface to external systems. Microprocessors are great for interfacing to other systems, especially when equipped with Ethernet for communications, but don't offer the same levels of performance.

Until recently, designers either had to work around the weak spots of the chosen device or combine the two devices; the latter approach presents new difficulties when the data rate between the signal processing and general processor is significant. Enter FPGA devices with built-in microprocessors, combining modern 32-bit microcontrollers and Ethernet media access controllers (MACs) with FPGA resources.

This article presents my experience with designing a nontrivial multiprocessor system, using three networked Xilinx Virtex-4FX-based controllers.

Problem and solution
The system being developed by my client was a high-powered, pulsed laser for a military application. Unlike laser pointers, which are continuous wave (CW) lasers, this system consists of four pulsed lasers, using a technique called Q-switching to emit a series of regularly spaced laser pulses; the outputs of these four lasers are ultimately combined optically for the final output.1

So, in addition to general housekeeping, the client identified a need early on for a number of high-speed photodiodes to monitor various aspects of the generated laser pulses. Ultimately, this evolved into an eight-channel pulse detection and analysis system operating at 200 million samples per second (Msps) for each channel. Clearly, no embedded processor system was going to be able to handle that throughput, so an FPGA-based solution was envisioned.

At the same time, other requirements, such as a relatively large number of sensors (more than 200), a good number of actuators, and unique command and telemetry interface with an external host system for overall control and monitoring, argued strongly for a microprocessor-based solution.

The initial thought was to combine an FPGA with a microprocessor, but because it appeared the interface between the two would, in itself, provide a challenge, I decided to investigate the then-new Virtex-4FX devices (this was in the fall of 2005). In addition to the high-performance logic and memory resources expected in a modern FPGA, these devices incorporate several “hard IP” resources, specifically PowerPC 32-bit microprocessors and Ethernet MACs.

This hard IP, augmented by a wide range of peripheral IP (such as interrupt controllers, serial ports, serial peripheral interfaces, and memory controllers), provides the basis for a complete microprocessor system on a chip, with the benefit of supporting high-speed interfaces to custom logic entities. (IP originally stands for intellectual property, but the term is also used for a module that may be incorporated into an FPGA design, similar to a peripheral chip on a microprocessor board of old.)

Hardware design

The approach we took during the design of this system was to first partition the required functionality into two separate processing elements because the overall system design divided the system into a laser assembly and a supporting electronics rack (shown in Figure 1). The laser assembly contains the four laser resonators, optics, Q-switches, and the high-speed photodiode diagnostic sensors, and had to conform to strict physical interfaces.

Because of several high-level decisions, the design of the laser assembly was further partitioned into two identical halves, each implementing two of the four lasers; therefore, the laser electronics also consists of two identical assemblies. The supporting rack fills a standard 19-inch electronics rack, and contains power converters, Q-switch drivers, eight high-power laser pump diodes and their drivers, host interfaces, and the master processor.

So, due to our high-level requirements (the laser assembly and support rack approach) and the resulting system design, the electronics design evolved into a multiprocessor solution, with three processors networked together to satisfy the functionality demanded.

To keep the development effort manageable, the three processor assemblies utilize a common “digital board,” each augmented with different analog boards; of course, the two laser processors used the same analog board design.

Here is the first instance where the flexibility of the Virtex-4FX devices paid real benefits; because the two analog board designs required different mixes of analog-to-digital (A/D) and digital-to-analog (D/A) converters, as well as various other digital I/O interfaces, it would have been difficult to meet both sets of I/O without significant waste.

However, by putting the actual interface circuits (such as photodiode circuitry, A/D converters, serial line drivers) onto the custom analog board, the interconnection between the two boards consists largely of direct FPGA I/O connections; hence, a different configuration, with the appropriate IP peripherals, allowed the same processor board to implement different functionality.

Now that the high-level system architectural decisions had been made, it was time to detail the designs of each of the processor subsystems. The first step was to separate the data acquisition and, where applicable, real-time analysis functions into high-rate and low-rate categories. The high-rate analyses were limited to relatively simple, high-speed algorithms that are performed on the input data stream by dedicated FPGA logic. Lower-rate, more complex analyses were performed on buffered data, assisted by the stream processing.

High-speed processing

The high-speed processing entailed several measurements:

  • Pulse detection.
  • Timing of pulse (delay, width, onset, offset).
  • Peak amplitude.
  • Total energy.
  • Inadvertent emissions.

These measurements had to be determined on a data stream at 200 Msps, a new sample every 5 ns! Fortunately, I had three things in my favor. First, the optics design ensured that the laser pulses, if they occurred at all, would always appear within a few hundred nanoseconds from when the Q-switches are pulsed; this allowed me to capture a block of A/D readings triggered by the Q-switch pulse, and process the data between pulses (as shown in Figure 2).

Second, the analog design of the photodiode amplifier, filtering, and A/D conversion was very good, and simple thresholds were sufficient to pick out the pulses and inadvertent emissions (laser energy outside of the expected window after Q-switch firing). Finally, I had the FPGA logic and block RAM resources at my disposal, without which there really was no hope!

The first step in processing the laser data, after acquisition, is to compare against a threshold to detect whether a pulse or inadvertent emission is present; actually, three thresholds were used: pulse start, pulse end, and inadvertent emission. The logic for selecting which to use is simple:

  • From the Q-switch trigger until a pulse is found, the pulse start threshold is used.
  • From the starting of the pulse, the pulse end threshold is used until the signal returns below that.
  • After the pulse is complete, or after the window is filled, the inadvertent emission threshold is used to look for signal excursions above the nominal level between pulses.

The time of each pulse threshold crossing is noted in a register, and inadvertent events are counted.
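
The threshold selection described above can be modeled in software. The following is only a C sketch of the idea, not the FPGA logic used in the system: all names are hypothetical, the treatment of a pulse still in progress when the window fills is an assumption, and counting one inadvertent event per rising crossing is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the three-threshold detector. */
typedef enum { WAIT_PULSE, IN_PULSE, AFTER_PULSE } detect_state_t;

typedef struct {
    uint16_t pulse_start;  /* used from the Q-switch trigger until a pulse is found */
    uint16_t pulse_end;    /* used from the start of the pulse until the signal drops below it */
    uint16_t inadvertent;  /* used after the pulse completes or the window fills */
} thresholds_t;

typedef struct {
    int      pulse_found;        /* 1 once both crossings have been seen */
    uint32_t onset;              /* sample index of the start crossing */
    uint32_t offset;             /* sample index of the end crossing */
    uint32_t inadvertent_count;  /* rising crossings of the inadvertent threshold */
} detect_result_t;

/* samples[0] is the first sample after the Q-switch trigger;
 * window is the number of samples in which a pulse is expected. */
detect_result_t detect_pulse(const uint16_t *samples, size_t n,
                             size_t window, const thresholds_t *t)
{
    detect_result_t r = {0, 0, 0, 0};
    detect_state_t state = WAIT_PULSE;
    int above = 0;  /* was the previous sample above the inadvertent threshold? */

    assert(samples != NULL && t != NULL);

    for (size_t i = 0; i < n; i++) {
        uint16_t s = samples[i];

        /* Window filled: switch to the inadvertent-emission threshold.
         * Assumption: a pulse still in progress is not counted as an event. */
        if (state != AFTER_PULSE && i >= window) {
            above = (state == IN_PULSE);
            state = AFTER_PULSE;
        }

        switch (state) {
        case WAIT_PULSE:
            if (s >= t->pulse_start) {
                r.onset = (uint32_t)i;
                state = IN_PULSE;
            }
            break;
        case IN_PULSE:
            if (s < t->pulse_end) {
                r.offset = (uint32_t)i;
                r.pulse_found = 1;
                state = AFTER_PULSE;
                above = (s >= t->inadvertent);  /* do not count the decaying tail */
            }
            break;
        case AFTER_PULSE: {
            int now = (s >= t->inadvertent);
            if (now && !above)
                r.inadvertent_count++;
            above = now;
            break;
        }
        }
    }
    return r;
}
```

Using separate start and end thresholds gives the detector hysteresis, so noise near a single threshold does not produce multiple onset and offset crossings for one pulse.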

As I mentioned earlier, the laser pulses always happen shortly after the Q-switches are triggered. The Virtex-4FX has a number of 18 kilobit block RAM resources, which can conveniently hold 1,024 samples of A/D values, which at 200 MHz represents a 5-µs window; fortunately, the pulses arrive well within that window. So, a simple state machine was used to look for the Q-switch trigger, then direct the next 1,024 samples into the block RAM.

After the block RAM has been filled, an interrupt is sent to the PowerPC, informing it of data to be processed. Because the lasers are pulsed at 5 kHz, the PowerPC must process the four channels of data within 200 µs. This processing is greatly simplified by the onset and offset delay measurements, as we only need to sum (for pulse energy calculations) those few points.
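
To show why the onset and offset measurements help, the following C sketch sums only the samples between the two crossings instead of the whole 1,024-sample buffer. The function and type names are hypothetical, and the result is left in raw A/D counts; any baseline subtraction or scaling to physical units is outside this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t peak;    /* largest sample inside the pulse, in A/D counts */
    uint32_t energy;  /* sum of samples inside the pulse, in A/D counts */
} pulse_meas_t;

/* Sum the samples in [onset, offset), the span reported by the
 * threshold logic, rather than the full capture buffer. */
pulse_meas_t measure_pulse(const uint16_t *buf, size_t len,
                           size_t onset, size_t offset)
{
    pulse_meas_t m = {0, 0};

    assert(buf != NULL && onset <= offset && offset <= len);

    for (size_t i = onset; i < offset; i++) {
        m.energy += buf[i];
        if (buf[i] > m.peak)
            m.peak = buf[i];
    }
    return m;
}
```

With a pulse only a few samples wide, the loop touches a handful of points per channel, which is what makes four channels fit comfortably in the 200-µs period between pulses.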

(An interesting aside here is that, during development, a bug caused the pulse start and end offset measurements to be mishandled, and very weird behavior ensued. The number of pulses seen was exactly a tenth of what was expected, for three of the four channels, but the fourth channel was much worse, and it varied considerably. It turned out that because the offsets were wrong, most of the buffers were being processed, and the PowerPC was not able to keep up; interrupts were being missed. A not-so-subtle reminder of how little time 5 ns is to a PowerPC!)

Statistics (the minimum, maximum, and average values) were gathered over a 1/10-second window for the pulse characteristics (for example, delays, width, peak amplitude, and total energy), and these were sent, along with a representative pulse, to the main processor for storage.

Other custom peripherals
Other custom peripherals were needed to support system devices, both unique and common. As an example of the former, as I mentioned earlier, the system actually consists of four lasers that are combined to create the final output. In order to generate a coherent pulse, the four must emit their pulses (nearly) simultaneously.

Each laser is sent a “sync pulse” which is used to trigger its Q-switch, and thence, an output pulse; unfortunately, because of mechanical and material differences, the delay between the electrical sync pulse and the output laser pulse varies between the four channels.

To allow us to calibrate the four channels, I designed a custom IP, using the Xilinx EDK new peripheral wizard to connect it to the internal processor bus. This peripheral accepts an externally provided sync pulse, stretches it as required by the Q-switches, and delays each output pulse as necessary to time-align the output laser pulses. In this case, the peripheral was fully custom, though the wizard automated the connection to the processor.
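
The calibration arithmetic behind such a peripheral can be sketched in C. This is an assumption about how the per-channel delays might be chosen, not a description of the actual peripheral: each channel is delayed so that its total delay matches the slowest channel. The names and the use of clock ticks as the unit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LASERS 4

/* measured[i]: delay from sync pulse to laser output for channel i,
 *              in clock ticks (hypothetical unit).
 * added[i]:    extra delay to program for channel i so that all four
 *              output pulses line up with the slowest channel. */
void align_delays(const uint32_t measured[NUM_LASERS],
                  uint32_t added[NUM_LASERS])
{
    uint32_t slowest = 0;

    assert(measured != NULL && added != NULL);

    for (size_t i = 0; i < NUM_LASERS; i++)
        if (measured[i] > slowest)
            slowest = measured[i];

    for (size_t i = 0; i < NUM_LASERS; i++)
        added[i] = slowest - measured[i];
}
```

For example, measured delays of 120, 135, 128, and 140 ticks give added delays of 20, 5, 12, and 0 ticks, so every channel emits 140 ticks after the sync pulse.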

We also had to create custom peripherals for some common devices. For example, we used a number of Analog Devices' multichannel A/D converters (the 16-channel AD7490) for acquiring sensor data; these were interfaced to the FPGA via a serial peripheral interface (SPI) with multiple chip selects. The A/D converter represented a challenge not well met by the stock SPI controller provided by Xilinx, because it requires the chip select signal to be asserted for the 16-bit transfer for each channel read. The stock Xilinx SPI controller provided by the EDK v9.1i tools supported two modes of chip select operation, namely automatic (per-byte), or manual.

Unfortunately, the SPI controller only supported 8-bit word sizes, which forced the manual mode; this in turn would have required either polling each channel or taking an interrupt for each of the 96 sensors (six 16-channel A/Ds were used). This was even more aggravating because the controller had a tempting 16-word FIFO; if it were 16 bits wide, a single interrupt could be used to read all 16 channels from a single A/D converter.

So, I decided to create a new SPI16 controller, based on the supplied version, and use it for the A/D. This was an example of starting with a supplied IP and customizing it for your purposes.
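
With a 16-bit-wide FIFO, one interrupt handler can unpack a full scan of a converter. The C sketch below assumes the AD7490 output format of four channel-address bits followed by 12 bits of conversion data; that layout should be checked against the data sheet. The function names and the FIFO representation (a plain array of received words) are hypothetical and do not reflect the SPI16 register interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADC_CHANNELS 16

typedef struct {
    uint8_t  channel;  /* 0..15 */
    uint16_t value;    /* 12-bit conversion result */
} adc_reading_t;

/* Assumed word layout: [15:12] channel address, [11:0] data. */
adc_reading_t ad7490_decode(uint16_t word)
{
    adc_reading_t r;
    r.channel = (uint8_t)(word >> 12);
    r.value   = (uint16_t)(word & 0x0FFFu);
    return r;
}

/* Unpack up to 16 received words into a per-channel table.
 * Returns the number of words processed. */
size_t drain_fifo(const uint16_t *fifo_words, size_t n,
                  uint16_t out[ADC_CHANNELS])
{
    assert(fifo_words != NULL && out != NULL && n <= ADC_CHANNELS);

    for (size_t i = 0; i < n; i++) {
        adc_reading_t r = ad7490_decode(fifo_words[i]);
        out[r.channel] = r.value;
    }
    return n;
}
```

Indexing the table by the channel address carried in each word, rather than by FIFO position, keeps the readings correctly labeled even if the scan order changes.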

It should be noted that the SPI16 development was quite trivial, as the supplied SPI IP already had parameters for the width of an SPI word, but they were not exposed to the design environment, so they could not be changed. All I had to do was recode them to 16 bits, make a few other minor adjustments, rename the IP to a unique name, and I was done. Finally, I believe the newer versions of the Xilinx EDK tools provide more flexible SPI controllers, so I would not have had to develop my own.

Software design
The software to support the hardware design in meeting the system requirements consisted of a number of elements, as illustrated in Figure 3. There are two distinct boot-time programs (which I'll describe next), along with a real-time operating system (RTOS), the Real-Time Executive for Multiprocessor Systems (RTEMS), and its board-support package (BSP). The application programs (“master” and “laser” for the respective processors) use the environment set up for them by MicroMonitor and RTEMS as a framework for implementing the required system functionality.

Startup of the PowerPC in the Virtex-4FX is a bit unusual, as illustrated in Figure 4. When you design an “embedded processing” system using this part, you must include at least one area of block RAM to serve as the processor's initial code and data storage; this is because the chip is designed to support small systems with no external memory. Because of this, the first code the PowerPC executes is packaged with the rest of the FPGA configuration data, typically stored in a Xilinx Platform Flash or other external nonvolatile memory.

Given that changing the Platform Flash in our system was going to be difficult once the boards were installed, I decided to keep this initial bootstrap code as simple as possible and use it to start up a more capable, network-aware boot monitor. On startup, the bootstrap simply examines several predefined blocks in the board's flash storage (“regular” flash devices on the processor's memory bus, not to be confused with the Platform Flash) for an executable image; for redundancy, up to four images are supported.

Each image found is validated against a CRC-32 checksum and, if that checks out, becomes a candidate for loading and executing. Once the list of bootstrap candidates has been determined, a simple menu is printed to the console serial port, allowing an external system a chance to select an image to load; for our system, this is merely a development feature, as we have no access to the console.

If no choice is made within a configurable time limit, the first valid image is loaded, merely by copying it to a known location in SDRAM and jumping to the entry point. At this point, the bootstrap is complete.
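
The image selection step might look something like the following C sketch. It is not the bootstrap's actual code: the image descriptor, the function names, and the fallback when a chosen image is invalid are all hypothetical, and the CRC-32 shown is the common reflected variant (polynomial 0xEDB88320), which is an assumption since the text says only "CRC-32." Copying to SDRAM and jumping to the entry point are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IMAGES 4  /* up to four redundant images, per the text */

/* Hypothetical descriptor for one candidate image in flash. */
typedef struct {
    const uint8_t *data;  /* start of the image, NULL if the block is empty */
    size_t         len;   /* image length in bytes */
    uint32_t       crc;   /* stored checksum to validate against */
} boot_image_t;

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); assumed variant. */
uint32_t crc32_ieee(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

/* choice: index picked from the console menu, or -1 if the time limit
 * expired with no selection. Returns the index of the image to boot,
 * or -1 if no image is valid. An invalid choice falls back to the
 * first valid image (an assumption of this sketch). */
int select_image(const boot_image_t imgs[], int count, int choice)
{
    int first_valid = -1;

    assert(imgs != NULL && count >= 0 && count <= MAX_IMAGES);

    for (int i = 0; i < count; i++) {
        int ok = imgs[i].data != NULL &&
                 crc32_ieee(imgs[i].data, imgs[i].len) == imgs[i].crc;
        if (!ok)
            continue;
        if (i == choice)
            return i;
        if (first_valid < 0)
            first_valid = i;
    }
    return first_valid;
}
```

Validating every candidate before offering or loading it means a corrupted or partially written image is simply skipped, which is the point of keeping redundant copies.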

Boot monitor
The boot monitor is responsible for taking over from the bootstrap and bringing the system up to full operation. When I began this project in the fall of 2005, I had wanted to use RTEMS as the RTOS for this project, and seeing that there was no BSP for the Virtex-4FX in their distribution, I started with a posting to the RTEMS mailing list to see if anyone had been working on or considering such a BSP.2

In addition to a couple of responses that proved fruitful there, I got one from Ed Sutter about an initial port of his MicroMonitor boot monitor that he had recently got working on the Virtex-4FX.3 While I wasn't yet concerned with a boot monitor, it seemed a good idea to check out a “working program,” since other than the usual “Hello world” app in the Xilinx EDK environment, I had nothing to work with. So, I downloaded the codebase and started there.

As Ed had informed me, he had the basic monitor running, with networking and flash drivers for the Xilinx ML403 development board; unfortunately, we were using different, bigger flash, and I wanted to use the hardware (Xilinx's) TEMAC (hard IP MAC), so new drivers were needed for those features. Of course, networking was critical for my application, and without the ability to write flash, it would be nearly useless, so I had the road laid before me on the old learning curve. Still, having an existing implementation for a similar hardware setup greatly eased the “getting started” stage.

Getting the networking up and running was my first priority, both in terms of risk-reduction (because the serial port was not available once the processors were installed) and because of Ethernet's superior performance for downloading code. Fortunately, Xilinx provides suitable low-level drivers that are operating-system agnostic, which greatly eased the burden of getting a suitable port to Microcross's MicroMonitor; the task was even easier due to the monitor's design emphasizing simplicity—specifically, no interrupts—and a provided hardware-specific template that contains skeletons of the low-level functions required by MicroMonitor.

Because our board had larger and different flash parts, I needed to develop suitable drivers for them in order to use MicroMonitor's Tiny File System (TFS). Because MicroMonitor was designed to be readily ported to any embedded processor board, a framework is provided for incorporating drivers tailored for the devices in use.

As is often the case for similar open-source software, both a general skeleton and a variety of examples are provided within the source tree. If your board happens to use an already supported device, you use the existing driver; if not, you either start with a similar driver and modify it as required, or if none are reasonably similar, you start with the skeleton. In my case, after researching the parts we were using and comparing them to existing drivers, I was able to identify a similar driver to serve as my starting point. A little careful reading of the data sheet, and I had a working driver.

RTEMS board-support package

The biggest hurdle at the start of the project was clearly getting a development environment and the BSP for RTEMS up and running. As mentioned in the previous section, I started this rather daunting task by a simple post to the RTEMS mailing list.4

In addition to the MicroMonitor tip, I received two offers of help and several other notes of interest; the two offers turned out to be invaluable. One was from Thomas Doerfler, a member of the RTEMS Steering Committee, who offered to do all the work with the RTEMS BSP build system to integrate what I came up with into the standard tree; as the project evolved, he also served as a sort of guru and booster.

The other responder had already developed drivers for several of the peripherals, which he was kind enough to supply to the effort. Together with Thomas, we were able to incorporate the drivers into a new BSP for the Virtex, which I was able to build and test with early prototypes of my applications.

Other drivers

Other drivers were needed for the IP chosen to communicate with my system's devices. For example, we used a number of SPIs to connect multichannel A/D and D/A converters, Ramtron FRAM nonvolatile memories, an SD (Secure Digital) card for removable media, and a temperature monitor chip.

This required development of a low-level driver that plugged into RTEMS' driver manager framework, and specifically, its I2C/SPI subsystem. Fortunately, because my custom SPI16 peripheral was based on the stock peripheral, I was able to use the same driver for both; still, this did take a bit of time to get it exactly correct, but once done, I was able to communicate with all the above devices, except the SD card.

The SD card required additional work to handle the unique command structure for SD cards, such that the block-level drivers of the RTEMS file system can read and write data as necessary. This could be an article all in itself, so I won't detail it further here, other than to illustrate a different choice that was made.

I implemented the previous drivers within the BSP proper, essentially making them freely available to my application code. However, for the SD card “middleware” driver, it wasn't clear where to put it in the general sense; furthermore, the RTEMS BSP build system, based on automake and the like, is quite complex and sometimes quite difficult. So I made the alternate choice of building the driver as part of my application, which was relevant for only the main processor anyway, so exposing a bit of extra complexity there seemed worth it, at least to me! The important thing is to remember there are usually several ways to do most things in software!

Lessons learned
I learned these three lessons:

  1. Seek collaborators early and often. Start early by subscribing to a project's mailing list or forum and post a simple statement of your plans. If you're not sure how to start, ask for pointers in getting started. Offer to work with others who are working on similar applications, or who may be interested.
  2. Understand an open-source software project's licensing philosophy, and how it impacts new development. While the Xilinx EDK is supplied with extensive drivers for the IP supplied with the tools, the RTEMS licensing was incompatible with the Xilinx license terms, and therefore drivers needed to be written from scratch for the BSP. This could require extra work on your part, though again, you may be able to get some help!
  3. Expect to put in some good old “elbow grease,” especially if you are working with new target platforms. The more different the board and CPU are, the more work you will need to put in; on the other hand, you will get a much deeper understanding of, and appreciation for, the inner workings—both of the software and your hardware!

Robert S. Grimes is the president of RSG Associates, an embedded systems development firm based in Boston, MA. Bob is a 1983 graduate of MIT, where he first learned of the joys of combining hardware, firmware, and software to control some small aspect of the world at large. In recent years, he has been busy sorting garbage (he developed the control electronics for a recycling system), discovering water (he designed the FPGA “brain” in a near-infrared spectrometer that discovered water on the Moon during the NASA LCROSS mission), and developing vestibular research systems for diagnosing balance disorders at the Massachusetts Eye and Ear Infirmary (Boston, MA) and the University of Berne (Switzerland). Robert Grimes may be reached at

1.    Wikipedia definition of “Q-switching” at
2.    Real-Time Executive for Multiprocessor Systems or RTEMS at
3.    Ed Sutter's MicroMonitor is “a free embedded system boot platform centered around an extensible embedded flash file system called TFS.”
4.    Grimes, Robert S. Comment posted on “RTEMS port to Virtex-4/PowerPC” at
