
Put a configurable 32-bit processor in your FPGA


Employing a configurable processor within your FPGA gives you lots of options that may not have been available with a fixed microprocessor, particularly the ability to adapt to a wide variety of application requirements.

Embedded systems are very different from desktop PCs, but the underlying technology shifts are the same and follow similar growth trends. While desktop PCs are moving to 64-bit processor architectures to address growing memory requirements, embedded systems are rapidly moving to 32-bit processors for the same reason. The desktop/server computing market is consolidating around the x86 microarchitecture, and most of the innovation and differentiation is happening at the system level with dual-, quad-, or multicore architectures, integrated graphics processing units, and memory controllers. Similarly, embedded systems are consolidating around simple 32-bit RISC processors, while significant system-level developments such as multicore architectures, integrated peripherals, and configurable processing enable designers to adapt to rapidly changing application requirements.

According to iSuppli research reports, during 2007, the 32-bit microcontroller (MCU) market is expected to surpass the 8-bit market. As shown in Figure 1, the high-level trend is that while the 32-bit market is expected to outpace the growth of the rest of the semiconductor market, the 8-bit market has actually been shrinking for the last few years.


The primary driver for this trend is the growing software content and complexity of embedded systems. The immediate consequence is that a wider memory bus (32 bits) is required to address the larger code and data footprints of modern software. Unlike older microprocessors, 32-bit processors don't require memory techniques like segmentation to deal with larger memory spaces, which makes them easier to program. While 8-bit MCUs had to be programmed in assembly language to fit small memory budgets (less than 32 Kbytes), many 32-bit embedded applications can be programmed in C/C++, making embedded software developers more productive. More significantly, a large number of operating systems (real time and non-real time) have ready-made drivers and software libraries, enabling software developers to focus on their custom application development tasks.

Integration = lower prices
Smaller silicon process geometries in line with Moore's Law have brought down the cost of 32-bit embedded solutions to meet the price requirements of a broader range of applications. In addition, integrated peripherals and on-chip memory have further reduced the component and total bill-of-materials cost. By integrating peripherals optimized for vertical applications like cell phones and gaming consoles, the price of many devices has been reduced significantly, directly contributing to market growth.

Price pressure also necessitates that only a fixed combination of peripherals can be integrated into these systems, thus the peripheral mix is usually targeted to high-volume applications. However, one size doesn't always fit all, and many small, medium, and even some high-volume applications are underserved by off-the-shelf integrated solutions. As a result, designers must incorporate additional chips to expand the peripheral set, offload the processor, or add glue logic. This is where configurable processing solutions come in.

Configurable 32-bit processing
According to a Gartner Dataquest report, shown in Figure 2, the use of FPGA-based embedded processing is growing and by 2010, it's estimated that 40% of FPGA designs will have embedded processors in them. Embedded system designers are leveraging FPGA-based configurable processing solutions because they can be tailored to their specific application or product requirements. The key benefits of this methodology are cost reduction through integration and product differentiation in the market.


Designs can be modified for higher performance, lower cost, or different I/O standards by selecting a different part in the same FPGA family or retargeting the design to a newer FPGA. In this way, designers can future-proof designs by reducing the risk of obsolescence. This is especially important for products that must remain in service for many years, such as automotive or industrial applications.

The levels of configuration (or customization) that are available with a configurable processing system are:

Processor configuration:

  • Multiplier, divider, floating point unit, and others.
  • Instruction or data cache configuration.
  • Coprocessors or hardware accelerators.

System configuration:

  • I/O peripheral selection, customization, DMA options.
  • Memory peripheral selection, customization.

Application configuration:

  • RTOS selection, customization.
  • Application library/middleware customization.

Embedded networking requirements
Most products containing embedded systems require a networking or communication interface of some kind. Ethernet is one of the most widely deployed networking interfaces because of its low cost, ubiquity, and ability to connect to the Internet using protocols like TCP/IP. Depending on the target application, the requirements of the networking subsystem vary widely. Simple remote-control and monitoring applications need to transfer just a few kilobits per second, whereas high-end storage or video applications need sustained gigabit per second throughput.

For simplicity, let's use TCP payload throughput as the primary metric for performance comparisons. Table 1 illustrates sample applications and their TCP/IP payload throughput requirements.

Configurable embedded networking
FPGA-based processing solutions provide the flexibility to enable or disable higher-level features of processors, IP cores, and software platforms and to fine-tune many individual parameters until application requirements are met at the software level. In addition, performance-critical software functions can be identified using profiling tools and offloaded to hardware accelerators or coprocessors.

Let's look at three example Ethernet subsystems that can use IP cores to meet typical application performance requirements. Each design has a different system architecture, including processor configuration, Ethernet MAC IP configuration, and memory interface. In addition, these examples highlight various TCP/IP software stacks that can be used with the hardware subsystems. Because the hardware building blocks and software layers are built for customizability, you can incrementally scale up or down the performance of these systems based on application requirements.

Ethernet “Lite” subsystem
A minimal networking subsystem such as the one described in Figure 3 is sufficient for the simple networking interfaces in remote-monitoring or control applications. In this class of application, the TCP/IP performance requirement is low (less than 1 Mbit/s), so a small TCP/IP stack such as lwIP (lightweight IP) running without an RTOS is sufficient.

This could be implemented in simple polled mode using an Ethernet Lite IP without interrupts. The complete software, including a simple application layer, could all fit in the local memory available on FPGAs. Other required I/O interfaces, such as the RS-232 UART and GPIO shown in Figure 3, can be added to the basic subsystem.

Higher TCP/IP throughputs (on the order of 10 to 50 Mbits/s) can be achieved by modifying the minimal system in Figure 3 and moving to a more conventional 10/100 Ethernet solution like the one shown in Figure 4. The key changes are:

  • Addition of a DMA engine to the Ethernet MAC, making it interrupt-driven.
  • Addition of external memory to the system and caches to the processor.
  • A more sophisticated TCP/IP stack, like the one in Linux (uClinux).

For applications that require TCP/IP throughput in excess of 100 Mbits/s, consider using tri-mode (10/100/1000) Ethernet MACs as either hard or soft IP cores (Figure 5). To achieve the 500+ Mbits/s throughput required for high-end applications, you can use advanced DMA techniques like scatter/gather DMA (SGDMA) in conjunction with FPGA hardware accelerators including a data realignment engine (DRE) and checksum offload (CSO).

To keep up with Gigabit Ethernet's higher data throughput, a higher-performance embedded (hard) processor or customizable soft processor on the FPGA may be needed, in addition to larger cache sizes, such as 16-Kbyte instruction and data caches. With respect to software platforms, advanced TCP/IP stacks available in Linux, VxWorks, Integrity, and QNX enable functions such as zero-copy and checksum bypass.

Many hardware and software factors influence the TCP throughput that can be achieved. These include:

  1. Processor, including frequency, features, and caches:
    • Frequency: TCP/IP protocol stacks typically copy the payload from a user buffer into a buffer controlled by the stack before the information is copied once again into the Ethernet MAC's FIFO. Some of these memory copies require processor cycles, as they occur in software. The processor is also involved in computing the TCP checksum, which involves reading the whole packet from memory. A faster processor coupled with a faster memory can perform both operations in a shorter time and keep up with the data rates.
    • Features: TCP/IP protocol stacks access each packet in terms of header and payload. Header processing typically involves reading specific bit fields in the header, which results in quite a few shift operations. In addition, multiply operations are performed on each packet that's processed. In a configurable processor, hardware support for these operations, such as a barrel shifter or hardware multiplier, can be enabled to achieve higher performance.
    • Caches: Once a packet is copied over from Ethernet MAC into memory, the packet is passed around different layers of the TCP/IP stack. The packet-processing code in the stack is then executed. Having all of the code and the packet in cache greatly increases processor efficiency and improves Ethernet bandwidth.
  2. Memory: Memory access times and latency have a huge impact on system performance. Typical TCP/IP applications don't fit in local memory; the program and data are part of external memory. Time spent accessing data and instructions will have a big bearing on performance. The memory factor is typically linked with cache sizes. Increased cache size for instruction and data would help offset the latency and access times of an external memory.
  3. Ethernet MAC: The Ethernet MAC peripheral implemented in FPGAs provides flexibility in terms of mode of operation (no DMA versus SGDMA), packet FIFO depth, DRE support, CSO support, and jumbo frame support. Each option trades off area consumed by the MAC to offload features from the processor, thereby improving performance.
  4. TCP/IP stack: Optimized and flexible TCP/IP stack implementations are important factors that contribute to system performance. TCP/IP stack features, like support for CSO in hardware, zero-copy API (where data isn't copied from the application to stack buffers), and configurable stack options help in improving system performance.
  5. Message size: The size of the messages (application data) is another factor that affects performance. As message size decreases, the overhead from TCP/IP headers (like TCP, IP, and Ethernet headers) increases, reducing the obtained data throughput.

Most applications are likely to have a basic set of requirements in terms of price, performance, and feature set. While designing a product for a specific application, designers have to make the right tradeoffs to balance these requirements, which are also likely to change during the product life cycle to adapt to market conditions. Using a flexible, configurable platform enables design trade-offs that can be modified as needed without changing platforms or vendors.

Navanee Sundaramoorthy is an embedded manager at Xilinx Inc. He was previously the engineering manager for embedded platform debug solutions. Sundaramoorthy has an MS in electrical and computer engineering from Brigham Young University, Utah, and a BS in computer science and engineering from Anna University, India.
