The MCU guy’s introduction to FPGAs: The Hardware

A lot of my friends are highly experienced embedded design engineers, but they come from a microcontroller (MCU) background, so they often have only a vague idea as to what an FPGA is and what it does. When pressed, they might say something like “You can configure an FPGA to do different things,” but they really have no clue as to what's inside an FPGA or how one might be used in a design.

The thing is that MCUs are great for some tasks, but not so good at others. When it comes to performing lots of computations in parallel, for example, FPGAs will blow your socks off (so make sure you're wearing elasticated socks before you start playing with these devices). In this column we'll consider the hardware aspects of the FPGA universe; we then take a look at the FPGA equivalent to MCU software in The MCU guy's introduction to FPGAs: The Software.

Simple FPGA fabric
In the context of an integrated circuit, it's common to hear the term fabric, which is used to refer to the underlying structure of the device. (As a point of interest, the word “fabric” comes from the Middle English fabryke, meaning “something constructed.”) Let's start with the core programmable fabric inside an FPGA…

If we were to peer inside the FPGA's package, we would see its silicon chip (the technical term is the die). The programmable fabric is presented in the form of an array of programmable logic blocks as shown in the image below. If we “zoom in” with a metaphorical magnifying glass, we see that this fabric comprises “islands” of logic (the programmable logic blocks) basking in a “sea” of programmable interconnect.


A generic representation of fundamental FPGA programmable fabric.

Why yes, I did create this image with my own fair hand, and I am indeed rather proud of it. Thank you so much for noticing (grin). If we zoom in further, we see that each of the programmable blocks contains a number of digital functions. In this example, we see a 3-input lookup table (LUT), a multiplexer, and a flip-flop, but it's important to realize that the number, types, and sizes of these functions vary from family to family.

The flip-flop can be configured (programmed) to act as a register or a latch; the multiplexer can be configured to select an input to the block or the output from the LUT; and the LUT can be configured to represent whatever logical function is required.

A closer look at LUTs
Our simple example shown above featured a 3-input lookup table (LUT). In the real world, even the simplest FPGAs use 4-input LUTs, while larger, more sophisticated devices may boast 6-, 7-, or 8-input LUTs, but we'll stick with a 3-input version for the sake of simplicity.

We will be discussing the various types of FPGA implementation technologies in a future column. For the moment, we need only note that the programmable elements inside the FPGA may be implemented using antifuses, Flash memory cells, or SRAM memory cells. Let's first consider an FPGA created using an antifuse technology. This is a one-time programmable (OTP) technology, which means that once you’ve programmed the FPGA it stays that way forever.

The easiest way to visualize this is as a cascade of 2:1 multiplexers (MUXs) as shown below. In the case of our antifuse-based FPGA, programming the device would essentially “hardwire” the inputs to the first wave of MUXs to the appropriate 0 and 1 values required to realize the desired logical function. The values shown in the illustration below reflect the fact that we are using this LUT to implement the equation y = (a & b) | c from the previous image. In reality, the MUXs would be implemented using a branching “tree” of FETs, but we really don’t need to worry about the lowest-level implementation details here.


An antifuse-based LUT in which the input values are “hardwired” (left) and an
SRAM-based LUT in which the inputs are fed from SRAM cells (right).

Another very common type of FPGA implementation technology is based on the use of SRAM configuration cells. Once again, we will consider this in more detail in a future column. All we need to note here is that when the board is first powered up, the SRAM-based FPGA is loaded with its configuration (we can think of this as programming the device). As part of this configuration, the SRAM cells acting as inputs to the LUT’s multiplexers are loaded with the desired 0 and 1 values as illustrated above.

I’ve not shown the mechanism by which the 0s and 1s are loaded into the SRAM cells because I don’t want to confuse the issue. For the purposes of these discussions, we really don’t need to worry about how this “magic” takes place. The only thing I will mention here (to give you something to ponder) is that — using a technique called partial reconfiguration — it is possible for one part of the FPGA to instigate the reconfiguration of another part of the FPGA (and vice versa, of course). For those readers coming from a microcontroller and/or software background, we might think of this as being the hardware equivalent of self-modifying code. This technique is very, very powerful, but it comes with the capability to introduce problems that are horrendously difficult to isolate and debug.

General-purpose inputs and outputs
The device will also include general-purpose input/output (GPIO) pins and pads (not shown in the above illustration). By means of its configuration cells, the interconnect inside the device can be programmed such that the primary inputs to the device are connected to the inputs of one or more programmable logic blocks. Also, the outputs from any logic block can be used to drive the inputs of any other logic block and/or the primary outputs from the device. Furthermore, the GPIO pins can be configured to support a wide variety of I/O standards, including voltages, termination impedances, slew rates, and so forth.

The very first FPGA was similar to the architecture discussed in this column. Introduced by Xilinx in 1985, the XC2064 (which was created at the 2 µm technology node) contained an 8 x 8 = 64 array of logic blocks, each containing a 4-input LUT along with some other simple functions. Since that time, FPGAs have evolved dramatically, as we shall see…

More-sophisticated FPGA architectures
As we noted earlier, the very first FPGA, the XC2064, which was introduced by Xilinx in 1985, contained an 8 x 8 = 64 array of logic blocks, each boasting a 4-input LUT along with other simple functions. Since they were limited in terms of capacity, early FPGAs were employed only for relatively simple tasks, such as implementing glue logic or rudimentary state machines. Over time, however, things began to change…

Year-by-year and node-by-node, the capacity and performance of FPGAs increased while their power consumption decreased. The widespread use of 4-input LUTs persisted until around 2006. In fact, at the time of this writing, the smaller FPGA families still use 4-input LUTs, while higher-end devices may use 6-, 7-, or 8-input LUTs. These big boys may be used as a single large LUT or split into smaller functions, such as two 4-input LUTs or a 3-input and a 5-input LUT. In a really high-end device, this programmable fabric is capable of representing the equivalent of millions (sometimes tens of millions) of primitive logic gates.

If a logical function — say a counter — is implemented using the FPGA's programmable fabric, that function is said to be “soft.” By comparison, if a function is implemented directly in the silicon, it is said to be “hard.” (As these functions become larger and more complex, we tend to refer to them as “cores.”) The advantage of soft cores is that you can make them do whatever you want. The advantage of hard cores is that they occupy less silicon real estate, have higher performance, and consume less power. The optimal solution is to have a mix of soft cores (implemented in programmable fabric) and hard cores (implemented directly in the silicon). Thus, in addition to their LUT-based programmable fabric, today's FPGAs may be augmented with a variety of hard cores as illustrated below:


A more sophisticated FPGA architecture.

For example, the device might contain thousands of adders, multipliers, and digital signal processing (DSP) functions; megabits of on-chip memory; large numbers of high-speed serial interconnect (SERDES) transceiver blocks; and a host of other functions.

FPGAs with embedded processors
This is where things start to get really exciting… One of the things you can do with the regular programmable fabric in an FPGA is to use a portion of it to implement one or more soft processor cores. And, of course, you can implement processors of different sizes; for example, you might create one or more 8-bit processors along with one or more 16-bit or 32-bit soft processors — all in the same device.

If the FPGA vendor wishes to provide a higher-performance processor that occupies less silicon real estate and consumes less power, the solution is to implement it as a hard core. One very exciting development is the recent introduction of SoC FPGAs by companies like Altera and Xilinx. Consider the example shown below.


A new class of SoC FPGAs.

This little beauty combines a full hard core implementation of a dual ARM Cortex-A9 microcontroller subsystem (running at up to 1 GHz and including floating-point engines, on-chip cache, counters, timers, etc.), coupled with a wide range of hard core interface functions (SPI, I2C, CAN, etc.), and a hard core dynamic memory controller, all augmented with a large quantity of traditional programmable fabric and a substantial number of general-purpose input/output (GPIO) pins. (As reported in this column, a forthcoming SoC FPGA at the 16nm node will boast quad-core 64-bit ARM Cortex-A53 processors, dual-core 32-bit ARM Cortex-R5 real-time processors, and an ARM Mali-400MP graphics processor. These aren’t your grandmother's FPGAs!)

A traditional embedded system architect can lay one of these devices down on the circuit board and treat it as a conventional high-performance dual-core ARM Cortex-A9 microcontroller. When power is applied to the board, the hard microcontroller core immediately boots up and becomes available prior to any configuration of the programmable fabric. This saves time and effort and lets software developers and hardware designers start development simultaneously.

One scenario is that the software developers capture their code, run it on the SoC FPGA's Cortex-A9 processors, and profile it to identify any functions that are slugging performance and acting as bottlenecks. These functions can then be handed over to the hardware design engineers for implementation in programmable fabric, where they (the functions, not the design engineers) will provide dramatically higher performance using lower clock frequencies while consuming a fraction of the power.

Just a moment! Previously we noted that hard core implementations of functions (and the ARM Cortex-A9 processor shown above is a hard core) have higher performance and consume less power than their soft core equivalents. But now we're saying that if a software function running on a hard core processor is a bottleneck, we can implement it in programmable fabric where it will… you've got it, provide higher performance and consume less power. How can this be? Do I have a clue what I'm talking about? All will be revealed in the next section…

Processors versus hardware accelerators
Just to set the scene and remind ourselves how we came to be here, earlier in this article I said:

If a logical function — say, a counter — is implemented using the FPGA's programmable fabric, that function is said to be soft. By comparison, if a function is implemented directly in the silicon, it is said to be hard. As these functions become larger and more complex, we tend to refer to them as cores. The advantage of soft cores is that you can make them do whatever you want. The advantage of hard cores is that they occupy less silicon real estate, offer higher performance, and consume less power.

But later, when talking about the SoC FPGAs containing dual ARM Cortex-A9 hard core processors, I said:

One scenario is that the software developers capture their code, run it on the SoC FPGA's Cortex-A9 processors, and profile it to identify any functions that are slugging performance and acting as bottlenecks. These functions can then be handed over to the hardware design engineers for implementation in programmable fabric, where they (the functions, not the design engineers) will provide dramatically higher performance using lower clock frequencies while consuming a fraction of the power.

So am I trying to have things both ways? On the one hand I seem to be saying that hard core implementations of functions (and the ARM Cortex-A9 processors discussed above are hard cores) have higher performance and consume less power than their soft core equivalents. But on the other hand I'm saying that if a software function running on a hard core processor is a bottleneck, we can implement it in programmable fabric where it will provide higher performance and consume less power. How can this be?

Actually, this really is a surprisingly easy concept to wrap one's brain around (it's a tad harder to implement, of course). The thing is that general-purpose microprocessors and microcontrollers are really horribly inefficient — the only reason they appear to be so powerful is that we can ramp up the frequency of the system clock to make them perform more operations per second. Dynamic power consumption scales with clock frequency (and with the square of the supply voltage), so doubling the frequency at least doubles the power consumption — more, if the supply voltage also has to be raised to achieve the higher speed.

And even if we do increase the clock frequency, this still leaves the processor “thrashing around” when it comes to performing large amounts of data processing and digital signal processing (DSP) functions. As a simple example, suppose we have three 10 x 10 matrices, called a, b, and y, where each element in these matrices is a 32-bit integer. Suppose that we wish to add the contents of matrix a to the contents of matrix b and store the results in matrix y. If we were to do this on a processor, the pseudo code might look something like the following:


Pseudo code for a 10 x 10 element matrix addition.

Let's reflect on how the processor handles this. We start with a read instruction that loads the value of the first element from matrix a into the CPU. Next we read the corresponding value from matrix b and add it to the value currently stored in the CPU. Then we store the result from our calculation to the appropriate element in matrix y somewhere in the system's memory. And now we have to do the whole thing again… and again… and again… for each of the matrix elements.

By comparison, we could create a dedicated hardware accelerator using the FPGA's programmable fabric. This hardware accelerator could comprise one hundred 32-bit adders, which means that the entire matrix addition could be performed in a single clock cycle. In turn, this means that the clock controlling this hardware accelerator could be running at a much lower speed than the CPU clock, thereby consuming significantly less power.

Of course I'm being a little over-simplistic here, because the CPU will have to load the input values into the programmable fabric and then retrieve the results, but this could be achieved efficiently using a DMA-type process. Furthermore, as opposed to simply adding the two matrices, we might wish to perform a significant amount of logical and mathematical processing on each element, in which case the programmable fabric option starts to look very, very attractive.

The important point to understand is that processors are wonderful when it comes to performing decision-making control tasks, while hardware accelerators are more suitable when it comes to performing large quantities of repetitive data-processing tasks. Thus, the ideal solution is to achieve the optimal balance between those functions that are implemented in the processor and their compatriots that are implemented in a hardware accelerator.

And one final interesting aspect of all of this is that, after the main processor has provided the hardware accelerator with the appropriate data and instructed it to execute its task, the processor can leave the accelerator to perform its magic while it (the processor) is free to go off and do something else. When the accelerator has completed its mission, it can signal the processor, which will retrieve the data when it is ready and able to do so.

Furthermore, the design team may decide to implement a large number of hardware accelerators in the programmable fabric, each tailored to perform a different task. In some cases, these accelerators will work in isolation, communicating only with the main processor; in other cases, one accelerator may hand its results over to another, and so on and so forth until the final accelerator in the chain hands its results back to the main processor.

I am afraid that this column provides only a very modest overview to what can be quite a complex topic, but I hope that it will provide food for thought and stimulate conversation. What do you think? Does this make sense, or does it raise more questions than it answers?



16 thoughts on “The MCU guy’s introduction to FPGAs: The Hardware”

  1. “Max

     ‘A lot of my friends are highly experienced embedded design engineers, but they come from a microcontroller (MCU) background, so they often have only a vague idea as to what an FPGA is and what it does.’

     I guess I would fall into this category…”

  2. “To a certain extent, this is true – you can ‘get by’. You write FPGA code in VHDL or Verilog and don't have to worry about LUTs etc.

     However, understanding what's happening at the lower levels makes it easier to write better code (i.e., faster, using le…”

  3. “Hi Clive, great write-up and easy to understand. Overall, it seems the best way to go is to figure what the application needs are and then determine what tasks can be performed by the MCU and the FPGA, which minimizes the time it takes for an engineer to…”

  4. “I'd echo Cdmanning's comment — and amplify it a bit — in the early days of FPGAs (circa 1985) you actually defined the contents of each lookup table by hand — then we moved to a higher level of abstraction — these days most designers capture their des…”

  5. “As you say, performing an initial production run using an FPGA and then moving to an ASIC for higher-volume production runs is one way to go, but I don't know how often that happens in real life.

     One very common scenario is to use one or more FPGAs to…”

  6. “I picked up FPGAs a few years ago, initially without much knowledge of the underlying architecture. Over time, I gained a better understanding of what was going on at the low levels, and that did help.

     I did a lot of writing code and just seeing what…”

  7. “One thing that helps reduce the time required for the iterative process is to use the simulation tools (ModelSim is included with both Altera and Xilinx compilers). It's much faster to do the simulate/modify until you've got the logic right. It's also ne…”

  8. “The designs I typically work on don't have enough volume to justify the upfront expense of an ASIC, so we just stick with the FPGA for production. Of course, if you're building millions then that's the way to go (assuming you have a design set in stone).”

  9. “I agree — they were useful articles — what I might do is ask Duane to go back through them with the benefit of hindsight — but that will have to be after ESC Boston, which is next week (eeek!)”

  10. “I think I would also agree with Max (of course) and Cdmanning and go a little further and confirm the comment that writing code for micros at embedded level is better if the internal workings of the chip are understood.

      I only wish the halfway stage of PSoC…”

  11. “Excellent illustrations and descriptions. Well done Max!

      I often need to explain the hardware approach to someone who is used to writing software/firmware. Since I began my career drawing NAND gates on vellum, I tend to look at algorithms from a diffe…”

  12. “First of all, thank you so much for your kind words. Second, it's funny that you should talk about a follow-on column, because I posted the ‘Software’ part of this earlier today — I'd be very interested to hear what you think of the way I've described…”

  13. “Duane… ‘I picked up FPGAs a few years ago…’ And as I remember did a very fine series of articles on your endeavours. Max, could Duane's articles be reposted somewhere on Embedded.com — they would be as useful now as then, I think?”

  14. “@Max… be careful about that… with the benefit of hindsight Duane might say ‘How could I have done that??’ and delete something that would vitally benefit a newbie just starting out who would otherwise make the same mistake…”

  15. “Hi,

      Thanks for this series of articles. I just sent them to a software colleague who wants to better understand the firmware in our joint project.

      Note that CPU power increases more quickly than linearly with frequency. See: https://physics.stackexcha…”

  16. “Hi there — thanks for taking the time to post this comment — it's always nice to hear that someone is reading my columns — I'd like to hear what your software colleague thinks about them 🙂”
