Selecting and benchmarking embedded FPGAs
Embedded FPGA, or eFPGA, refers to one or more blocks of FPGA fabric that are embedded in a device like an ASIC, ASSP, or SoC (see also ASIC, ASSP, SoC, FPGA – What's the Difference?).
To put this another way, eFPGA is a digital reconfigurable fabric consisting of programmable logic in a programmable interconnect, normally presented as a rectangular array, with data inputs and outputs positioned around the edges. An eFPGA typically has hundreds or thousands of inputs and outputs that can be connected to buses, data paths, control paths, GPIOs, PHYs, or any other logic on the chip.
All eFPGAs have look-up tables (LUTs) as basic building blocks. A LUT uses its N inputs to index a small table of 2^N stored bits, so it can implement any desired Boolean function of those N inputs. Some eFPGA LUTs have four inputs and some have six. Some LUTs have two outputs. LUTs typically have flip-flops on the outputs; these can be used to store the result or they can be bypassed. These LUT-register combos typically come in groups of four, along with carry arithmetic and shifters to enable the efficient implementation of adders.
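As a rough illustration of the LUT-plus-register idea, here is a minimal behavioral sketch in Verilog. The module name lut4_ff and the parameters INIT and USE_FF are hypothetical names chosen for this example; they do not correspond to any vendor's actual cell, and a real eFPGA loads the table from configuration bits, not from a parameter.

```verilog
// Hypothetical behavioral model of one 4-input LUT with an optional
// output register. Not a vendor primitive; for illustration only.
module lut4_ff #(
    parameter [15:0] INIT   = 16'h8000, // truth table; 16'h8000 is a 4-input AND
    parameter        USE_FF = 1         // 1: registered output, 0: bypass the flop
) (
    input  wire       clk,
    input  wire [3:0] in,
    output wire       out
);
    wire [15:0] table_bits = INIT;      // the 2^4 stored bits
    wire        lut_out    = table_bits[in]; // the inputs index the table

    reg ff;
    always @(posedge clk)
        ff <= lut_out;                  // optional storage of the result

    assign out = USE_FF ? ff : lut_out; // select registered or bypassed output
endmodule
```

Changing INIT changes the Boolean function without changing the hardware, which is the property that makes the fabric reconfigurable.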
The LUTs receive all their inputs from a programmable interconnect network and all of their outputs are fed back into the programmable interconnect network.
In addition to the LUTs, eFPGAs may also include MACs (multiplier/accumulator blocks). These are also connected to the programmable interconnect network and are used to provide more efficient implementations of digital signal processing (DSP) and/or artificial intelligence (AI) functions. For memory, there are blocks of RAM, typically dual-port with a wrapper to make them look either wide and shallow or deep and narrow. As with the LUTs and MACs, these blocks of RAM are connected to the programmable interconnect network.
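The operation a MAC block performs can be sketched in a few lines of RTL. The module name mac16, the clear port, and the 16-bit operand and 40-bit accumulator widths are assumptions made for this example; actual MAC widths and features differ from vendor to vendor.

```verilog
// Hypothetical multiply-accumulate sketch. Widths are illustrative
// assumptions, not those of any particular eFPGA MAC block.
module mac16 (
    input  wire        clk,
    input  wire        clear,          // synchronous clear of the accumulator
    input  wire [15:0] a,
    input  wire [15:0] b,
    output reg  [39:0] acc             // extra bits leave headroom for many sums
);
    always @(posedge clk) begin
        if (clear)
            acc <= 40'd0;
        else
            acc <= acc + a * b;        // multiply, then add to the running total
    end
endmodule
```

Whether the tools map such RTL onto a dedicated MAC or build it from LUTs depends on the vendor's synthesis flow, so it is worth checking the utilization report.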
Finally, the eFPGA has an outer ring of input and output pins that connects the eFPGA to the rest of the SoC, and these pins are connected into the programmable interconnect network.
Software tools are used to synthesize Verilog or VHDL code to program the eFPGA logic and interconnect to implement any desired function.
eFPGAs are handy, new blocks of logic that can be used in many ways to increase the value of an SoC, including the following:
- Wide, fast control logic using hundreds of LUTs
- Reconfigurable network protocols
- Reconfigurable algorithms for vision or AI
- Reconfigurable DSP for aerospace applications
- Reconfigurable accelerators for MCUs and SoCs
- And much more
Today, there are a number of eFPGA suppliers. The main ones are Achronix, Flex Logix, Menta, and QuickLogic, and there are also some smaller vendors. With all these options, customers need to decide which one best meets their needs. While there are business factors to weigh, this article focuses on the technical ones.
Step 1: Process Compatibility
Typically, even at an early stage of IP evaluation, a company will have picked a foundry and process node, or at least a short list of candidates. eFPGAs are available today, or are in development, for TSMC, GlobalFoundries, and SMIC at process nodes including 65nm, 40nm, 28nm, 22nm, 16nm, 14nm, and 7nm.
However, not all vendors have eFPGA for all foundries and process nodes, at least not yet. Check each vendor's website to see whether it supports your chosen process. You should also find out whether the eFPGA in question has been validated in silicon, with a report available under NDA.
And don't forget to check metal stack compatibility. Your choice of critical IP such as SerDes or your application may require you to use a specific metal stack, but not all eFPGA IP is compatible with all metal stacks.
Step 2: Array Size and Features
Not all eFPGA vendors can do very small eFPGAs and not all can do very large eFPGAs. In addition, the nature of the MACs and RAM they support can vary.
You probably have a general sense as to whether you need hundreds of LUTs or hundreds of thousands, along with the needs you have for MACs and RAMs. This may screen out some vendors.
Step 3: Benchmarking Performance Using RTL Representative of Your Requirements
eFPGA vendors will give you their software for evaluation purposes so you can determine -- for your RTL -- how much silicon area and what performance each eFPGA can achieve. You need the eFPGA to be able to operate over the same temperature and voltage range as the rest of your SoC, so make sure what you need is supported.
When benchmarking, it is important to compare apples to apples. For example, you should compare each eFPGA in the same process, at the same process corner (slow/slow or typ/typ or fast/fast), and at the same voltage and the same temperature. You should expect that the software tools from an eFPGA vendor will allow you to check performance at different process corners and voltage combinations.
Be careful that your RTL is appropriate for eFPGA. If you take RTL from your hardwired ASIC design, it will tend to have 20 to 30 logic stages between flip-flops. If you put that in any eFPGA without optimization, it will run very slowly. In an eFPGA, there are always flip-flops on the LUT outputs; this means they are "free," and you can use them to add more pipeline stages to your RTL to obtain higher performance in an eFPGA environment.
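As a small sketch of what adding a pipeline stage looks like, the example below sums four operands in two clocked stages instead of one long combinational chain. The module name sum4_pipelined and the operand widths are assumptions for illustration; the right place to cut the logic in your own design depends on where its critical paths fall.

```verilog
// Hypothetical example: a four-operand sum split across two pipeline
// stages so that each clock period covers only one adder.
module sum4_pipelined (
    input  wire        clk,
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire [15:0] c,
    input  wire [15:0] d,
    output reg  [17:0] sum             // valid two clock edges after the inputs
);
    reg [16:0] ab, cd;                 // stage-1 partial sums, one carry bit each

    always @(posedge clk) begin
        ab  <= a + b;                  // stage 1: two adders in parallel
        cd  <= c + d;
        sum <= ab + cd;                // stage 2: final adder
    end
endmodule
```

The cost is one extra cycle of latency; the flip-flops that hold ab and cd are the ones already sitting on the LUT outputs.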
When it comes to your RTL, make sure you are testing what matters to you.
Consider a 16-bit adder. What you care about is how fast it runs, but the results you see may surprise you if you are not careful. Now visualize a large eFPGA. If the adder is placed in a corner of the array with the inputs and the outputs close by, the measured performance will be much higher than if the adder is located in the middle of the array. That's because, if you are observing the performance from the array input to the array output, the wiring from the array inputs to the adder and from the adder to the array outputs is much longer when the adder is in the middle of the array. In reality, the adder is the same and runs as fast in both locations. The issue is that your test case didn't isolate the adder's performance; it also included the signal runs required to reach the adder.
Below is a graphical example to make the point using a single LUT with wiring that is both very short and very long. The LUT speed doesn't change, but the delay through the interconnect to and from the LUT does.
To counter this effect, especially since you may be comparing two competing eFPGAs of different sizes (all eFPGAs have some granularity to their sizes), you need to place registers on the inputs and the outputs of the logic under test. This ensures the performance you care about is measured equally regardless of array size and placement.
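Applied to the 16-bit adder, that approach might look like the following sketch. The module name adder16_bench and the signal names are hypothetical. The intent is that the timing report's register-to-register path, from a_r and b_r to sum_r, reflects the adder itself and not the routing to the array edge.

```verilog
// Hypothetical benchmark wrapper: registers on both sides of the adder
// so the reported clock rate covers only the flop-to-flop adder path.
module adder16_bench (
    input  wire        clk,
    input  wire [15:0] a,
    input  wire [15:0] b,
    output reg  [16:0] sum_r           // registered result, including carry out
);
    reg [15:0] a_r, b_r;               // input registers

    always @(posedge clk) begin
        a_r   <= a;                    // capture the inputs near the adder
        b_r   <= b;
        sum_r <= a_r + b_r;            // the path being benchmarked
    end
endmodule
```

The paths from the array pins to a_r and b_r, and from sum_r to the pins, still exist; when comparing vendors, read the register-to-register figure and not the pin-to-pin one.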
Below is the RTL and a graphic for a self-contained 16-bit counter used for benchmarking.
    module counter (clk, reset, done);
        input  clk, reset;
        output done;

        reg [15:0] q1, q1r;
        reg        done;

        // 16-bit free-running counter with asynchronous reset
        always @(posedge clk or posedge reset) begin
            if (reset)
                q1 <= 16'd0;
            else
                q1 <= q1 + 16'd1;
        end

        // Register the count so the compare below starts from a flip-flop
        always @(posedge clk) begin
            q1r <= q1;
        end

        // Combinational terminal-count flag
        always @(*) begin
            if (q1r == 16'hffff)
                done = 1'b1;
            else
                done = 1'b0;
        end
    endmodule
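Before handing a benchmark like this to each vendor's tools, it can be worth confirming in simulation that it behaves as intended. The testbench below is a minimal sketch that assumes the counter module above; the name counter_tb, the 10 ns clock period, the reset timing, and the timeout are arbitrary choices for this example.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench for the counter module: releases reset, then
// reports how many clock edges pass before done is asserted.
module counter_tb;
    reg     clk    = 1'b0;
    reg     reset  = 1'b1;
    wire    done;
    integer cycles = 0;

    counter dut (.clk(clk), .reset(reset), .done(done));

    always #5 clk = ~clk;              // arbitrary 10 ns simulation clock

    always @(posedge clk)
        if (!reset)
            cycles = cycles + 1;       // count edges after reset is released

    initial begin
        #22 reset = 1'b0;              // release reset between clock edges
        wait (done === 1'b1);
        $display("done asserted after %0d clock edges", cycles);
        $finish;
    end

    initial begin
        #1000000;                      // safety timeout, well past 2^16 cycles
        $display("timeout: done never asserted");
        $finish;
    end
endmodule
```

With a 16-bit count, done should take on the order of 2^16 clock edges to assert, so a much smaller number would point to a problem in the RTL.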
>> Continue reading this article on our sister site, EEWeb: "Considerations regarding benchmarking eFPGAs (Embedded FPGAs)."