Generally, silicon devices that process information can be classified as SoCs (Systems-on-Chip), ASICs/ASSPs (Application-Specific Integrated Circuits / Application-Specific Standard Products), or FPGAs (Field-Programmable Gate Arrays). It is very difficult to perform power, performance, and cost comparisons between these technologies without looking at specific applications and running benchmarks. However, it may be possible to take an application and map it to comparable devices in the three categories. If we were to do so, we would probably come up with a table like the one below.
Figure 1. High-level comparison of devices (Source: Dr. Shah)
It is evident from this table that each device category has its own strengths and weaknesses.
ASICs/ASSPs, SoCs & FPGAs: A brief description
ASICs/ASSPs are built to perform very specific functions efficiently and, as such, are optimized in terms of performance, power, and cost. They offer the smallest die size and lowest power consumption, and they may have the best performance-to-power ratio. However, ASICs/ASSPs do not offer any flexibility; that is, once the device has been fabricated, it can only perform the specific function for which it was designed and nothing else. As such, ASICs/ASSPs are developed for applications or functions that have matured or are standardized, with no requirement for change in the future. Since ASICs/ASSPs are built to perform very specific functions, from an ROI (return on investment) perspective they make sense only when volumes are large enough to recoup their substantial development costs.
As their name suggests, SoC (System-on-Chip) devices comprise multiple IP blocks that are implemented on a single silicon die. These IP blocks generally consist of one or more processor cores (e.g., ARM, x86, PPC), GPUs, and DSPs; hardware accelerators such as a security/crypto engine and a deep packet inspection engine; and communication interfaces (e.g., Ethernet, PCIe, RIO, SATA). Some SoCs may contain fewer hardware blocks, while others may have more, depending on the target application and market segment. SoCs generally target multiple markets and segments, and are not as narrowly defined as their ASIC/ASSP cousins. These devices offer software flexibility via their processor cores and GPUs, but no flexibility when it comes to their hardware accelerators. That is, customers can change the software portion of the application, but the hardware accelerators only perform the functions for which they were originally designed. For example, if an SoC has a hardware block that supports a particular traffic management functionality (e.g., queuing, scheduling) or only supports a certain security algorithm, then any additional traffic management functions or security algorithms would have to be supported by the programmable portion of the SoC (i.e., the processor cores and GPUs). One may ask why we need hardware accelerators at all. The answer lies in the fact that, although processors continually offer better and better performance, they still lag far behind the performance-size-power ratio offered by hardware accelerators.
Traditional FPGAs are configurable hardware devices; they offer no software programmability. FPGAs can be reconfigured to perform different functions by loading a new bit stream onto the device. They are great compute devices and are especially useful for applications that lend themselves to parallel computation. Thus, if the workload can be split into computations that can be performed simultaneously in parallel, with the results of these individual computations aggregated afterwards, FPGAs will greatly outperform processor-based SoCs. They also provide deterministic latency, as opposed to processor-based systems.
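The split-and-aggregate workload pattern described above can be illustrated in software. The following Python sketch (the function names and the `lanes` parameter are illustrative, not from the article) splits a sum-of-squares workload into independent chunks, computes the partial results concurrently, and aggregates them afterwards:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each 'lane' computes its share independently -- analogous to
    parallel logic paths laid out side by side in an FPGA fabric."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, lanes=4):
    # Split the workload into independent chunks...
    size = (len(data) + lanes - 1) // lanes
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...compute the partial results concurrently...
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        partials = pool.map(partial_sum, chunks)
    # ...and aggregate the individual results at the end.
    return sum(partials)
```

Note that Python threads do not deliver true hardware parallelism for CPU-bound work; the point of the sketch is only the decomposition pattern, which an FPGA executes with genuinely concurrent logic.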
However, the FPGA's hardware flexibility comes at a cost: FPGAs are much larger and consume more power than both ASICs/ASSPs and SoCs, and they are also costlier. FPGAs are made up of look-up tables (LUTs), memory, DSP functions, and different types of I/O (input/output). The LUTs can be configured to perform different logic functions and can be connected to the memory and DSPs.
FPGAs are also used as glue logic to interface between two devices that don’t support matching interfaces, or as devices performing specific functions in low volume applications that require a high level of customization and performance. For example, FPGAs are used as in-line traffic managers and as I/O expanders when the SoC in a system does not have the required number of I/Os.
A new breed of fully-programmable devices: FPSoCs
Recently, the rigid classification described above has started to blur, driven primarily by the needs of OEMs and the requirements of their applications.
The big two FPGA companies — Xilinx and Altera (now Intel) — have developed product lines that feature hardened processor subsystems connected with traditional FPGA programmable fabric, all on the same die. Others, such as Microsemi, are building devices to capture the lower-end markets with their low-end microcontroller and FPGA-based solutions. Yet others, including Intel, are building server-based processors with FPGAs and touting a 20x increase in performance for customized application workloads. In the words of Intel: “The FPGA provides our customers a programmable, high-performance coherent acceleration capability to turbocharge their critical algorithms.”
Currently, there are two popular implementations that companies are following to produce these devices. One is to develop an integrated SoC-FPGA device on a single die, with the device partitioned into an FPGA portion and a processor plus processor subsystem portion (Figure 2).
Figure 2. Integrating the SoC-FPGA on a single die (Source: Dr. Shah)
The other approach is to put the SoC and the FPGA side-by-side as discrete devices, connect them, and house them in a single MCM (multi-chip-module) package (Figure 3).
Figure 3. Mounting the SoC and FPGA dice in a common package (Source: Dr. Shah)
These may be the first steps towards completely embedding FPGA fabric into the SoC. We call these devices “Fully Programmable SoCs” or FPSoCs, as opposed to regular SoCs that allow programmability only on the CPUs.
Architecturally, embedding FPGA fabric into an SoC would entail connecting the FPGA fabric to the chip interconnect and the processor subsystem (L3 Cache and DDR memory) so that it acts like any other accelerator in the system (Figure 4).
Figure 4. Potential FPSoC device architecture (Source: Dr. Shah)
Depending on the application, the FPGA fabric may need to be part of a coherency domain in the system. Implementations prevalent today use existing, so-called off-the-shelf FPGA fabric, which is "dropped in" as-is without modification. The problem with this approach is that these general-purpose FPGA fabrics are large (oftentimes 50% of the size of the whole SoC die), and embedding such a general-purpose fabric in an SoC would be very costly.
The solution to this problem lies in understanding the FPGA architecture. As mentioned earlier, FPGA fabric is made up of logic elements called LUTs, embedded RAM (EBR) blocks, DSP blocks, and I/Os, along with the routing that connects everything together. The EBR and DSP blocks (depending on their number) generally take up 15% of the FPGA die; I/O consumes another 15%; miscellaneous logic takes up 15 to 20%; and the remaining 50% of the die is consumed by the core FPGA fabric, i.e., the LUTs and the routing that connects them. In a general-purpose off-the-shelf fabric, 60 to 70% of this area is taken up by routing. By eliminating the I/O, reducing the EBR, and removing the DSPs (all of which can be made available elsewhere in the SoC), and by optimizing the routing, one can project a size reduction of about 50%. Such a reduction would make a reasonably-sized FPGA fabric comparable to the other hardware accelerators on the SoC. For example, a 40 to 60K LUT optimized fabric implemented in a 28nm process would be approximately 2mm², while a 20 to 30K LUT fabric would be around 1mm². Assuming each LUT is equivalent to 6 to 10 ASIC gates, the 40 to 60K LUT fabric could accommodate 300 to 600K ASIC gates, while a 20 to 30K LUT fabric could accommodate 150 to 300K ASIC gates.
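The capacity arithmetic above can be sketched as a quick calculation. The only input is the article's rule of thumb of 6 to 10 ASIC gates per LUT; the helper name is hypothetical, and the exact figures one quotes depend on which end of the ratio is applied:

```python
def gate_capacity(luts, gates_per_lut=(6, 10)):
    """Project the ASIC-gate capacity range of an embedded FPGA fabric,
    using the rule of thumb of 6 to 10 ASIC gates per LUT."""
    lo, hi = gates_per_lut
    return luts * lo, luts * hi

# Bounds for the two fabric sizes discussed in the text:
small_fabric = gate_capacity(20_000)  # (120000, 200000) gates
large_fabric = gate_capacity(60_000)  # (360000, 600000) gates
```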
To give the reader a sense of what can be accommodated in the 20 to 30K LUT fabric, consider a few examples. An LZR1 lossless compression/decompression core from a leading IP vendor, supporting a data rate of 6Gbps with a compression ratio of between 1.5 and 2.0, takes 2.5 to 3.5K LUTs. Similarly, an IEEE 802.1AE MACsec IP block for PON (Passive Optical Networks) running at 16Gbps will consume only 180K ASIC gates. An IPSec core requires approximately 100K ASIC gates while yielding a performance of 5Gbps. One can also use multiple IPSec cores to increase processing capacity, or to reduce the operating frequency of each core while maintaining the same overall performance.
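As a rough budget check on the figures above, one can convert the quoted gate counts into LUT estimates at the conservative end of the 6-to-10 gates-per-LUT ratio (the helper function and its default are illustrative):

```python
def luts_needed(asic_gates, gates_per_lut=6):
    """Convert an ASIC gate count to a LUT estimate. Using the low end
    of the 6-10 gates-per-LUT ratio gives the most pessimistic figure."""
    return asic_gates // gates_per_lut

fabric_luts = 30_000  # upper end of the smaller fabric option

# IPSec core (~100K gates): fits, with room left for other logic.
ipsec = luts_needed(100_000)   # 16666 LUTs
# MACsec block (~180K gates): just fits at the conservative ratio.
macsec = luts_needed(180_000)  # 30000 LUTs
```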
Given the above discussion, it is easy to see how the embedded FPGA fabric can act as a pre-processor and/or post-processor for other hardware IP blocks. Figure 5 illustrates a possible architecture in which the FPGA fabric is connected to the other hardware acceleration blocks so it can pre-process and/or post-process the data going into or coming out of the hardware accelerators.
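This pre-/post-processing arrangement can be modeled as a simple software pipeline. The three stage functions below are placeholders for fabric and accelerator logic, not real hardware interfaces; the specific transforms are invented purely for illustration:

```python
def fpga_preprocess(packet: bytes) -> bytes:
    """Stand-in for custom parsing done in the fabric, e.g. stripping
    padding before the data reaches a fixed-function block."""
    return packet.strip(b"\x00")

def hw_accelerator(packet: bytes) -> bytes:
    """Stand-in for a fixed-function accelerator (e.g., a crypto
    engine); here just a placeholder byte-reversal transform."""
    return packet[::-1]

def fpga_postprocess(packet: bytes) -> bytes:
    """Stand-in for custom framing applied in the fabric after
    acceleration; here a two-byte big-endian length prefix."""
    return len(packet).to_bytes(2, "big") + packet

def process(packet: bytes) -> bytes:
    # Data flows fabric -> accelerator -> fabric, as in Figure 5.
    return fpga_postprocess(hw_accelerator(fpga_preprocess(packet)))
```

The key architectural point is that the fixed-function middle stage stays untouched while both ends of the pipeline remain customer-customizable.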
Figure 5. Proposed FPSoC device architecture illustrating how the FPGA fabric would be connected (Source: Dr. Shah)
While FPSoCs offer advantages, there are also challenges in building such a device. Let's start by considering some of the key advantages, and then move on to some of the key challenges:
- Lower Frequency and Lower Power Consumption: As mentioned earlier, FPGAs are by design parallel machines with wide data paths; as such, the FPGA fabric can run at a relatively lower frequency than a CPU core or other hardware accelerator and still meet the data rate requirements. Dynamic power is directly proportional to frequency, so lowering the frequency reduces power consumption. In one study [1], different sorting algorithms were implemented on an FPGA and on multiple SoCs. The results showed a 2.5-fold power advantage when the FPGA was used as opposed to an SoC, resulting primarily from the lower frequency required by the FPGA.
- Deterministic Latency: The run time of a function on a CPU can vary from one run to another. This variability in the run time is stochastic in nature and is caused by interrupts, system calls, cache misses, and bus congestion to name a few. This variability is a source of concern in applications where deterministic latency is required, such as the L2 scheduling for 5G networks. However, if the same function were to be implemented on an FPGA, there would be no variability or jitter. FPGAs are deterministic machines.
- Application Acceleration: FPGAs are very efficient parallel processing machines. If an application can be parallelized, it can run very efficiently on an FPGA. Applications like Hadoop, deep packet inspection, voice recognition, and LTE/5G channel scheduling are good candidates for FPGA fabric acceleration. In [1] and [2], the performance of an FPGA-based implementation is compared against a purely software-based implementation for two applications: 1) sorting; and 2) Hadoop MapReduce. The authors found that application acceleration with the FPGA resulted in 10x and 20x performance improvements, respectively, when compared against best-in-class SoCs.
- Protection Against Evolving Specifications: Oftentimes, standards evolve at a faster pace than SoCs can match. In order to keep their systems compliant and interoperable, system architects often use discrete FPGA devices to accommodate changes in the specification. With an FPSoC, system designers may not need the extra FPGA device, since the embedded FPGA fabric can be programmed to address any incompatibilities, thereby bringing down the cost of the system.
- Bridging Devices and Glue Logic: Traditionally, camera and display interfaces were supported by either an RGB (LVCMOS) parallel bus or LVDS/subLVDS source-synchronous interfaces. However, these traditional interfaces are gradually migrating to MIPI-based solutions. The problem is that not all camera and/or display interfaces — and thus mobile SoCs — have migrated to the MIPI DPHY or MPHY standards. This means that a bridge device is required to connect between traditional interfaces and new ones. This bridge device is often an FPGA. In an FPSoC, FPGA fabric is already embedded in the SoC, and this fabric can be used to perform the bridging function. (Observe that, in Figure 5, we also connected the I/Os [not all I/Os need to be connected] to the FPGA fabric for this very reason.)
- OEM Customization: OEMs customize their solutions to provide differentiation and added value compared to their competitors' offerings. FPSoC devices offer OEMs the ability to customize not only the software, but also the hardware. For example, customers may wish to change how incoming or outgoing traffic is scheduled and queued based on the application's requirements. With a hybrid SoC-FPGA device, one can change the queue structure and the scheduling algorithms in RTL without sacrificing performance or functionality. Another example is traffic classification and parsing, which can be customized in the FPGA to whatever the customer requires, again without sacrificing performance. Customization can also be achieved by changing the software to implement the additional functionality on the CPU, but there is always a performance hit associated with this approach.
- Power: FPGA fabrics are power hungry compared to ASICs/ASSPs and SoCs; the rule of thumb is that a single LUT is equivalent to 6 to 10 ASIC gates. Embedding the FPGA fabric may therefore increase the overall power consumption. This increase must be mitigated by: 1) carefully pruning the fabric routing, as mentioned earlier; and 2) implementing appropriate power management techniques, such as putting the fabric into sleep or deep sleep, with or without state retention, during non-active periods. During my tenure as Director of Architecture and Systems at Lattice Semiconductor, I introduced power management techniques in FPGAs, where such techniques had traditionally not existed (not just at Lattice, but across the whole FPGA industry).
- Size/Capacity: The size of the FPGA fabric would depend on the application acceleration required. This means that determining the size of the FPGA fabric required is a challenge if the target application space is not narrow enough.
- Tools: Integration of programming and debug tools is another big challenge. Proper integration of tools is essential for the success of such a device. System developers should be able to seamlessly transition from writing C code to RTL (or C-like-RTL) code. System developers should not require two separate skill sets to configure and program the device. The tools should also help in determining the optimum partitioning of work done on the CPUs, hardware accelerators, and the FPGA fabric.
- Configuration: FPGAs are programmable devices. They require a boot program called a "bit stream" to configure the FPGA fabric; once configured, the fabric implements the desired RTL function. FPGA configuration, however, takes a relatively long time: depending on the size of the fabric, it can require several milliseconds. This configuration time may be tolerable if the FPGA fabric is configured only at boot time, but it is not acceptable if runtime reconfiguration is required.
- Secure Boot: The ability to perform a secure boot is a requirement for all modern systems. In the case of embedded FPGA fabric, the bit stream would be loaded into the device in much the same way as it is done today; the only difference is that it would be bundled with the boot code for the SoC. If dynamic reconfiguration is desired, or if the bit stream is updated in the field, then once again this would occur in much the same way as application code is securely updated today. However, careful consideration must be given to developing the correct protocol or sequence for updating the configuration of the embedded FPGA fabric.
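The lower-frequency advantage listed in the first bullet above follows from the standard CMOS dynamic-power model, P = a·C·V²·f. A minimal numeric sketch (the capacitance and voltage values are illustrative, not measured figures from any device):

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Classic CMOS dynamic power model: P = a * C * V^2 * f,
    where a is the switching activity factor."""
    return activity * c_eff * vdd**2 * freq

# A wide, parallel fabric meeting a data rate at 200 MHz versus a
# serial core needing 1 GHz: with all other factors held equal,
# power scales linearly with clock frequency.
p_fabric = dynamic_power(c_eff=1e-9, vdd=1.0, freq=200e6)  # 0.2 W
p_cpu    = dynamic_power(c_eff=1e-9, vdd=1.0, freq=1e9)    # 1.0 W
```

In practice the comparison is less clean (the fabric's effective capacitance and activity differ from a CPU's), but the frequency term is the lever the bullet above refers to.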
In this article, I have introduced the concept of the Fully Programmable SoC (FPSoC). These devices have FPGA fabric embedded just like any other hardware accelerator on the SoC. Embedding the FPGA fabric in the SoC offers several operational advantages and application performance benefits; some companies have claimed that, for certain applications, the performance improvement can be in excess of 20x. With appropriate architectural decisions and innovations, it is quite possible that FPSoC devices may offer a better performance/power/cost ratio than traditional SoCs. Having worked extensively on multiple generations of FPGAs and multicore SoCs, I am confident that we will see more and more companies embracing this concept in the coming months and years.
- [1] R. Mueller, et al., "Sorting Networks on FPGAs," VLDB Journal, Vol. 21, Issue 1, Feb 2012, pp. 1-23.
- [2] Y.-M. Choi and H.-H. So, "Map-Reduce Processing of k-means Algorithm with FPGA-Accelerated Computer Cluster," in IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors, June 2014, pp. 9-16.
Dr. Syed Ijlal Shah has worked on, and contributed heavily to, leading-edge technologies over the past 15+ years. He has written over 40 academic and professional papers, and has contributed to several standards bodies, including the RapidIO Trade Association, the ATM Forum, and Power.org. Dr. Shah has also been granted four patents, some of which are part of the RapidIO standard.
Dr. Shah has worked extensively on large multicore processors, FPGAs, network processors, and ATM/IP switches and routers. He was a key member of the System Architecture team at Freescale that worked on large multicore products, as well as a key member of the team that developed the C-Port Network Processor. Dr. Shah has also worked extensively on IP/MPLS/ATM traffic management and network management, and has been granted patents on packet voice and data admission control algorithms.
Dr. Shah has worked at Nortel Networks, Freescale, and Lattice Semiconductor. At Lattice Semiconductor, he was Director of Systems Architecture, responsible for all new product architectures and definitions. He is currently Director of Interconnect Architecture at Arteris, a multinational semiconductor technology firm that develops on-chip interconnect IP for use in System-on-Chip designs for a variety of devices, in particular mobile and consumer. Dr. Shah received his Ph.D. from Columbia University, New York.