I recently attended the 2018 Xilinx Development Forum (XDF) in Silicon Valley. There I was introduced to Mipsology, an artificial intelligence (AI) startup that claims to have solved the AI-related problems associated with field-programmable gate arrays (FPGAs). Mipsology was founded with a grand vision: to accelerate the computation of any neural network (NN) with the highest performance achievable on FPGAs, without the constraints inherent in their deployment.
Mipsology demonstrated the ability to process more than 20,000 images per second, running on the newly announced Alveo boards from Xilinx and handling a collection of NNs that included ResNet50, InceptionV3, and VGG19.
Introducing neural networks and deep learning
Loosely modeled on the web of neurons in the human brain, a neural network is a complex mathematical system that can learn tasks on its own, and it is the foundation of deep learning (DL). By looking at many examples or associations, an NN can learn connections and relationships faster than a traditional recognition program. The process of configuring an NN to perform a specific task by learning from millions of samples of the same type is called training.
For example, an NN might listen to many vocal samples and use DL to learn to “recognize” the sounds of specific words. This NN could then sift through a list of new vocal samples and correctly identify those containing words it has learned, using a technique called inference.
Despite its complexity, DL is based on performing simple operations, mostly additions and multiplications, in the billions or trillions. The computational demand of performing such operations is daunting. Over the life of a deployed network, the computing needed for DL inference exceeds that needed for DL training: training must be performed only once, whereas a trained NN must perform inference again and again, once for each new sample it receives.
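To make the "additions and multiplications" point concrete, here is a minimal sketch of a single fully connected NN layer written as explicit multiply-and-add steps, with a counter for each. The function name, the layer sizes, and the numbers are illustrative assumptions, not taken from any network mentioned in this article.

```python
def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer and count the arithmetic it takes.

    inputs:  list of floats, length n_in
    weights: list of n_out rows, each a list of n_in floats
    biases:  list of floats, length n_out
    Returns (outputs, multiplications, additions).
    """
    outputs = []
    mults = 0
    adds = 0
    for row, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, row):
            acc += x * w  # one multiplication and one addition per weight
            mults += 1
            adds += 1
        outputs.append(max(0.0, acc))  # ReLU activation: a simple comparison
    return outputs, mults, adds


# Hypothetical toy layer: 3 inputs, 2 outputs.
outputs, mults, adds = dense_layer(
    inputs=[1.0, 2.0, 3.0],
    weights=[[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]],
    biases=[0.0, -1.0],
)
print(outputs, mults, adds)  # [0.0, 5.0] 6 6
```

The operation count is simply the number of inputs times the number of outputs, so a layer with thousands of inputs and outputs already needs millions of multiply-add pairs for a single sample.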
Four choices to accelerate deep learning inference
Over time, the engineering community has turned to four different computing devices to process NNs. In increasing order of processing power and power consumption, and in decreasing order of flexibility/adaptability, these devices are central processing units (CPUs), graphics processing units (GPUs), FPGAs, and application-specific integrated circuits (ASICs). The table below summarizes the main differences among the four computing devices.
Comparison of CPUs, GPUs, FPGAs, and ASICs for DL computing (Source: Lauro Rizzatti)
CPUs are based on the von Neumann architecture. While flexible (the reason for their existence), CPUs suffer from long latency because memory accesses consume several clock cycles to execute even a simple task. For tasks that benefit from the lowest latencies, such as NN computation and, specifically, DL training and inference, they are the poorest choice.
GPUs provide high computation throughput at the cost of decreased flexibility. Furthermore, GPUs consume significant power that demands cooling, making them less than ideal for deployment in data centers.
While custom ASICs may seem to be an ideal solution, they have their own set of issues. Developing an ASIC takes years. DL and NNs are evolving rapidly with ongoing breakthroughs, making last year's technology irrelevant. Plus, to compete with a CPU or a GPU, an ASIC would need a large silicon area built on the most advanced process node. This makes the upfront investment expensive, without any guarantee of long-term relevancy. All things considered, ASICs are effective for specific tasks.
FPGA devices have emerged as the best possible choice for inference. They are fast, flexible, and power efficient, and they offer a good solution for data processing in data centers, at the edge of the network, and under the desks of AI scientists, especially in the fast-moving world of DL.
The largest FPGAs available today include millions of simple Boolean operators, thousands of memories and DSPs, and several Arm CPU cores. All these resources work in parallel, so each clock tick can trigger up to millions of simultaneous operations, resulting in trillions of operations performed per second. The processing required by DL maps quite well onto FPGA resources.
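The step from "operations per clock tick" to "operations per second" is a single multiplication, sketched below. The resource count and clock frequency are hypothetical round numbers chosen for illustration; they do not describe any particular FPGA or any Mipsology result.

```python
def peak_ops_per_second(parallel_ops_per_cycle, clock_hz):
    """Upper bound on throughput if every parallel unit is busy on every cycle."""
    return parallel_ops_per_cycle * clock_hz


# Hypothetical device: 10,000 operations per clock tick at 500 MHz.
ops = peak_ops_per_second(parallel_ops_per_cycle=10_000, clock_hz=500_000_000)
print(ops / 1e12, "trillion operations per second")  # 5.0 trillion operations per second
```

This is a peak figure. Sustained throughput depends on how well a given NN keeps those parallel resources fed with data.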
FPGAs have other advantages over CPUs and GPUs used for DL, including the following:
They are not limited to certain types of data. They can handle non-standard, low-precision data types, which are better suited to delivering higher throughput for DL.
They use less power than CPUs or GPUs — usually five to 10 times less average power for the same NN computation. Their recurring cost in data centers is lower.
They can be reprogrammed to fit any task while remaining generic enough to accommodate a variety of workloads. DL is evolving rapidly, and the same FPGA can fit new requirements without needing next-generation silicon (as is typical with ASICs), thereby reducing the cost of ownership.
They range from large to small devices. They can be used in data centers or in an internet of things (IoT) node. The only difference is the number of blocks they contain.
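The first advantage in the list above, low-precision data, can be illustrated with a minimal sketch of symmetric integer quantization. This is a generic textbook technique shown under stated assumptions, not a description of how Mipsology or any FPGA toolchain implements it. The function names, the `bits` parameter, and the sample vectors are all hypothetical.

```python
def quantize(values, bits=8):
    """Map floats to signed integers of the given width using symmetric scaling."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def quantized_dot(a, b, bits=8):
    """Dot product computed with integer multiply-accumulates only."""
    qa, scale_a = quantize(a, bits)
    qb, scale_b = quantize(b, bits)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer arithmetic throughout
    return acc * scale_a * scale_b  # convert back to a float once, at the end


# Hypothetical activations and weights.
a = [0.12, -0.5, 0.33, 0.9]
b = [1.5, 0.25, -0.75, 0.6]
exact = sum(x * y for x, y in zip(a, b))
for bits in (8, 4):
    print(bits, "bits:", quantized_dot(a, b, bits), "exact:", exact)
```

Narrower integers need less logic per multiplier, so more of them fit in the same device. The trade-off is a small approximation error that grows as the bit width shrinks, which is why the freedom to choose a non-standard width matters.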
All that glitters is not gold
An FPGA's high computational power, low power consumption, and flexibility come at a price: FPGAs are difficult to program.