Hybrid architecture speeds AI, vision workloads - Embedded.com


A novel hybrid data-flow and Von Neumann architecture can accelerate workloads including neural networks, machine learning, computer vision, DSP and basic linear algebra subprograms.

Quadric, a Silicon Valley startup, has built an accelerator designed to speed both AI and standard computer vision workloads in edge applications such as robotics, factory automation and medical imaging. The company’s hardware architecture is a novel hybrid data-flow and Von Neumann design that can handle workloads including neural networks, machine learning, computer vision, DSP and basic linear algebra subprograms.

“Right from the start, we were very aware that AI is not the only application that’s needed for on-device computing on edge devices,” Quadric’s CEO Veerbhan Kheterpal told EE Times. “The developers of these products need for the full system to be able to run classical high-performance computing algorithms, along with AI. That’s really the full system requirements.”

Kheterpal stressed that the architecture is not a collection of accelerators for individual workloads. Rather, it’s a unified architecture with a data-parallel instruction set designed to accelerate varied workloads, including AI inference.

“Where AI is moving lately, there are some interesting trends around replacing entire layers with a fast Fourier transform (FFT),” said Daniel Firu, Quadric’s chief product officer. Quadric is positioning itself to accelerate those types of workloads, citing a recent paper from Google in which researchers sped up a transformer network by replacing some layers with an FFT. Google replaced the self-attention sublayer of a transformer encoder with an FFT, producing a network that achieved 92 percent of BERT’s accuracy on the GLUE benchmark; training was up to seven times faster on GPUs and twice as fast on Google TPUs.
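To make the idea concrete, here is a minimal NumPy sketch of a Fourier mixing step of the kind the Google paper describes: a 2D discrete Fourier transform over the sequence and hidden dimensions, keeping only the real part. The function names, array shapes and random input are illustrative assumptions and are not taken from Quadric's or Google's code.

```python
import numpy as np


def fourier_mixing(x: np.ndarray) -> np.ndarray:
    """Mix tokens with a 2D DFT and keep the real part.

    x has shape (seq_len, hidden_dim). No weights are learned here,
    which is why this step is cheap compared with self-attention.
    """
    return np.fft.fft2(x).real


def dft_matrix(n: int) -> np.ndarray:
    """Dense n x n DFT matrix, used only to cross-check the FFT result."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)


# Illustrative input: 8 tokens with 4 features each (assumed sizes).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))

mixed = fourier_mixing(tokens)

# The same operation written as two dense matrix products.
reference = (dft_matrix(8) @ tokens @ dft_matrix(4)).real
```

The dense cross-check shows why such a layer suits hardware that handles both DSP and linear algebra: the same result can be computed as an FFT or as a pair of matrix multiplications.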

Quadric’s developer kit, an M.2 card with the Q16 processor and 4 GB of external memory (Source: Quadric)

Vineyard robots

Quadric’s three co-founders, Veerbhan Kheterpal, Daniel Firu and Nigel Drego, previously founded 21, a bitcoin mining company that was sold to Coinbase. Quadric, based in Burlingame, Calif., didn’t start out designing chips. Instead, it originally built agricultural robots that could walk up and down Napa Valley vineyards, inspecting the vines and sending alerts when they spotted irrigation leaks or pests.

Veerbhan Kheterpal (Source: Quadric)

“When we were building it, we realized that it was not going to be a viable product built from the drone supply chain at $5 to $10,000,” said Kheterpal. “It would have to be built from the tractor supply chain costing $50,000, and carry large PCs with GPUs on it, with tons of cameras. That’s when we set out to look under the hood of all that robotics software and discovered what was fundamentally causing this energy need to go up with platforms like Nvidia and Intel.”

The company pivoted to building an accelerator chip – “the chip we wished we had,” according to Firu.

Quadric launched a seed funding round in 2017, followed by a Series A round that raised $13 million from potential customers, including lead investor Denso, a Japanese automotive Tier-One supplier. Quadric’s total funding is $18 million.

Turing complete

Quadric employs an instruction-driven architecture that takes elements from data-flow architectures and combines them with elements of a Von Neumann machine. The aim is to replace heterogeneous systems in edge devices with something less complex. As Turing complete machines, Quadric Vortex cores combine acceleration with flexibility, the company claims. The architecture scales by changing the size of the core array and is portable to advanced (7- or 5-nm) process nodes. This suits it to edge device applications with power budgets from a few hundred milliwatts to about 20W.

The company’s first chip, the Q16, is an array of 16 x 16 Vortex cores. Each core has the ability to perform matrix multiplication and AI calculations, but each also has a multifunctional ALU, for operations like AND, OR, reduction, shift and others. Software allows developers to express varied algorithm types, including LSTM activation functions and more. If-Then-Else statements are available across the entire array, allowing developers to take advantage of fine-grained sparsity.

Each core in the array has single-cycle access to its neighboring cores, plus single-cycle access to in-core memory of 4 Kb. On-chip memory is also included alongside the array, giving cores low-latency, deterministic access.
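Neighbor access matters because many vision kernels are stencils, where each output depends on a pixel and its immediate neighbors. The NumPy sketch below models a 16 x 16 array in which every core holds one value and adds the values of its four nearest neighbors. The zero-valued border and the function name are assumptions made for illustration, not a description of how Vortex cores handle array edges.

```python
import numpy as np


def neighbor_sum(grid: np.ndarray) -> np.ndarray:
    """Each cell adds its own value to its up/down/left/right neighbors.

    Cells on the array boundary see zeros beyond the edge (an assumption).
    """
    padded = np.pad(grid, 1)  # one-cell zero border on every side
    center = padded[1:-1, 1:-1]
    up = padded[:-2, 1:-1]
    down = padded[2:, 1:-1]
    left = padded[1:-1, :-2]
    right = padded[1:-1, 2:]
    return center + up + down + left + right


# One value per core in a 16 x 16 array, matching the Q16 array size.
cores = np.ones((16, 16), dtype=np.int32)
stencil = neighbor_sum(cores)
```

With all-ones input, interior cells sum to 5, edge cells to 4 and corner cells to 3, which makes the neighbor pattern easy to check by eye.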

The cores operate in parallel in what Quadric calls a “single instruction, multiple decode” manner: every core receives the same instruction on every cycle, but each core can interpret that instruction differently based on dynamic data at runtime. That allows cores, or groups of cores, to perform slightly different functions.
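A toy model can illustrate the idea of one broadcast instruction being decoded differently per core. In the sketch below, every core receives the same hypothetical "activate" instruction, and a per-core mode value chosen at runtime decides whether that means ReLU or absolute value. The instruction, the mode encoding and the function name are invented for illustration; the article does not describe Quadric's actual instruction set.

```python
import numpy as np


def broadcast_activate(values: np.ndarray, local_mode: np.ndarray) -> np.ndarray:
    """Apply one broadcast instruction, decoded per core.

    local_mode == 0: the core treats the instruction as ReLU.
    local_mode == 1: the core treats it as absolute value.
    NumPy evaluates both branches and then selects; real hardware
    would presumably execute only the branch each core decodes.
    """
    as_relu = np.maximum(values, 0)
    as_abs = np.abs(values)
    return np.where(local_mode == 0, as_relu, as_abs)


# A 2 x 2 corner of the array: top row in mode 0, bottom row in mode 1.
values = np.array([[-2, 3], [-1, 4]])
local_mode = np.array([[0, 0], [1, 1]])
result = broadcast_activate(values, local_mode)
```

Here the top row clamps its negative value to zero while the bottom row flips its sign, even though both rows were issued the same instruction.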

Also included is a dedicated broadcast bus that optimizes bandwidth into the array and can be used to broadcast constants, such as neural network weights, into all the cores at once (Firu said many computer vision algorithms also have some loop-invariant information which can be mapped onto the bus).

Dynamic information enters the array via static, software-controlled load-store units, which allow for deterministic kernel runtimes. The device supports simultaneous load and store on any two of its edges, and one edge has a special property that allows it to be used for sending neural network weights. Loading from two edges while storing from a third can reduce execution time.

Daniel Firu (Source: Quadric)

“You can load into one side and then store from a perpendicular side,” Firu said. “That allows for some pretty interesting stuff to happen at the software level. You can also start to do things like data re-mappings and rotations of images and things like that using this paradigm.”
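One way to see how perpendicular load and store edges can produce a rotation is the toy model below. It assumes a square array in which each image row is pushed in through the left edge as a column and data is then drained row by row through the top edge. The shifting scheme and the function name are assumptions for illustration only and do not describe Quadric's actual load-store units.

```python
import numpy as np


def rotate_via_edges(image: np.ndarray) -> np.ndarray:
    """Rotate a square image 90 degrees clockwise using edge shifts only."""
    n = image.shape[0]
    grid = np.zeros_like(image)

    # Load phase: each image row enters the left edge as a column,
    # pushing previously loaded columns one step to the right.
    for r in range(n):
        grid[:, 1:] = grid[:, :-1].copy()
        grid[:, 0] = image[r, :]

    # Store phase: drain the top row, then shift every row up by one.
    out = np.zeros_like(image)
    for k in range(n):
        out[k, :] = grid[0, :]
        grid[:-1, :] = grid[1:, :].copy()
    return out


image = np.arange(16).reshape(4, 4)
rotated = rotate_via_edges(image)
```

In this model the rotation costs no arithmetic inside the array: it falls out of the order in which data enters one edge and leaves the perpendicular one.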

Meanwhile, software-controlled static memories (not cache) on-chip offer space for large data structures. Quadric provides API access to these memories so developers can define arbitrary data structures inside them. In the Q16 chip, the memories total 8 MB, enough to fit “two or three frame buffers at HD in there, or an entire neural network of weights,” said Firu.

Software stack

Quadric built its software stack before silicon. Customers have been using it with the company’s architecture simulator, or with FPGAs, for a year, Kheterpal said. Quadric’s stack abstracts away the architecture and instruction set through an LLVM-based compiler, with a C++ API on top.

Source Mode supports different data-parallel algorithms with source-level C++ control of the processor’s architectural features. As neural networks become more complex, Source Mode also allows developers to express custom operations.

Quadric’s software stack (Source: Quadric)

A future update to the stack will offer a no-code Graph Mode, which will support TensorFlow or ONNX versions of neural networks. That will include a TVM-based deep neural network (DNN) compiler which automatically generates code.

“We’re combining the power of no-code with the flexibility to have your own custom code, and combine them in interesting ways to achieve your application,” said Kheterpal. “Most platforms will only offer an AI-specific architecture with some kind of DNN compiler – but what about customization? What about a DNN that’s not supported? What about operators that are not supported? We don’t have those restrictions because this is a Turing complete core, the cores can do any operation. The code flexibility gives developers the ability to write whatever algorithm they want.”

Chip roadmap

Quadric’s Q16 chip, which features 256 Vortex cores in a 16 x 16 array in 16 nm silicon, offers 4 INT8 DNN TOPS. It can run ResNet-50 at 200 inferences per second (for INT8 parameters at 224 x 224 image size), consuming an average of 2W.
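A few derived figures help put those numbers in context. The short calculation below uses only the quoted specifications, plus one outside assumption: that a ResNet-50 inference at 224 x 224 takes roughly 4.1 billion multiply-accumulates, a commonly cited approximation that does not come from Quadric. The utilization estimate is therefore a rough sketch, not a vendor figure.

```python
# Figures quoted in the article.
PEAK_INT8_OPS_PER_S = 4e12      # 4 INT8 TOPS
NUM_CORES = 16 * 16             # 256 Vortex cores
INFERENCES_PER_S = 200          # ResNet-50, INT8, 224 x 224
AVERAGE_POWER_W = 2.0

# Assumption (not from the article): approximate ResNet-50 workload.
ASSUMED_MACS_PER_INFERENCE = 4.1e9
OPS_PER_MAC = 2                 # one multiply plus one add

peak_ops_per_core = PEAK_INT8_OPS_PER_S / NUM_CORES
energy_per_inference_mj = AVERAGE_POWER_W / INFERENCES_PER_S * 1e3
inferences_per_joule = INFERENCES_PER_S / AVERAGE_POWER_W

achieved_ops_per_s = INFERENCES_PER_S * ASSUMED_MACS_PER_INFERENCE * OPS_PER_MAC
estimated_utilization = achieved_ops_per_s / PEAK_INT8_OPS_PER_S

print(f"Peak per core: {peak_ops_per_core / 1e9:.3f} GOPS")
print(f"Energy per inference: {energy_per_inference_mj:.1f} mJ")
print(f"Inferences per joule: {inferences_per_joule:.0f}")
print(f"Estimated utilization: {estimated_utilization:.0%}")
```

Under these assumptions, each inference costs about 10 mJ, and the ResNet-50 throughput corresponds to roughly 41 percent of the chip's peak INT8 rate.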

Quadric’s roadmap includes a second generation of the architecture, plus a tapeout of a Q32 chip (an array of 1,000 cores), “probably in 7 nm,” said Firu. While the Q16 is strictly an accelerator that sits alongside a system host processor, the Q32 under development may also include Arm or RISC-V cores to act as host.

An M.2-format developer kit, with a Q16 processor alongside 4 GB of external memory directly mapped to the Q16’s universal memory space, is available now.

>> This article was originally published on our sister site, EE Times.
