Distributed architectures offer effective solution for AI workloads

Artificial intelligence (AI) has gone from the stuff of science fiction to an integral part of our lives in a relatively short time. When you think of AI, you might picture self-driving cars or computers capable of surpassing humans in chess, Go, or “Jeopardy.” The reality is that AI applications are everywhere: in customized Google News feeds, Pandora playlists, Netflix recommendations, smart speaker voice recognition, natural language processing in smart assistants, computer vision in vehicles, and smart factories, among countless other examples. When you buy from Amazon, machine learning (ML) is working behind the scenes, from making buying recommendations to reducing click-to-ship time to just 15 minutes [1].

Editor's Note: This article is part of an AspenCore Special Project on application of AI at the edge.

As AI applications have become important to consumers, billions of dollars are now at stake in the business world. For example, 97% of mobile phone users use AI-powered voice assistants [2]. A misinterpreted voice command by Siri or Cortana may be a minor annoyance to us, but losing in the voice assistant market represents billions of dollars of lost share [3] in the battle between Apple, Amazon, and Google. And there are even more serious challenges – potentially fatal outcomes and legal implications from erroneous self-driving algorithms [4] or a misdiagnosis in the health care industry [5].

There is a race to make AI results relevant, reliable, and readily available. Only those with AI models trained on the best machine/deep learning infrastructure, from the largest data sets, will survive.

ML/deep learning: Not your average compute workload

ML – and especially its subset, deep learning – forms the basis of the AI infrastructure. Setting aside the intricate mathematics, at their simplest, ML algorithms realize an objective (for example, the successful recognition of a handwritten symbol) by making repeated “guesses” at the answer and learning from each inaccurate guess by checking it against the expected answer, until the guess matches the expected answer with very high accuracy. The structure that is adjusted through this feedback is called a neural network, and training neural networks is the process of machine/deep learning. Figure 1 shows an example of a relatively simple neural network used for handwriting recognition [6].
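As a rough illustration of that guess-and-correct loop, the Python sketch below trains a single artificial neuron (a perceptron) to reproduce the logical OR function. It is a toy under stated assumptions, not the handwriting network of Figure 1: the function names, the learning rate, and the OR data set are all chosen here purely for illustration.

```python
# Toy guess/check/adjust loop: one artificial neuron learns logical OR.
# All names and values here are illustrative assumptions.

def predict(weights, bias, inputs):
    """Make a 'guess': 1 if the weighted sum is positive, otherwise 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, learning_rate=0.1, max_epochs=100):
    """Repeat guess -> compare with expected answer -> adjust weights."""
    weights, bias = [0.0, 0.0], 0.0
    for epoch in range(1, max_epochs + 1):
        wrong_guesses = 0
        for inputs, expected in samples:
            error = expected - predict(weights, bias, inputs)
            if error != 0:
                # Learn from the inaccurate guess: nudge each weight
                # in the direction that reduces the error.
                wrong_guesses += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if wrong_guesses == 0:
            # Every guess matched the expected answer; training is done.
            return weights, bias, epoch
    return weights, bias, max_epochs


OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias, epochs = train(OR_SAMPLES)
```

Once training stops, predict(weights, bias, (0, 1)) returns 1 and predict(weights, bias, (0, 0)) returns 0. Real deep networks stack many such units in layers and adjust millions of weights, which is where the compute cost discussed below comes from.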


Figure 1: Example of a neural network for handwriting recognition [6]

Deep neural networks use many more layers in order to get accurate answers to complex objectives. The deep learning process uses ever-increasing training data sets to train deep neural networks. The more complex the objective, the more layers there are in the neural network, and the more difficult the neural network is to train. For example, Baidu’s Chinese speech recognition models use ~12,000 hours of speech training data and require tens of exaflops of calculations, which take as long as six weeks to complete [7]. Compute requirements are exponentially higher for image recognition workloads.

Traditional central processing units (CPUs) are designed for general-purpose control and data flow and are not efficient for computation-intensive AI/ML workloads. And with Moore’s Law failing [8, 9], vendors cannot deliver CPUs that are fast enough or large enough to handle these workloads.

Distributed ML: A cure for Moore’s Law

A modern server designed to handle AI/ML workloads follows a decentralized architecture: a general-purpose CPU surrounded by multiple specialized accelerators that handle tasks from ML to encryption, security, storage, and networking. The accelerators may be any combination of graphics processing units (GPUs), customized field-programmable gate arrays (FPGAs), or custom-built application-specific integrated circuits (ASICs). The Open Compute Project (OCP) [10] recently released a common form-factor specification for OCP accelerator modules (OAM) [11] to simplify server design and enable a modular server architecture.

A decentralized architecture delivers raw exaflops by using multiple optimized data processors. In order to enable larger-scale ML, however, the processing units need to be adequately connected with each other. A presentation at the 2018 Symposium on Principles of Distributed Computing demonstrated a nearly 10× speed improvement in ResNet-152 image classification using TensorFlow [12].
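One common way to spread training across connected processors is data parallelism: each worker computes a gradient on its own slice of the data, and the workers then exchange and average those gradients before updating the model. The Python sketch below simulates that pattern in a single process. It is an assumption-laden illustration, not the method used in [12]: the four "workers" are plain lists, all_reduce_mean is a hypothetical stand-in for the network exchange, and the model is a one-parameter fit of y = w * x.

```python
# Simulated data-parallel training in one process (illustrative only).
# Model: y = w * x, loss per sample: 0.5 * (w * x - y) ** 2.

def local_gradient(w, shard):
    """Average loss gradient over one worker's shard of the data."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Hypothetical stand-in for the inter-node exchange: in a real
    system every worker would send its gradient over the interconnect
    and receive the mean back."""
    return sum(values) / len(values)


def train(shards, learning_rate=0.05, steps=200):
    w = 0.0
    for _ in range(steps):
        # Compute phase: would run concurrently, one shard per worker.
        grads = [local_gradient(w, shard) for shard in shards]
        # Communication phase: this is where connectivity matters.
        w -= learning_rate * all_reduce_mean(grads)
    return w


# Synthetic data generated from y = 3x, split across 4 assumed workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
w = train(shards)
```

With equally sized shards, the averaged gradient equals the gradient over the full data set, so w converges toward 3.0 just as it would on a single processor. The difference is that every step now includes an exchange between workers, which is why the quality of the interconnect becomes a limiting factor.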

The ResNet-152 image classification example shown in Figure 2 also highlights the importance of connectivity in modern, highly distributed ML systems, where as much as 90% of the time may be spent in node communication.
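To see how communication can come to dominate, consider a deliberately simplified model (an assumption made here, not taken from [12]): the compute portion of each training step divides evenly across workers, while the communication cost per step stays fixed. The function names and the timing values below are hypothetical.

```python
# Simplified scaling model (illustrative assumptions, not measured data):
# compute time divides evenly across workers; communication time is fixed.

def communication_fraction(compute_seconds, comm_seconds, workers):
    """Share of each distributed training step spent communicating."""
    step_time = compute_seconds / workers + comm_seconds
    return comm_seconds / step_time


def speedup(compute_seconds, comm_seconds, workers):
    """Step time on one worker (no communication) divided by the
    step time when the work is spread over `workers` nodes."""
    step_time = compute_seconds / workers + comm_seconds
    return compute_seconds / step_time


# Hypothetical step: 1.0 s of compute, 0.1 s of communication.
for workers in (2, 8, 32, 90):
    frac = communication_fraction(1.0, 0.1, workers)
    gain = speedup(1.0, 0.1, workers)
    print(f"{workers:3d} workers: {frac:.0%} communicating, {gain:.1f}x faster")
```

With these made-up numbers, 90 workers spend about 90% of each step communicating and run only about 9 times faster than one worker. In this model, adding processors beyond that point buys very little unless the communication time itself is reduced.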

Figure 2: Benefits of distributed ML – 19 days to 2.4 days [12]
