Emerging machine learning and AI use cases promise to create significant value for industries through faster information processing and more accurate decision-making. But machine learning models are compute-intensive, and many AI analysis scenarios demand high-frequency, real-time processing, which has led enterprises to lean on performance guidance based on the metric Trillions of Operations Per Second (TOPS). TOPS answers the question “how many mathematical operations can an accelerator deliver in one second?” and is used to compare accelerators and identify the best one for a given inference task.
While TOPS is an ‘easy’ metric to calculate, it often falls short as a reliable performance indicator for real-world workloads. Because it is derived only from the number of multipliers and adders in an accelerator, the metric fails to account for the other computational hardware structures that process neural network models. As neural network models process data faster, how can businesses scale with faster and more reliable decision-making, particularly at the edge?
In this post, we review TOPS, its shortcomings in capturing latency, and how it differs from real-world performance. We then offer an alternative approach: calculating performance through benchmarking, which offers a more reliable way to account for computational hardware structures.
Reality of TOPS as a Performance Measure
TOPS is a simplifying metric: it tells you how many computing operations an AI accelerator can handle in one second at 100% utilization. In essence, it counts how many basic math operations an accelerator can complete in a very short period of time.
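To make the definition concrete, here is a minimal sketch of how a peak TOPS figure is commonly derived. It assumes the usual convention that each multiply-accumulate (MAC) unit performs two operations per clock cycle (one multiply, one add); the function name `peak_tops` and the accelerator figures are hypothetical and chosen only for illustration.

```python
def peak_tops(num_macs: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Theoretical peak in TOPS, assuming every MAC unit is busy on every cycle.

    ops_per_mac=2 follows the common convention of counting one multiply
    and one add per MAC per cycle (an assumption, not a vendor-specific rule).
    """
    return num_macs * ops_per_mac * clock_hz / 1e12


# Hypothetical accelerator: 4096 MAC units clocked at 1 GHz.
example_peak = peak_tops(num_macs=4096, clock_hz=1.0e9)
print(f"{example_peak:.3f} TOPS")  # prints "8.192 TOPS"
```

Note that nothing in this calculation depends on the model being run, the memory system, or the software stack, which is why the figure only describes the 100% utilization case.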
For example, if one AI accelerator offers 5 TOPS and another offers 15 TOPS, it is inferred that the latter is three times faster than the former. But much like megahertz and gigahertz for CPU speed, TOPS has lost relevance in determining overall compute performance. As interest in AI applications has grown, the latest AI accelerators process data faster and perform work more complex than simple arithmetic.
TOPS, however, rarely captures accurately the role an AI processor plays in an overall hardware device. Today, the AI processor in a camera, edge server, or computer is typically one of the key components determining both compute horsepower and energy efficiency. In fact, TOPS fails to account for real-world workloads. Real-world performance is often significantly lower than the TOPS value because of factors such as idle compute units waiting for data from memory, synchronization overhead between different parts of the accelerator, and control overhead. An accelerator’s architecture and software tools play an important part in determining how well it utilizes its computation resources while executing a workload, so depending on the architecture and workload characteristics, an accelerator might achieve only 5-10% of its theoretical TOPS value.
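The gap between rated and delivered throughput can be estimated from a measured latency. The sketch below assumes you know roughly how many operations one inference of your model requires; the helper names and all numbers are hypothetical round figures, not measurements of any real device.

```python
def achieved_tops(ops_per_inference: float, latency_s: float) -> float:
    """Throughput actually delivered, in TOPS, for one inference."""
    return ops_per_inference / latency_s / 1e12


def utilization(ops_per_inference: float, latency_s: float, rated_tops: float) -> float:
    """Fraction of the rated (theoretical) TOPS that the workload achieves."""
    return achieved_tops(ops_per_inference, latency_s) / rated_tops


# Hypothetical case: a model needing 8e9 operations per inference runs in
# 10 ms on an accelerator rated at 10 TOPS.
u = utilization(ops_per_inference=8e9, latency_s=0.010, rated_tops=10.0)
print(f"utilization: {u:.1%}")  # prints "utilization: 8.0%"
```

In this illustrative case the accelerator delivers 0.8 TOPS against a 10 TOPS rating, which falls inside the 5-10% range described above.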
Higher TOPS Does Not Equal Higher Performance
While a higher TOPS value might seem to signal higher performance, the reality can be the opposite. A higher TOPS rating generally means a larger accelerator with more compute elements, plus more memory blocks to feed data to those compute units, which results in higher cost and power consumption. An efficient accelerator, on the other hand, delivers higher performance using fewer compute resources and therefore has a lower TOPS rating. Ultimately, a desirable AI accelerator is one that provides high performance with a low TOPS rating.
TOPS Doesn’t Include All Computational Types
The TOPS metric counts only an accelerator’s multipliers and adders, which often makes it an inaccurate performance metric because an accelerator can have other computation resources beyond those. For example, Kinara’s architecture employs reduction trees instead of an adder array, resulting in significantly lower energy consumption. Because the TOPS calculation does not capture the reduction trees’ computation capability, the resulting figure is less than accurate. Standard neural networks such as ResNet50, MobileNet V1, and YOLO_v3 are more useful when comparing different accelerators, as they can also serve as a proxy for ‘guesstimating’ whether a given accelerator can meet the demands of a developer’s own workloads.
Inference Latency is the Metric to Evaluate AI Accelerator Performance
For businesses investing in Edge AI, calculating performance through benchmarking offers a more reliable way than TOPS to account for computational hardware structures. With most real-world applications requiring very fast inference, the best way to measure performance is to run a specific workload, typically ResNet-50, EfficientDet, a Transformer, or a custom model, to understand an accelerator’s efficiency. Running networks of different types, sizes, topologies, and input resolutions under real-time processing conditions yields the inference latency metric: the time the accelerator takes to complete one inference of a specific AI model.
As AI workloads and their supporting computing architectures evolve, accurate performance measurements make their behavior more predictable, which can lead developers to better decisions. Inference latency helps developers predict how modern AI workloads will perform, even as these workloads fragment and new architectures emerge, adding further unpredictability. Ultimately, application benchmarking offers a more credible and reliable alternative to TOPS and supports more efficient evaluation of AI accelerators.
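One way to collect an inference latency figure is a small timing harness like the sketch below. It is a generic illustration, not any vendor’s benchmarking tool: `measure_latency` and `fake_inference` are hypothetical names, and the stand-in workload should be replaced with a call that runs one inference on the target accelerator and blocks until the result is available.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(run_inference: Callable[[], object],
                    warmup: int = 10,
                    runs: int = 100) -> Dict[str, float]:
    """Time single inferences and summarize the latency in milliseconds.

    run_inference must be synchronous: it should return only once the
    result of one inference is available.
    """
    # Warm-up iterations let caches, clocks, and lazy initialization settle.
    for _ in range(warmup):
        run_inference()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    samples_ms.sort()
    # Simple nearest-rank estimate of the 99th percentile.
    p99_index = min(len(samples_ms) - 1, int(0.99 * len(samples_ms)))
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p99_ms": samples_ms[p99_index],
    }


def fake_inference() -> None:
    """Stand-in workload (assumption); replace with a real model invocation."""
    sum(i * i for i in range(10_000))


stats = measure_latency(fake_inference, warmup=5, runs=50)
print(stats)
```

Reporting the median and a tail percentile alongside the mean is a deliberate choice here: real-time applications are usually constrained by the slowest inferences, not the average one.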
Rehan Hameed is CTO and Co-Founder at Kinara (formerly Deep Vision). Rehan received his Ph.D. in Electrical Engineering at Stanford, where he focused his research on computer vision and low-power processors.