Benchmark scores highlight broad range of machine-learning inference performance


MLPerf has released the first set of benchmark scores for its inference benchmark, following scores from the training benchmark which were released earlier this year.

Compared to the training round, which currently has 63 entries from 5 companies, many more companies submitted inference results based on MobileNet, ResNet, Yolo and other neural network architectures. In total there were more than 500 scores verified from 14 organisations. This included figures from several startups, while some high-profile startups were still noticeably absent.

In the closed division, whose strict conditions enable direct comparison of the systems, the results show a difference of five orders of magnitude in performance and span three orders of magnitude in estimated power consumption. In the open division, submissions can use a range of models, including low-precision implementations.

Nvidia claimed the number one position for commercially available devices across all the categories in the closed division. Other leaders included Habana Labs, Google and Intel in the data centre categories, while Nvidia competed with Intel and Qualcomm in the edge categories.


Nvidia’s EGX platform for data centre inference (Image: Nvidia)

“Nvidia is the only company that has the production silicon, software, programmability, and talent to publish benchmarks across the spectrum of MLPerf, and win in almost every category,” said Karl Freund, Analyst, Moor Insights and Strategy. “The programmability of GPUs uniquely positions them well for future MLPerf releases… I think this demonstrates the breadth of [Nvidia’s] strength, and also the niche nature of the challengers. But many of those challengers will mature over time, and so Nvidia will need to continue to innovate in both hardware and software.”

Nvidia published graphs showing its interpretation of the results, placing itself in the number one position across all four scenarios in the closed division for commercially available devices.

These scenarios represent different use cases. The offline and server scenarios are for inference in the data centre. The offline scenario, which might represent photo-tagging for a large batch of pictures, measures pure throughput. The server scenario represents multiple users submitting requests at unpredictable times, and it measures how many queries the system can handle within a latency limit. The edge scenarios are single stream, which times inference for a single image, such as in a mobile phone app, and multi-stream, which measures how many streams of images can be inferenced simultaneously, as in multi-camera systems.

Companies can submit results for selected machine learning models performing image classification, object detection and language translation in each of the four scenarios.

Data centre results

“Looking at the data center results, Nvidia topped on all five benchmarks for both the server and the offline categories,” said Paresh Kharya, Director of Product Management for Accelerated Computing, Nvidia. “Our Turing GPUs outperformed everyone else amongst the commercially available solutions.”

Kharya highlighted the fact that Nvidia was the only company to submit results across all five of the benchmark models for the data centre categories, and that for the server category (which is the more difficult scenario), Nvidia’s performance increased relative to its competitors.


Selected data centre benchmark results from closed division, leaders in the commercially available device category. Results are shown relative to Nvidia scores on a per-accelerator basis. X represents “no result submitted” (Image: Nvidia)

Nvidia’s closest competitor in the data centre sector is Israeli startup Habana Labs with its Goya inference chip.

“Habana stands as the only challenger with high performance silicon in full production, and should do well when the next MLPerf suite hopefully includes power consumption data,” said analyst Karl Freund.

In an interview with EETimes, Habana Labs pointed out that the benchmark scores are purely based on performance – power consumption is not a metric, nor is practicality (such as considering whether a solution is passively cooled or water cooled), nor is cost.


Habana Labs PCIe card featuring its Goya inference chip (Image: Habana Labs)

Habana also used the open division to show off its capability for low latency, restricting the latency further than for the closed division, and submitting results for the multi-stream scenario.

Edge compute results

For the edge benchmarks, Nvidia won all four of the categories that had submitters in the closed division for commercially available solutions. Qualcomm’s Snapdragon 855 SoC and Intel’s Xeon CPUs trailed Nvidia in the single-stream category, and neither Qualcomm nor Intel submitted results for the more difficult multi-stream scenario.


Selected edge benchmark results from closed division, leaders in the commercially available device category. Results are shown relative to Nvidia scores on a per-accelerator basis. X represents “no result submitted” (Image: Nvidia)

Results for “preview” systems (those that are not yet commercially available) pitted Alibaba T-Head’s Hanguang chip against Intel’s Nervana NNP-I, the Hailo-8, and a reference design from Centaur Technology. Meanwhile, the R&D category featured a stealthy Korean startup, Furiosa AI, about which very little is known.

The recent inference scores, as well as the earlier training scores, are available on the MLPerf site.

>> An earlier version of this article was originally published on our sister site, EE Times.

Editor’s note: Originally published on EE Times, the article below by Sally Ward-Foxton offers further explanation of the information contained in the MLPerf inference scores available on the MLPerf site.

Understanding MLPerf Benchmark Scores

by Sally Ward-Foxton

What to look for and where to start.

If you follow the AI accelerator industry, you may have seen that MLPerf released a set of scores for its inference benchmark yesterday. These scores are reported as a multi-page spreadsheet with figures ranging over five orders of magnitude. Most systems have scores in only some boxes, and systems which seem similar may have wildly different scores. To make things worse, higher or lower numbers can be better, depending on which category you’re looking at.

Habana Labs research scientist Itay Hubara was kind enough to explain the meaning of the different categories, divisions, models and scenarios in the MLPerf spreadsheet for me. Here’s what I learned.


The MLPerf scores spreadsheet is not straightforward to understand (Image: MLPerf)



Categories

Available means the system is currently on the market and available to purchase. The software stack has to be fully ready and submitters have to give the community the ability to reproduce their results. This means any code which isn’t in the company’s SDK has to go on to MLPerf’s GitHub.

Preview means the submitter will have this system available in the next round of MLPerf inference scores (expected early next summer). Submitters to this category are not required to hand over all their software.

Research, Development, Other means the system is still a prototype or is not intended for production, and submitters are not required to share any software.


Divisions

The Closed Division is intended to allow direct, apples-to-apples comparison between systems, and the criteria companies have to stick to are strict, including a standard set of pre-trained model weights that have to be used.

In the Open Division, which Hubara describes as the “wild west,” submitters don’t need to follow most of the rules. However, they have to disclose what they have changed. This might be something like retraining the model or fine-tuning it.

Companies use this division to show off by letting their algorithm engineers loose. For example, Habana Labs showed open division scores which reduce the latency to a quarter of the latency for the closed division scores, to show off the capabilities of its Goya chip.


Models

MobileNet-v1 and ResNet-50 v1.5 are for image classification with the ImageNet dataset used for inference. MobileNet is a lightweight network intended for mobile phones, while ResNet-50 is more heavyweight by comparison and is used by bigger accelerators.

SSD with MobileNet-v1 and SSD with ResNet-34 are for object detection. SSD refers to single shot detector, an algorithm for detecting individual objects for classification in a picture, but it has to work in partnership with a classification algorithm such as MobileNet or ResNet.

The MobileNet version is again a lighter weight model run on lower resolution pictures (300 x 300 or 0.09 Mpix). The ResNet-34 version inferences higher resolution images (1200 x 1200 or 1.44 Mpix).

These models use the Common Objects in Context (COCO) dataset for inference.
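The megapixel figures quoted for the two SSD models follow directly from the image dimensions. As a quick check, here is a minimal Python sketch (the function name is illustrative, not something MLPerf defines):

```python
def megapixels(width: int, height: int) -> float:
    """Convert image dimensions in pixels to megapixels (millions of pixels)."""
    return width * height / 1_000_000


# Input resolutions quoted in the article for the two object detection models.
ssd_mobilenet_mpix = megapixels(300, 300)     # about 0.09 Mpix
ssd_resnet34_mpix = megapixels(1200, 1200)    # about 1.44 Mpix

# Ratio of raw pixel counts between the two input sizes.
pixel_ratio = (1200 * 1200) / (300 * 300)

print(f"SSD-MobileNet input: {ssd_mobilenet_mpix:.2f} Mpix")
print(f"SSD-ResNet-34 input: {ssd_resnet34_mpix:.2f} Mpix")
print(f"Pixels per image ratio: {pixel_ratio:.0f}x")
```

In other words, each image fed to the ResNet-34 version contains 16 times as many pixels as an image fed to the MobileNet version, before any difference in model size is considered.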

GNMT is the only benchmark that isn’t based on convolution or image processing. It’s a recurrent neural network for language translation (in this case, German to English).


Scenarios

There are four different scenarios, two for inference at the edge and two for inference in the data centre.

Single Stream simply measures the time taken to inference one image (batch size of 1), measured in milliseconds. In this category, a lower score is better. This scenario might correspond to a mobile phone that is performing inference on one image at a time.

Multi-Stream measures how many streams of images can be handled at once (batch size >1) with a latency of between 50 and 100 milliseconds, depending on the model. A higher number in this category is better. Systems that do well here might end up in autonomous vehicles which use multiple cameras pointing in different directions, or in surveillance camera systems.

In the Server scenario, multiple users send queries to the system at random times. The metric is how many queries the system can support within a certain latency when the streams are not constant, as they are in the multi-stream scenario. It’s harder because batch size must be dynamic. A higher number is better.

The Offline scenario might be batch processing of images in a photo album, where the data can be processed in any order. It’s not latency constrained. Instead, this scenario measures throughput in images per second. A higher number is better.
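Because the direction of “better” changes between scenarios, it can help to keep the four of them in a small lookup table when reading the spreadsheet. The following is a minimal Python sketch of that idea. The class, the dictionary keys and the function name are illustrative assumptions and are not part of MLPerf, and the descriptions simply paraphrase the scenarios as explained above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioMetric:
    setting: str            # "edge" or "data centre", as grouped in the article
    measures: str           # what the reported number represents
    higher_is_better: bool  # direction in which a score improves


# Illustrative summary of the four scenarios described in the article.
SCENARIOS = {
    "single_stream": ScenarioMetric(
        "edge", "time to inference one image, in milliseconds", False),
    "multi_stream": ScenarioMetric(
        "edge", "number of image streams handled within the latency limit", True),
    "server": ScenarioMetric(
        "data centre", "queries supported within the latency limit", True),
    "offline": ScenarioMetric(
        "data centre", "throughput in images per second", True),
}


def better_score(scenario: str, a: float, b: float) -> float:
    """Return whichever of two scores is better for the given scenario."""
    metric = SCENARIOS[scenario]
    return max(a, b) if metric.higher_is_better else min(a, b)


# Made-up numbers, purely to show the direction of comparison.
print(better_score("single_stream", 2.0, 5.0))  # lower latency wins
print(better_score("offline", 2.0, 5.0))        # higher throughput wins
```

The only design choice here is to store the direction explicitly, so that a comparison never has to assume that a bigger number is the better one.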

Number of accelerators

The benchmarks compare systems, not chips. Some systems might have one host chip and one accelerator chip, while the largest had 128 Google TPU accelerator chips. The scores are not normalised per accelerator, since the host also plays a part, but they are roughly linear with the number of accelerators.
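Since the published scores are per system, a reader who wants a rough per-accelerator comparison has to do the division themselves. The sketch below shows that arithmetic in Python under two stated assumptions: scaling is treated as linear, which the article says is only roughly true, and the host's contribution is ignored. The function name and the example numbers are invented for illustration and are not real MLPerf results.

```python
def per_accelerator(score: float, n_accelerators: int) -> float:
    """Rough per-accelerator figure for a throughput-style score.

    Assumes roughly linear scaling and ignores the host's contribution,
    so the result is an estimate, not an official MLPerf number.
    Not meaningful for single-stream latency, where lower is better.
    """
    if n_accelerators < 1:
        raise ValueError("n_accelerators must be at least 1")
    return score / n_accelerators


# Hypothetical offline scores (images per second) for two made-up systems.
system_a = per_accelerator(1000.0, 1)   # one accelerator
system_b = per_accelerator(6400.0, 8)   # eight accelerators

print(f"System A: {system_a:.0f} images/s per accelerator")
print(f"System B: {system_b:.0f} images/s per accelerator")
```

In this invented case the larger system has the higher headline score but the lower figure per accelerator, which is why the accelerator count is worth checking before comparing two rows.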

Why are some scores blank?

There is no requirement to submit results to every scenario or every model, or even groups of models. Devices intended for edge platforms might choose to submit only for the single-stream and multi-stream scenarios, while data centre platforms might choose to submit only for the server and offline scenarios. Clearly, each company has chosen to submit scores which it thinks will show its system in the best light.

There are also other factors at play. For example, Hubara explained that one of Habana’s scores is blank because the company missed the submission deadline for this round.

Also, there were fewer submissions for the GNMT translation model, which is now widely seen as out of date, with many companies preferring to spend time implementing a newer algorithm such as BERT.

Caveat emptor

Overall, the scores measure pure performance, but selecting one system for a practical application of course requires consideration of many other factors.

For example, there’s no power measurement in this set of scores (this is rumoured to be coming in the next version of the benchmark).

Cost is also not indicated. Obviously if one system has one accelerator chip and one has 128, there will be a price difference. The spreadsheet also lists the host CPU used for each system, which can add significant cost. Some may also require expensive water cooling.

The Form Factor categories (mobile/handheld, desktop/workstation, server, edge/embedded) are indications given by the system manufacturer. They are not strictly part of the benchmark as there are no criteria for each category.

Clicking on the details link on the right-hand side of the spreadsheet for each system should take you to further details about the system’s hardware and software, which are worth looking at. Some of these fields are mandatory and some are not, but they may shed light on system requirements such as cooling.

