NVIDIA Touts AI Inference Benchmark Wins

November 7, 2019

After introducing the first industry-standard inference benchmarks in June 2019, the MLPerf consortium today released 595 inference benchmark results from 14 organizations. These benchmarks measure how quickly a trained neural network can process new data for a wide range of applications (autonomous driving, natural language processing, and many more) on a variety of form factors (IoT devices, smartphones, PCs, servers, and a variety of cloud solutions). The benchmark results are available on the MLPerf website.

“All released results have been validated by the audits we conducted,” stated Guenther Schmuelling, MLPerf Inference Results Chair from Microsoft. “We were very impressed with the quality of the results. This is an amazing number of submissions in such a short time since we released these benchmarks this summer. It shows that inference is a growing and important application area, and we expect many more submissions in the months ahead.”

“Companies are embracing these benchmark tests to provide their customers with an objective way to measure and compare the performance of their machine learning solutions,” stated Carole-Jean Wu, Inference Co-chair from Facebook. “There are many cost-performance tradeoffs involved in inference applications. These results will be invaluable for companies evaluating different solutions.”

Of the 595 benchmark results released today, 166 are in the Closed Division, which is intended for direct comparison of systems. These results span 30 different systems, ranging from embedded devices and smartphones to large-scale data center systems, and show a four-order-of-magnitude difference in performance and a three-order-of-magnitude range in estimated power consumption. The remaining 429 results are in the Open Division and show a more diverse range of models, including low-precision implementations and alternative models.

Companies in China, Israel, Korea, the United Kingdom, and the United States submitted benchmark results. These companies include: Alibaba, Centaur Technology, Dell EMC, dividiti, FuriosaAI, Google, Habana Labs, Hailo, Inspur, Intel, NVIDIA, Polytechnic University of Milan, Qualcomm, and Tencent.

“As an all-volunteer open-source organization, we want to encourage participation from anyone developing an inference product, even in the research and development stage,” stated Christine Cheng, Inference Co-chair. “You are welcome to join our forum, join working groups, attend meetings, and raise any issues you find.”

According to David Kanter, Inference and Power Measurement Co-chair, “We are very excited about our roadmap: future versions of MLPerf will include additional benchmarks such as speech-to-text and recommendation, and additional metrics such as power consumption.”

“MLPerf is also developing a smartphone app that runs inference benchmarks for use with future versions. We are actively soliciting help from all our members and the broader community to make MLPerf better,” stated Vijay Janapa Reddi, Associate Professor, Harvard University, and MLPerf Inference Co-chair.

MLPerf’s five inference benchmarks — applied across a range of form factors and four inference scenarios — cover such established AI applications as image classification, object detection, and translation.

NVIDIA topped all five benchmarks for both data center-focused scenarios (server and offline), with Turing GPUs providing the highest performance per processor among commercially available entries. Xavier provided the highest performance among commercially available edge and mobile SoCs under both edge-focused scenarios (single-stream and multi-stream).

“AI is at a tipping point as it moves swiftly from research to large-scale deployment for real applications,” said Ian Buck, general manager and vice president of Accelerated Computing at NVIDIA. “AI inference is a tremendous computational challenge. Combining the industry’s most advanced programmable accelerator, the CUDA-X suite of AI algorithms and our deep expertise in AI computing, NVIDIA can help data centers deploy their large and growing body of complex AI models.”

Highlighting the programmability and performance of its computing platform across diverse AI workloads, NVIDIA was the only AI platform company to submit results across all five MLPerf benchmarks. In July, NVIDIA won multiple MLPerf 0.6 benchmark results for AI training, setting eight records in training performance.

NVIDIA GPUs accelerate large-scale inference workloads in the world’s largest cloud infrastructures, including Alibaba Cloud, AWS, Google Cloud Platform, Microsoft Azure and Tencent. AI is now moving to the edge at the point of action and data creation. World-leading businesses and organizations, including Walmart and Procter & Gamble, are using NVIDIA’s EGX edge computing platform and AI inference capabilities to run sophisticated AI workloads at the edge.

All of NVIDIA’s MLPerf results were achieved using NVIDIA TensorRT™ 6 high-performance deep learning inference software that optimizes and deploys AI applications easily in production from the data center to the edge. New TensorRT optimizations are also available as open source in the GitHub repository.

New Jetson Xavier NX
Expanding its inference platform, NVIDIA today introduced Jetson Xavier NX, the world’s smallest, most powerful AI supercomputer for robotic and embedded computing devices at the edge. Jetson Xavier NX is built around a low-power version of the Xavier SoC used in the MLPerf Inference 0.5 benchmarks.

Google's TPU Technical Program Manager Pankaj Kanwar noted:

MLPerf is the industry standard for measuring ML performance, and results from the new MLPerf Inference benchmarks are now available. These benchmarks represent performance across a variety of machine learning prediction scenarios. Our submission demonstrates that Google’s Cloud TPU platform addresses the critical needs of machine learning customers: developer velocity, scalability, and elasticity. 

MLPerf Inference v0.5 defines three datacenter-class benchmarks: ResNet-50 v1.5 for image classification, SSD-ResNet-34 for object detection, and GNMT for language translation. Google submitted results for all three of these benchmarks using Cloud TPU v3 devices and demonstrated near-linear scalability all the way up to a record 1 million images processed per second on ResNet-50 v1.5 using 32 Cloud TPU v3 devices.

Figure: Peak demonstrated scaling for select MLPerf v0.5 Closed offline submissions, normalized to the highest entry.

Cloud TPUs are publicly available to Google Cloud customers in beta. These same TPUs are also used throughout numerous large-scale Google products, including Google Search.

Developer velocity: Serve what you train 
The Cloud TPU architecture is designed from the ground up to seamlessly move ML workloads from training to serving. Cloud TPUs offer bfloat16 floating-point numerics, which allow for greater accuracy compared to integer numerics. Training and serving on the same hardware platform helps prevent accuracy losses at inference time and does not require quantization, recalibration, or retraining. In contrast, serving with low-precision (e.g., 8-bit) numerics can create major complexities that require significant developer investment to overcome. For example, quantizing a model can add weeks of effort and risk to a project, and a quantized model cannot always achieve the same accuracy as the original. Inference hardware is inexpensive relative to ML developer effort, so increasing development velocity by serving ML models in higher precision can save money and improve application quality.
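The quantization overhead described above can be illustrated with a minimal, hypothetical sketch (illustrative code, not Google's or any vendor's actual tooling): symmetric post-training quantization maps float weights to int8 with a per-tensor scale, and the round trip introduces a small but nonzero reconstruction error — exactly the kind of step that serving in bfloat16 avoids.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric post-training quantization of float weights to int8."""
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to the int8 range
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1000).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# The rounding error is bounded by scale/2 per weight, but it is nonzero,
# which is why quantized models may need recalibration or retraining.
err = np.abs(w - w_hat).max()
print(f"max round-trip error: {err:.6f}")
```

Real quantization pipelines add further steps (per-channel scales, calibration data, quantization-aware fine-tuning), which is the developer effort the passage above refers to.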

For example, using the TPU v3 platform for both training and inference allows Google Translate to push new models to production within hours of model validation. This enables the team to deploy new advances from machine translation research into production environments faster by eliminating the engineering time required to develop custom inference graphs. This same technology is available to Google Cloud customers to increase the productivity of their machine learning teams, accelerating the development of popular use cases such as call center solutions, document classification, industrial inspection, and visual product search.

Inference at scale
Machine learning inference is highly parallel, with no dependency between one input and the next. MLPerf Inference v0.5 defines two different datacenter inference scenarios: “offline” (e.g. processing a large batch of data overnight) and “online” (e.g. responding to user queries in real-time). Our offline submissions leverage large-scale parallelism to demonstrate high scalability across all three datacenter-class benchmarks. In the case of ResNet-50 v1.5, we show near linear scalability going from 1 to 32 Cloud TPU devices. Google Cloud customers can use these MLPerf results to assess their own needs for inference and choose the Cloud TPU hardware configuration that fits their inference demand appropriately.
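The independence between inputs is what makes the offline scenario scale so well. A minimal sketch (toy code with a stand-in "model", not Google's serving stack) of the two scenarios: the offline path fans a large batch out across workers for throughput, while the online path answers one query at a time for latency.

```python
from concurrent.futures import ThreadPoolExecutor

def predict(x):
    """Toy stand-in for a trained model: classifies a scalar input."""
    return 1 if x > 0 else 0

def offline_inference(inputs, workers=4):
    """Offline scenario: process a large batch; since no input depends on
    another, the batch splits cleanly across parallel workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(predict, inputs))

def online_inference(x):
    """Online scenario: respond to a single user query as it arrives."""
    return predict(x)

batch = [-2.0, 0.5, 3.1, -0.1]
print(offline_inference(batch))  # → [0, 1, 1, 0]
print(online_inference(0.5))     # → 1
```

The same independence holds whether the "workers" are threads on one machine or 32 Cloud TPU v3 devices, which is why the offline submissions scale nearly linearly.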

Figure: Google Cloud TPU v3 speed-ups as demonstrated by Google’s MLPerf Inference 0.5 Closed submission. Results in this figure are drawn from the offline scenario.

Cloud elasticity: On-demand provisioning
Enterprise inference workloads have time-varying levels of demand for accelerator resources. Google Cloud offers the elasticity needed to adapt to fluctuating demand by provisioning and de-provisioning resources automatically while minimizing cost. Whether customers serve intermittent queries for internal teams, thousands of globally distributed queries every second, or run a giant batch inference job every night, Google Cloud allows them to have just the right amount of hardware to match their demand, minimizing waste due to underutilization of resources.

For example, the Cloud TPU ResNet-50 v1.5 offline submission to MLPerf Inference v0.5 Closed demonstrates that just 32 Cloud TPU v3 devices can collectively process more than one million images per second. To put that scale and speed in perspective: if all 7.7 billion people on Earth uploaded a single photo, this entire global photo collection could be classified in under 2.5 hours, for less than $600. With this performance, elasticity, and affordability, Google Cloud is uniquely positioned to serve the machine learning needs of enterprise customers.
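The arithmetic behind that claim can be checked with a quick back-of-the-envelope calculation (figures taken from the text above):

```python
# Back-of-the-envelope check of the photo-classification claim.
images = 7.7e9          # one photo per person on Earth
throughput = 1.0e6      # images per second across 32 Cloud TPU v3 devices
seconds = images / throughput
hours = seconds / 3600
print(f"{hours:.2f} hours")  # ≈ 2.14 hours, i.e. under 2.5 hours
```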
