Curtis Anderson and David Kanter presented for MLCommons at AI Field Day 6
MLCommons and MLPerf – An Introduction
Watch on YouTube
Watch on Vimeo
MLCommons is a non-profit industry consortium dedicated to improving AI for everyone by focusing on accuracy, safety, speed, and power efficiency. The organization boasts over 125 members across six continents and leverages community participation to achieve its goals. A key project is MLPerf, an open industry standard benchmark suite for measuring the performance and efficiency of AI systems, providing a common framework for comparison and progress tracking. This transparency fosters collaboration among researchers, vendors, and customers, driving innovation and preventing inflated claims.
The presentation highlights the crucial relationship between big data, big models, and big compute in achieving AI breakthroughs. A key chart illustrates how AI model performance significantly improves with increased data, but eventually plateaus. This necessitates larger models and more powerful computing resources, leading to an insatiable demand for compute power. MLPerf benchmarks help navigate this landscape by providing a standardized method of measuring performance across various factors including hardware, algorithms, software optimization, and scale, ensuring that improvements are verifiable and reproducible.
MLPerf offers a range of benchmarks covering diverse AI applications, including training, inference (data center, edge, mobile, tiny, and automotive), storage, and client systems. The benchmarks are designed to be representative of real-world use cases and are regularly updated to reflect technological advancements and evolving industry practices. While acknowledging the limitations of any benchmark, the presenter emphasizes MLPerf’s commitment to transparency and accountability through open-source results, peer review, and audits, ensuring that reported results are not flukes but can be validated and replicated. This fosters a collaborative, data-driven effort to build more efficient and impactful AI solutions.
Personnel: David Kanter
MLCommons MLPerf Client Overview
Watch on YouTube
Watch on Vimeo
MLCommons presented MLPerf Client, a new benchmark designed to measure the performance of PC-class systems, including laptops and desktops, on large language model (LLM) tasks. Released in December 2024, it is an installable, open-source application (available on GitHub) that lets users easily test their systems, with early access offered to gather feedback and drive improvement. The initial release focuses on a single model, Llama 2 7B, using the Open Orca dataset, and includes four tests simulating common LLM usage scenarios such as content generation and summarization. The benchmark prioritizes response latency as its primary metric, mirroring real-world user experience.
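To make the latency focus concrete, here is a minimal sketch of how a single-stream harness could record time-to-first-token and generation throughput for each query. The `generate_stream` callable and the prompt set are hypothetical stand-ins, not the actual MLPerf Client harness.

```python
# Sketch of a single-stream latency loop: one query at a time, recording
# time-to-first-token (TTFT) and decode throughput per prompt.
import time
from typing import Callable, Iterable, Iterator


def measure_single_stream(prompts: Iterable[str],
                          generate_stream: Callable[[str], Iterator[str]]):
    """Feed prompts one at a time; generate_stream yields tokens as decoded."""
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        first_token_at = None
        tokens = 0
        for _token in generate_stream(prompt):   # tokens arrive as they are decoded
            if first_token_at is None:
                first_token_at = time.perf_counter()
            tokens += 1
        end = time.perf_counter()
        ttft = first_token_at - start if first_token_at is not None else float("nan")
        decode_time = end - (first_token_at or start)
        results.append({
            "time_to_first_token_s": ttft,
            "tokens_per_second": tokens / decode_time if decode_time > 0 else 0.0,
        })
    return results
```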
A key aspect of MLPerf Client is its emphasis on accuracy. While prioritizing performance, it incorporates the MMLU (Massive Multitask Language Understanding) benchmark to ensure the measured performance is achieved with acceptable accuracy. This prevents optimizations that might drastically improve speed but severely compromise the quality of the LLM’s output. The presenters emphasized that this is not intended to evaluate production-ready LLMs, but rather to provide a standardized and impartial way to compare the performance of different hardware and software configurations on common LLM tasks.
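The accuracy gate can be pictured as a simple pass/fail check layered on top of the performance runs. The sketch below assumes a hypothetical `answer` function and an illustrative threshold rather than the benchmark’s actual MMLU scoring rules.

```python
# Sketch of an accuracy gate: performance numbers only count if the
# (possibly heavily optimized) model still answers enough multiple-choice
# questions correctly. Question format and threshold are illustrative.
from typing import Callable, List, Tuple


def passes_accuracy_gate(questions: List[Tuple[str, str]],      # (prompt, correct_choice)
                         answer: Callable[[str], str],          # model returns "A"/"B"/"C"/"D"
                         threshold: float = 0.5) -> bool:
    correct = sum(1 for prompt, gold in questions
                  if answer(prompt).strip().upper() == gold)
    return correct / len(questions) >= threshold


# Usage idea: run the latency tests only if the gate passes, so an aggressive
# optimization that has lost too much quality cannot post a "fast" score.
# if passes_accuracy_gate(mmlu_sample, model.answer):
#     report(measure_single_stream(prompts, model.generate_stream))
```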
The benchmark utilizes a single-stream approach, feeding queries one at a time, and supports multiple GPU acceleration paths via ONNX Runtime and Intel OpenVINO. The presenters highlighted the flexibility of allowing hardware vendors to optimize the model (Llama 2 7B) for their specific devices, even down to 4-bit integer quantization, while maintaining sufficient accuracy as judged by the MMLU threshold. Future plans include expanding hardware support, adding more tests and models, and implementing a graphical user interface (GUI) to improve usability.
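As a rough illustration of that optimization latitude, the following sketch performs group-wise symmetric 4-bit weight quantization with NumPy and reports the reconstruction error an accuracy gate would need to catch. The group size and scaling scheme are illustrative choices, not any vendor’s actual toolchain.

```python
# Sketch of group-wise 4-bit weight quantization: each group of weights is
# scaled so its largest magnitude maps to +/-7, then rounded to int4 codes.
import numpy as np


def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Quantize a 1-D weight vector to signed 4-bit codes per group;
    return the codes, per-group scales, and the dequantized values."""
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0     # symmetric scaling
    scales = np.where(scales == 0, 1.0, scales)             # avoid divide-by-zero
    codes = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    dequant = (codes * scales).reshape(weights.shape)
    return codes, scales, dequant


w = np.random.randn(4096).astype(np.float32)
_, _, w_hat = quantize_int4(w)
print("mean abs error:", np.abs(w - w_hat).mean())          # quality loss the MMLU gate must catch
```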
Personnel: David Kanter
MLCommons MLPerf Storage
Watch on YouTube
Watch on Vimeo
MLCommons’ MLPerf Storage benchmark addresses the rapidly growing need for high-performance storage in AI training. Driven by the exponential increase in data volume and the even faster growth in data access demands, the benchmark aims to provide a standardized way to compare storage systems’ capabilities for AI workloads. This benefits purchasers seeking informed decisions, researchers developing better storage technologies, and vendors optimizing their products for AI’s unique data access patterns, which are characterized by random reads and massive data volume exceeding the capacity of most on-node storage solutions.
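To illustrate why these access patterns stress storage, the sketch below builds a per-epoch read plan in which every sample is read exactly once, but in a freshly shuffled order. The sample count, record size, and layout are invented for illustration only.

```python
# Sketch of the access pattern that makes AI training hard on storage: every
# epoch touches the whole dataset, but in a new random order, so the storage
# system sees scattered reads rather than a sequential scan.
import random

NUM_SAMPLES = 1_000_000          # far more than fits in on-node cache
SAMPLE_BYTES = 150 * 1024        # e.g. one serialized training record


def epoch_read_plan(epoch: int, seed: int = 0):
    order = list(range(NUM_SAMPLES))
    random.Random(seed + epoch).shuffle(order)               # new order every epoch
    # Each entry is a (offset, length) read the data loader would issue.
    return [(i * SAMPLE_BYTES, SAMPLE_BYTES) for i in order]


plan = epoch_read_plan(epoch=0)
print("first reads:", plan[:3])   # offsets jump around the dataset, defeating readahead
```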
The benchmark currently supports three training workloads (UNet3D, ResNet-50, and CosmoFlow) using PyTorch and TensorFlow, each imposing distinct demands on storage systems. Future versions will incorporate additional workloads, including a RAG (Retrieval Augmented Generation) pipeline with a vector database, reflecting the evolving needs of large language model training and inference. A key aspect is the focus on maintaining high accelerator utilization (aiming for 95%), making the storage system’s speed crucial for avoiding costly GPU idle time. The benchmark offers both “closed” (apples-to-apples comparisons) and “open” (allowing for vendor-specific optimizations) categories to foster innovation.
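The utilization target reduces to simple arithmetic: the fraction of wall-clock time the emulated accelerator spends computing rather than waiting on data. The timings and the 95 percent cutoff in the sketch below are illustrative, not the benchmark’s official rules.

```python
# Sketch of the accelerator-utilization idea: emulated compute time per batch
# versus time spent stalled waiting for the storage system to deliver data.
def accelerator_utilization(compute_time_s: float, io_wait_s: float) -> float:
    return compute_time_s / (compute_time_s + io_wait_s)


# 100 batches: 0.30 s of emulated GPU compute each, 0.02 s average data stall.
au = accelerator_utilization(compute_time_s=100 * 0.30, io_wait_s=100 * 0.02)
print(f"accelerator utilization: {au:.1%}")   # ~93.8% -- below a 95% target,
                                              # so this storage config would not qualify
```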
MLPerf Storage has seen significant adoption since its initial release, with a substantial increase in the number of submissions and participating organizations. This reflects the growing importance of AI in the market and the need for a standardized benchmark for evaluating storage solutions designed for these unique demands. The benchmark’s community-driven nature and transparency are enabling more informed purchasing decisions, moving beyond arbitrary vendor claims and providing a more objective way to assess the performance of storage systems in the critical context of modern AI applications.
Personnel: Curtis Anderson, David Kanter