Standardizing Gen AI Service Evaluation, An API-Centric Benchmarking Approach with David Kanter

This video is part of the appearance, "AI Field Day 8 Community Presentations". It was recorded as part of AI Field Day 8 at 10:30-11:30 on May 14, 2026.

David Kanter detailed the ongoing evolution of MLPerf benchmarks, which have been an industry standard for seven years. He highlighted the need for fundamental changes, particularly in the visualization of results, moving from an outdated, spreadsheet-like format to a more modern and understandable interface. MLPerf, backed by MLCommons, is widely used by over 100 members for internal testing, showcasing capabilities, and informing purchasing decisions. Its success stems from core principles of relevance, fairness, neutrality, reproducibility, and inclusiveness, all working together to foster trust and drive industry advancement.

The landscape of AI performance has shifted radically with the explosion of generative AI, marked by immense user adoption and an unprecedented velocity of change, with new models appearing almost fortnightly. To keep pace and better serve buyers, MLPerf is transitioning to an API-centric benchmarking approach. This involves moving away from a complex, locally installed load generator to a decoupled, Python-based test infrastructure that interacts with the system under test via a standard API, similar to the OpenAI API. The new architecture simplifies setup, accelerates the integration of new datasets and benchmarks, and supports direct measurement at each concurrency level, capturing critical metrics such as time to first token, throughput, and full response latency without relying on interpolation.
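To make the idea concrete, here is a minimal sketch of what a decoupled, API-driven harness could look like. It is not MLPerf's actual implementation. The names `fake_stream`, `timed_request`, `run_level`, and `sweep` are hypothetical, and `fake_stream` is a simulated stand-in for a streaming call to an OpenAI-style endpoint so that the example runs without a network. The sketch measures time to first token, full response latency, and token throughput directly at each concurrency level rather than interpolating between levels.

```python
import asyncio
import statistics
import time
from dataclasses import dataclass
from typing import AsyncIterator, Callable, Dict, List


@dataclass
class RequestResult:
    ttft_s: float      # time to first token, in seconds
    latency_s: float   # full response latency, in seconds
    tokens: int        # number of streamed tokens received


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming call to the system under test."""
    await asyncio.sleep(0.02)        # simulated delay before the first token
    for _ in range(8):
        await asyncio.sleep(0.002)   # simulated per-token delay
        yield "tok"


async def timed_request(stream_fn: Callable, prompt: str) -> RequestResult:
    """Issue one request and record when the first and last tokens arrive."""
    start = time.perf_counter()
    first = None
    tokens = 0
    async for _ in stream_fn(prompt):
        if first is None:
            first = time.perf_counter()
        tokens += 1
    end = time.perf_counter()
    ttft = (first if first is not None else end) - start
    return RequestResult(ttft, end - start, tokens)


async def run_level(stream_fn: Callable, prompts: List[str], concurrency: int) -> Dict:
    """Run all prompts with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(prompt: str) -> RequestResult:
        async with sem:
            return await timed_request(stream_fn, prompt)

    wall_start = time.perf_counter()
    results = await asyncio.gather(*(guarded(p) for p in prompts))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "requests": len(results),
        "median_ttft_s": statistics.median(r.ttft_s for r in results),
        "median_latency_s": statistics.median(r.latency_s for r in results),
        "tokens_per_s": sum(r.tokens for r in results) / wall,
    }


def sweep(stream_fn: Callable, prompts: List[str], levels: List[int]) -> List[Dict]:
    """Measure every requested concurrency level directly, one after another."""
    return [asyncio.run(run_level(stream_fn, prompts, c)) for c in levels]
```

Calling `sweep(fake_stream, ["example prompt"] * 8, [1, 4])` returns one row of metrics per concurrency level. Because the harness depends only on a callable that yields tokens, replacing `fake_stream` with a function that streams from a real endpoint would leave the measurement logic unchanged, which is the kind of decoupling the paragraph above describes.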

This strategic shift aims to significantly increase the velocity of benchmark submissions, allowing for more frequent updates than the current six-month cycle, while rigorously maintaining peer review and auditability to preserve trust. Kanter acknowledged the complex and multidimensional challenge of assessing quality in generative AI and agentic applications, a problem MLPerf is actively addressing in its long-term roadmap. He concluded by inviting feedback from the community, especially from enterprise buyers and analysts, to ensure the benchmarks remain relevant, understandable, and valuable for the widespread deployment of generative AI.

Personnel: David Kanter

