Charles Fan, Steve Scargall, and Bernie Wu presented for MemVerge at AI Field Day 6
This presentation took place on January 29, 2025, from 14:00 to 15:30.
Presenters: Bernie Wu, Charles Fan, Steve Scargall, Steve Yatko
At AI Field Day, MemVerge presents an in-depth exploration of their software solutions for optimizing AI infrastructure. Charles Fan outlines key challenges in AI infrastructure software, including inefficiencies in resource utilization and the need for robust memory management. Steve Scargall provides an overview of MemVerge’s approach with Memory Machine AI, designed to enhance performance and reliability. Bernie Wu offers a deep dive into their transparent checkpointing software, tailored for both CPU- and GPU-based AI workflows, and shares a roadmap for future developments. Steve Yatko of Oktay concludes with insights from a customer perspective, highlighting real-world applications and benefits.
Supercharging AI Infra with MemVerge Memory Machine AI
Watch on YouTube
Watch on Vimeo
Dr. Charles Fan’s presentation at AI Field Day 6 provided an overview of large language models (LLMs), agentic AI applications, and workflows for AI workloads, focusing on the impact of agentic AI on data center technology and AI infrastructure software. He highlighted the two primary ways enterprises are currently deploying AI: leveraging API services from providers like OpenAI and Anthropic, and deploying and fine-tuning open-source models within private environments for data privacy and cost savings. Fan emphasized the recent advancements in open-source models, particularly DeepSeek, which significantly reduces training costs, making on-premise deployment more accessible for enterprises.
The core of Fan’s presentation centered on MemVerge’s solution to the challenges of managing and optimizing AI workloads within the evolving data center architecture. This architecture is shifting from an x86-centric model to one dominated by GPUs and high-bandwidth memory, necessitating a new layer of AI infrastructure automation software. MemVerge’s software focuses on automating resource provisioning, orchestration, and optimization, bridging the gap between enterprise needs and the complexities of the new hardware landscape. A key problem addressed is the low GPU utilization in enterprises due to inefficient resource sharing, which MemVerge aims to improve through their “GPU-as-a-service” offering.
MemVerge’s “GPU-as-a-service” solution acts as an orchestrator, improving resource allocation and utilization, addressing the lack of effective virtualization for GPUs. This includes features like transparent checkpointing to minimize data loss during workload preemption and multi-vendor support for GPUs. Their upcoming Memory Machine AI platform will also encompass inference-as-a-service and fine-tuning-as-a-service, further simplifying the deployment and management of open-source models within private enterprise environments. Fan concluded by announcing a pioneer program to engage early adopters and collaborate on refining the platform to meet specific enterprise needs.
Personnel: Charles Fan
MemVerge Memory Machine AI GPU-as-a-Service
Watch on YouTube
Watch on Vimeo
Steve Scargall introduces Memory Machine AI (MMAI) software from MemVerge, a platform designed to optimize GPU usage for platform engineers, data scientists, developers, MLOps engineers, decision-makers, and project leads. The software addresses challenges in strategic resource allocation, flexible GPU sharing, real-time observability and optimization, and priority management. MMAI allows users to request specific GPU types and quantities, abstracting away the complexities of underlying infrastructure. Users interact with the platform through familiar environments like VS Code and Jupyter notebooks, simplifying the process of launching and managing AI workloads.
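To make the request abstraction concrete, here is a minimal sketch of the kind of interface described: a user asks for a GPU type and quantity, and the platform resolves the request against available inventory without the user knowing which node serves it. All names and the inventory shape are invented for illustration; this is not MemVerge’s actual API.

```python
from dataclasses import dataclass

@dataclass
class GpuRequest:
    gpu_type: str   # e.g. "A100" -- the user names a type, not a node
    count: int

# Toy inventory: node name -> (gpu_type, free_gpus). In a real platform
# this would come from cluster telemetry, not a hard-coded dict.
inventory = {
    "node-1": ("A100", 2),
    "node-2": ("H100", 4),
    "node-3": ("A100", 1),
}

def place(request):
    """Return the first node that can satisfy the request, else None."""
    for node, (gpu_type, free) in inventory.items():
        if gpu_type == request.gpu_type and free >= request.count:
            return node
    return None

print(place(GpuRequest("A100", 2)))  # a node with 2 free A100s
print(place(GpuRequest("H100", 8)))  # None: no node has 8 free H100s
```

The point of the abstraction is that the caller never names a node; placement is the platform’s job, which is what lets workloads later move without user involvement.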
A key feature of MMAI is its “GPU surfing” capability, which enables the dynamic movement of workloads between GPUs based on resource availability and priority. This is facilitated by MemVerge’s checkpointing technology, allowing seamless transitions without requiring users to manually manage or even be aware of the location of their computations. The platform supports both on-premises and cloud deployments, initially focusing on Kubernetes but with planned support for Slurm and other orchestration systems. This flexibility allows for integration with existing enterprise infrastructure and workflows, providing a path for organizations of various sizes and technical expertise to leverage their GPU resources more efficiently.
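The “GPU surfing” flow above can be sketched as a checkpoint on one node followed by a restore on another, with no progress lost. The checkpoint mechanics here are vastly simpler than MemVerge’s real transparent checkpointing (which captures process and GPU state, not a Python object); all names are illustrative.

```python
class Job:
    """Stand-in for a long-running training job."""
    def __init__(self, name):
        self.name = name
        self.step = 0
        self.node = None

    def run_steps(self, n):
        self.step += n

def checkpoint(job):
    # Capture everything needed to resume the job elsewhere.
    return {"name": job.name, "step": job.step}

def restore(state, node):
    # Rebuild the job on a different node from the saved state.
    job = Job(state["name"])
    job.step = state["step"]
    job.node = node
    return job

job = Job("train-llm")
job.node = "node-1"
job.run_steps(100)

state = checkpoint(job)          # pause on node-1, freeing its GPU
job = restore(state, "node-2")   # resume on node-2 at the same step
print(job.node, job.step)        # node-2 100
```

The user-visible property is the last line: the job is on a different node but at the same step, which is what lets the scheduler preempt and relocate work without the data scientist noticing.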
MMAI offers a comprehensive UI providing real-time monitoring and telemetry for both administrators and end-users. Features include departmental billing, resource sharing and bursting, and prioritized job execution. The software supports multiple GPU vendors (Nvidia initially, with AMD and Intel planned), allowing for heterogeneous environments. The presentation highlights the potential for future AI-driven scheduling and orchestration based on the rich telemetry data collected by MMAI, demonstrating a commitment to continuous improvement and optimization of GPU resource utilization in complex, multi-departmental settings. The business model is based on the number of managed GPUs.
Personnel: Steve Scargall
MemVerge Memory Machine AI Transparent Checkpointing
Watch on YouTube
Watch on Vimeo
Bernie Wu’s presentation at AI Field Day 6 detailed MemVerge’s transparent checkpointing technology for AI workloads, addressing limitations of existing checkpointing methods. This technology, implemented as an MMAI Kubernetes operator, enables efficient pausing and relocation of long-running GPU tasks without requiring application modifications or awareness. This contrasts with other schedulers that necessitate application-level changes or cold restarts, significantly improving resource management and reducing friction for users.
The core of MemVerge’s approach is its ability to perform transparent checkpointing at the platform level, distinct from the application-level checkpointing found in frameworks like PyTorch and TensorFlow. While the latter focuses on model optimization and rollback within the data scientist’s workflow, MemVerge’s solution targets site reliability engineers and platform engineers, handling tasks like graceful node maintenance, elastic workload bin-packing, and reclaiming idle resources, including spot instances. The technology, initially developed for CPUs, has been extended to GPUs through collaboration with NVIDIA, leveraging a two-stage checkpoint/restore process and techniques like incremental memory snapshots and asynchronous checkpointing to minimize overhead.
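The incremental-snapshot idea mentioned above can be illustrated with a toy model: only memory pages written since the last checkpoint are copied. Real GPU checkpointing tracks dirty pages at the driver or hardware level; here a dict of page contents plus a dirty set stands in, and all names are invented for illustration.

```python
class IncrementalCheckpointer:
    def __init__(self):
        self.saved = {}      # last checkpointed contents per page
        self.dirty = set()   # pages written since the last checkpoint

    def write(self, memory, page, value):
        # All writes go through here so dirty pages can be tracked.
        memory[page] = value
        self.dirty.add(page)

    def checkpoint(self, memory):
        # Copy only the pages touched since the previous checkpoint,
        # which is what keeps checkpoint overhead low for large memories.
        delta = {p: memory[p] for p in self.dirty}
        self.saved.update(delta)
        self.dirty.clear()
        return delta

mem = {}
ckpt = IncrementalCheckpointer()
ckpt.write(mem, 0, "a")
ckpt.write(mem, 1, "b")
first = ckpt.checkpoint(mem)    # first checkpoint copies both pages
ckpt.write(mem, 1, "c")
second = ckpt.checkpoint(mem)   # later checkpoint copies only page 1
print(len(first), len(second))  # 2 1
```

In a real system the delta would also be written out asynchronously, overlapping the copy with continued execution, which is the asynchronous-checkpointing technique the talk mentions.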
Future developments include parallelizing the checkpointing process for improved performance, extending support to AMD GPUs and multi-GPU nodes, and enabling cluster-wide checkpointing for distributed training and inferencing. MemVerge also plans to integrate their solution with other schedulers and expand its use cases to encompass hybrid cloud scheduling, heterogeneous pipelines, and HPC environments, further streamlining AI workload management and enhancing operational efficiency.
Personnel: Bernie Wu
MemVerge Fireside Chat with Steve Yatko of Oktay
Watch on YouTube
Watch on Vimeo
Charles Fan and Steve Yatko discussed enterprise experiences with AI application and infrastructure deployments. The conversation highlighted the challenges faced by organizations adopting generative AI, particularly the unpreparedness for the rapid advancements and the need for strategic planning. Key challenges revolved around defining appropriate use cases for generative AI, maximizing business value and revenue generation, and effectively managing confidential data within AI initiatives. The discussion also touched upon simpler issues like improving developer productivity and documentation.
A central emerging theme was the critical need for manageable AI application workloads and efficient resource utilization. Steve Yatko, drawing on his extensive experience in financial services and technology, emphasized the importance of dynamic resource management, similar to the evolution of virtualization technology. He highlighted the limitations of existing approaches and the advantages offered by MemVerge’s technology in enabling seamless resource allocation, mobilization of development and testing environments, and efficient cost optimization. This included the ability to create internal spot markets for resources, thereby maximizing utilization and sharing across departments.
Yatko specifically praised MemVerge’s technology for its ability to address the critical challenges facing enterprises in the AI space, particularly in financial services. He noted the ability to checkpoint and restore workloads across nodes, enabling greater flexibility and resilience. The solution’s support for multi-cloud, hybrid cloud, and diverse GPU configurations makes it particularly relevant for organizations needing adaptable and scalable solutions. Overall, the presentation positioned MemVerge’s platform as a crucial component for enterprises to efficiently and cost-effectively deploy and manage AI applications at scale, ultimately unlocking productivity and driving greater business value.
Personnel: Charles Fan, Steve Yatko