Tech Field Day

The Independent IT Influencer Event


MemVerge Memory Machine AI Transparent Checkpointing



AI Field Day 6


This video is part of the appearance “MemVerge Presents at AI Field Day 6”. It was recorded at AI Field Day 6 from 14:00 to 15:30 on January 29, 2025.



At AI Field Day 6, Bernie Wu presented MemVerge’s transparent checkpointing technology for AI workloads, which addresses limitations of existing checkpointing methods. Implemented as an MMAI Kubernetes operator, it pauses and relocates long-running GPU tasks without requiring the application to be modified or even to be aware of the checkpoint. Other schedulers, by contrast, require application-level changes or cold restarts. Avoiding both improves resource management and reduces friction for users.

MemVerge’s approach centers on transparent checkpointing at the platform level, as distinct from the application-level checkpointing found in frameworks like PyTorch and TensorFlow. Application-level checkpointing serves model optimization and rollback within the data scientist’s workflow. MemVerge’s solution instead targets site reliability engineers and platform engineers, handling tasks such as graceful node maintenance, elastic workload bin-packing, and reclaiming idle resources, including spot instances. The technology, initially developed for CPUs, has been extended to GPUs through collaboration with NVIDIA. It uses a two-stage checkpoint/restore process, along with techniques such as incremental memory snapshots and asynchronous checkpointing, to minimize overhead.
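
To make the ideas of incremental snapshots and asynchronous checkpointing more concrete, here is a minimal Python sketch. It is a toy model only and does not represent MemVerge’s implementation or any NVIDIA API. The page size, the `IncrementalCheckpointer` class, and all of its method names are hypothetical, and the sketch assumes the buffer never shrinks between checkpoints. The "memory" is a plain byte buffer; each checkpoint stores only the pages whose contents changed since the previous one, and the asynchronous variant copies the state first so the workload can keep running while a background thread does the saving.

```python
from __future__ import annotations

import hashlib
import threading

PAGE_SIZE = 4096  # hypothetical page size for this toy model


def split_pages(memory: bytes) -> list[bytes]:
    """Split a flat byte buffer into fixed-size pages."""
    return [memory[i:i + PAGE_SIZE] for i in range(0, len(memory), PAGE_SIZE)]


class IncrementalCheckpointer:
    """Toy model: keep one hash per page and save only the pages that changed."""

    def __init__(self) -> None:
        self.page_hashes: dict[int, bytes] = {}
        self.snapshots: list[dict[int, bytes]] = []  # one delta per checkpoint

    def checkpoint(self, memory: bytes) -> dict[int, bytes]:
        """Record the pages that differ from the previous checkpoint."""
        delta: dict[int, bytes] = {}
        for index, page in enumerate(split_pages(memory)):
            digest = hashlib.sha256(page).digest()
            if self.page_hashes.get(index) != digest:
                delta[index] = page
                self.page_hashes[index] = digest
        self.snapshots.append(delta)
        return delta

    def checkpoint_async(self, memory: bytes | bytearray) -> threading.Thread:
        """Freeze a copy of the state, then save it on a background thread."""
        frozen = bytes(memory)  # the caller may keep mutating its own buffer
        worker = threading.Thread(target=self.checkpoint, args=(frozen,))
        worker.start()
        return worker

    def restore(self) -> bytes:
        """Rebuild the latest state by replaying every delta in order."""
        pages: dict[int, bytes] = {}
        for delta in self.snapshots:
            pages.update(delta)
        return b"".join(pages[i] for i in sorted(pages))


# Usage: the first checkpoint saves every page, the second only the page touched.
state = bytearray(4 * PAGE_SIZE)
cp = IncrementalCheckpointer()
first = cp.checkpoint(bytes(state))
state[PAGE_SIZE + 10] = 0xFF              # modify a single byte in page 1
cp.checkpoint_async(state).join()         # wait for the background save
second = cp.snapshots[-1]
```

In this sketch the second checkpoint is much smaller than the first because only one page changed, which is the intuition behind using incremental snapshots to reduce checkpoint overhead. A real platform-level system would track dirty pages with operating system and driver support rather than by hashing, and would also capture GPU state.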

Planned developments include parallelizing the checkpointing process for better performance, extending support to AMD GPUs and multi-GPU nodes, and enabling cluster-wide checkpointing for distributed training and inference. MemVerge also plans to integrate its solution with other schedulers and to expand its use cases to hybrid cloud scheduling, heterogeneous pipelines, and HPC environments, further streamlining AI workload management.

Personnel: Bernie Wu
