Tech Field Day

The Independent IT Influencer Event

  • Home
    • The Futurum Group
    • FAQ
    • Staff
  • Sponsors
    • Sponsor List
      • 2025 Sponsors
      • 2024 Sponsors
      • 2023 Sponsors
      • 2022 Sponsors
    • Sponsor Tech Field Day
    • Best of Tech Field Day
    • Results and Metrics
    • Preparing Your Presentation
      • Complete Presentation Guide
      • A Classic Tech Field Day Agenda
      • Field Day Room Setup
      • Presenting to Engineers
  • Delegates
    • Delegate List
      • 2025 Delegates
      • 2024 Delegates
      • 2023 Delegates
      • 2022 Delegates
      • 2021 Delegates
      • 2020 Delegates
      • 2019 Delegates
      • 2018 Delegates
    • Become a Field Day Delegate
    • What Delegates Should Know
  • Events
    • All Events
      • Upcoming
      • Past
    • Field Day
    • Field Day Extra
    • Field Day Exclusive
    • Field Day Experience
    • Field Day Live
    • Field Day Showcase
  • Topics
    • Tech Field Day
    • Cloud Field Day
    • Mobility Field Day
    • Networking Field Day
    • Security Field Day
    • Storage Field Day
  • News
    • Coverage
    • Event News
    • Podcast
  • When autocomplete results are available use up and down arrows to review and enter to go to the desired page. Touch device users, explore by touch or with swipe gestures.
You are here: Home / Videos / MemVerge Memory Machine AI Transparent Checkpointing

MemVerge Memory Machine AI Transparent Checkpointing



AI Field Day 6


This video is part of the appearance, “MemVerge Presents at AI Field Day 6“. It was recorded as part of AI Field Day 6 at 14:00-15:30 on January 29, 2025.


Watch on YouTube
Watch on Vimeo

Bernie Wu’s presentation at AI Field Day 6 detailed MemVerge’s transparent checkpointing technology for AI workloads, addressing limitations of existing checkpointing methods. This technology, implemented as an MMAI Kubernetes operator, enables efficient pausing and relocation of long-running GPU tasks without requiring application modifications or awareness. This contrasts with other schedulers that necessitate application-level changes or cold restarts, significantly improving resource management and reducing friction for users.

The core of MemVerge’s approach is its ability to perform transparent checkpointing at the platform level, distinct from application-level checkpointing found in frameworks like PyTorch and TensorFlow. While the latter focuses on model optimization and rollback within the data scientist’s workflow, MemVerge’s solution targets site reliability engineers and platform engineers, handling tasks like graceful node maintenance, elastic workload bin-packing, and reclaiming idle resources, including spot instances. The technology initially developed for CPUs has been extended to GPUs through collaboration with NVIDIA, leveraging a two-stage checkpoint/restore process and techniques like incremental memory snapshots and asynchronous checkpointing to minimize overhead.

Future developments include parallelizing the checkpointing process for improved performance, extending support to AMD GPUs and multi-GPU nodes, and enabling cluster-wide checkpointing for distributed training and inferencing. MemVerge also plans to integrate their solution with other schedulers and expand its use cases to encompass hybrid cloud scheduling, heterogeneous pipelines, and HPC environments, further streamlining AI workload management and enhancing operational efficiency.

Personnel: Bernie Wu


  • Bluesky
  • LinkedIn
  • Mastodon
  • RSS
  • Twitter
  • YouTube

Event Calendar

  • Jul 9-Jul 10 — Networking Field Day 38
  • Aug 19-Aug 20 — Tech Field Day Extra at SHARE Cleveland 2025
  • Sep 10-Sep 11 — AI Infrastructure Field Day 3
  • Sep 24-Sep 25 — Security Field Day 14
  • Oct 22-Oct 23 — Cloud Field Day 24
  • Oct 29-Oct 30 — AI Field Day 7

Latest Coverage

  • Backups That Belong in the Cloud—But Not Too Close
  • Using a network diagram to configure the network – Another example of GenAI being used by networking vendors
  • Breaking Free from Hardcoded Security: Microsoft Introduces Human-in-the-Loop AI Agents
  • QlikConnect 25 – Keynote Updates – 05/14/2025
  • Uncompromising Network Visibility: How cPacket Augments Security with Advanced Telemetry and AI

Tech Field Day News

  • Have A Classy Time with Tech Field Day Extra at Cisco Live US 2025
  • Exploring Cloud Resilience, AI, and Data at Cloud Field Day 23

Return to top of page

Copyright © 2025 · Genesis Framework · WordPress · Log in