Storage Density for AI Without Compromise with Solidigm. Allyn Malventano, AI/SSD Technologist at Solidigm, introduced his role in categorizing AI workloads and optimizing SSDs for them, then outlined the presentation: an overview of storage applications for AI, a cluster-scale exercise with MinIO, and a brief look at the MemKV and Metrum RAG work. He pointed out that a traditional storage diagram from only a couple of years ago made no mention of KV caching, which illustrates how quickly AI infrastructure has evolved; KV caching has since become crucial for accelerating inference as AI adoption spreads. Solidigm positions its high-capacity drives in the lower, denser tiers of the AI storage pyramid, which are becoming more important as higher-tier resources such as HBM and DRAM face severe constraints and rising prices.
Solidigm ran a cluster-scale exercise with MinIO on an 8×8 setup: eight high-performance client systems initiating workloads and eight server systems fully populated with Solidigm’s 122 TB drives, for a total of roughly 24 petabytes of storage. The clients used the MinIO benchmarking tool with simulated GPU workloads to emulate how real initiators would drive the storage. The cluster used 400 Gigabit Ethernet NICs and achieved over 250 gigabytes per second in aggregate over TCP, a figure Malventano found extreme given TCP’s inherent overhead. This baseline performance scaled almost linearly up to 8 nodes and was the result of extensive tuning across the network stack, including switch and NIC buffer adjustments based on precise cable lengths and on MinIO’s parity layout. The tuning yielded a threefold increase in performance over the initial setup.
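As a rough sanity check on these figures, the short Python sketch below works out the drive count per server implied by the 24 PB total and the fraction of raw line rate that 250 GB/s represents. It assumes one 400 GbE port per server, which the summary does not state, and it ignores protocol overhead, so the result is only indicative.

```python
# Back-of-envelope check of the cluster figures quoted in the summary.
SERVERS = 8               # from the summary
DRIVE_TB = 122            # from the summary (decimal terabytes)
TOTAL_PB = 24             # from the summary
NIC_GBIT = 400            # from the summary
MEASURED_GBYTE_S = 250    # from the summary ("over 250"); must be an aggregate,
                          # since a single 400 GbE link tops out at 50 GB/s
PORTS_PER_SERVER = 1      # ASSUMPTION: not stated in the summary


def implied_drives_per_server(total_pb: float, servers: int, drive_tb: float) -> float:
    """Drives per server implied by the total capacity (1 PB = 1000 TB)."""
    return total_pb * 1000 / servers / drive_tb


def aggregate_line_rate_gbyte_s(nodes: int, ports: int, gbit: float) -> float:
    """Raw aggregate line rate in gigabytes per second (8 bits per byte)."""
    return nodes * ports * gbit / 8


drives = implied_drives_per_server(TOTAL_PB, SERVERS, DRIVE_TB)
line_rate = aggregate_line_rate_gbyte_s(SERVERS, PORTS_PER_SERVER, NIC_GBIT)
fraction = MEASURED_GBYTE_S / line_rate

print(f"Implied drives per server: {drives:.1f}")
print(f"Aggregate line rate: {line_rate:.0f} GB/s")
print(f"Measured throughput as a share of line rate: {fraction:.1%}")
```

Under that single-port assumption, the measured figure is a little under two thirds of the raw line rate, which gives some context for why the result was considered high for TCP.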
Looking ahead, Solidigm aims to explore further performance enhancements. One potential next step is dual-pathing to double initiator bandwidth, which is especially relevant for future integration with compute nodes such as the NVIDIA B200 or B300. A more significant avenue is RDMA (Remote Direct Memory Access), which lets network adapters move data directly between the memory of two systems and so bypasses CPU overhead during transfers. Although the current exercise simulated GPU memory with DRAM, RDMA would still significantly reduce CPU bottlenecks and allow even higher throughput. Combining dual-pathing with RDMA is a complex but promising way to keep raising storage performance for demanding AI workloads, despite the considerable tuning challenges involved.
Personnel: Allyn Malventano
This presentation introduces MinIO MemKV, a critical new memory layer designed to address the challenges of scaling AI inference workloads, particularly for agentic applications. The core problem stems from the increasing size of context memory, known as the KV cache, which frequently exceeds the high-bandwidth memory (HBM) capacity of GPUs. In agentic workloads, where requests build upon previous interactions, this context memory constantly expands. When HBM is exhausted, GPUs resort to evicting and recomputing the KV cache, leading to wasted cycles during the “pre-fill” phase and significantly increasing the “Time To First Token” (TTFT) for users. This inefficient utilization plagues modern inference deployments, where GPUs might appear 100% utilized but are largely performing redundant computations.
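To make the capacity problem concrete, here is a minimal sizing sketch in Python. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 128K-token context are hypothetical values chosen for illustration and do not come from the presentation. The 141 GB budget is the HBM capacity of an H200, and the calculation optimistically ignores the space taken by model weights.

```python
# Rough KV cache sizing for a transformer decoder.
# All model-shape numbers below are HYPOTHETICAL, for illustration only.
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2        # 16-bit values
CONTEXT_TOKENS = 131_072   # a 128K-token session
HBM_BUDGET_BYTES = 141e9   # H200 HBM capacity; weights are ignored here


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Each token stores one key and one value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_bytes_for_context(tokens: int) -> int:
    """KV cache size for one session holding `tokens` tokens of context."""
    return tokens * kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)


per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
per_session = kv_bytes_for_context(CONTEXT_TOKENS)
sessions_in_hbm = int(HBM_BUDGET_BYTES // per_session)

print(f"KV cache per token:   {per_token / 1024:.0f} KiB")
print(f"KV cache per session: {per_session / 2**30:.0f} GiB")
print(f"Full sessions that fit in the HBM budget: {sessions_in_hbm}")
```

Even with these optimistic assumptions, only a handful of long-context sessions fit in HBM at once, which is the situation in which eviction and recomputation begin.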
MinIO MemKV offers a purpose-built solution by creating a distributed shared memory layer that GPUs can access quickly. It bypasses traditional file systems and kernel overhead by leveraging direct NVMe access to achieve microsecond latency. Unlike conventional enterprise storage, MemKV is engineered as an extension of memory, free from the “baggage” of durability features unnecessary for transient KV cache data. Benchmarks demonstrate impressive gains: a single H200 GPU node with two MemKV servers can handle up to 43 times more concurrent requests, achieve aggregated throughputs of nearly 97 gigabytes per second, and scale linearly to support vast superpods with thousands of GPUs and petabytes of context memory, delivering 16,000 concurrent requests per second and 12 terabits per second throughput. This dramatically reduces TTFT and ensures that GPUs focus on useful decoding rather than repetitive recomputation.
The speakers argue that this bottleneck is fundamentally a “software problem” best addressed through optimized software rather than through complex hardware solutions like CXL, which they consider too low-level and too slow to adapt. MemKV’s approach allows for a “G3.5” memory tier that efficiently serves large KV cache blocks (tensors of 2 MB to 64 MB) directly from NVMe, avoiding file system metadata overhead. This enables better effective GPU utilization and significant cost reductions, estimated at over $2 million per year for a single H200 node in the public cloud. By providing a fast, scalable, and shared context memory, MemKV keeps GPUs doing meaningful work, improving efficiency and handling the high concurrency demanded by modern AI inference.
Personnel: AB Periasamy, Dil Radhakrishnan
This presentation detailed Solidigm’s collaboration with Metrum to explore how storage, specifically Solidigm’s SSDs, can effectively function as AI memory for Retrieval Augmented Generation (RAG) and Key-Value (KV) cache operations. Metrum’s core challenge involved ingesting vast amounts of video, generating detailed metadata from it, and creating a quickly searchable database for AI inference. The objective was to determine whether SSDs could replace or improve upon traditional memory-based vector databases, thereby optimizing performance and reducing costs. This led to an evaluation, run on Solidigm SSDs, that compared HNSW, a vector index that resides primarily in memory, with DiskANN, a vector index explicitly optimized for storage.
Metrum’s benchmarks across datasets of 1 million, 10 million, and 100 million entries revealed compelling results. While HNSW showed a slight edge at lower concurrencies with smaller datasets, DiskANN performed better as concurrency increased, especially with real-world video pipeline data. This indicated that a purpose-built, storage-first approach can outperform memory-centric solutions in critical scenarios, and that large-scale vector databases can save significantly on DRAM without sacrificing performance. The ability to move these substantial datasets from expensive DRAM to more economical SSDs is a powerful advantage for AI infrastructure.
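For a sense of scale, the sketch below estimates the raw vector data that a fully memory-resident index would have to hold at each benchmarked dataset size. The 768-dimension float32 embedding is an assumption for illustration; the summary does not give the embedding size, and index structures such as graph links add further overhead on top of these numbers.

```python
# Raw memory footprint of the vectors alone, before any index overhead.
DIM = 768             # ASSUMPTION: embedding dimension, not given in the summary
BYTES_PER_VALUE = 4   # ASSUMPTION: float32 components


def raw_vector_bytes(num_vectors: int, dim: int = DIM, bytes_per_value: int = BYTES_PER_VALUE) -> int:
    """Bytes needed to store num_vectors dense vectors of the given shape."""
    return num_vectors * dim * bytes_per_value


for n in (1_000_000, 10_000_000, 100_000_000):
    gigabytes = raw_vector_bytes(n) / 1e9
    print(f"{n:>11,} vectors -> {gigabytes:,.1f} GB of raw vector data")
```

At the largest size the raw vectors alone run to hundreds of gigabytes under these assumptions, which is where keeping most of the data on SSD becomes economically attractive.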
Beyond vector database lookups, the presentation emphasized the critical role of SSDs in KV-cache offload. By caching recurrent queries on Solidigm drives, the system drastically improved the “time to first token” and eliminated redundant GPU recomputation, which otherwise wastes power and delays responses. This efficiency is crucial for serving hundreds or thousands of concurrent users: it provides faster, more responsive AI interactions and frees up valuable GPU compute cycles for other tasks. The concept extends beyond video to any large-scale AI system that benefits from efficiently persisted context and reduced re-inference, showing how intelligent storage integration can deliver substantial gains in both performance and operational efficiency across the entire AI pipeline.
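The idea of reusing cached context can be illustrated with a toy Python sketch. It is not the system shown in the presentation: the `ToyKVStore` class, the `block_keys` helper, and the four-token block size are all hypothetical, and a plain dictionary stands in for the SSD-backed store. Each block of tokens is keyed by a hash of the entire prefix up to that block, so a follow-up request can find how much of its prompt already has stored KV data and prefill only the remainder.

```python
import hashlib
from typing import Dict, List, Optional

BLOCK_TOKENS = 4  # HYPOTHETICAL block size, kept tiny for illustration


def block_keys(tokens: List[int], block_tokens: int = BLOCK_TOKENS) -> List[str]:
    """One key per full block; each key hashes the whole prefix up to that block."""
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % block_tokens
    for start in range(0, full, block_tokens):
        for token in tokens[start:start + block_tokens]:
            running.update(token.to_bytes(4, "little"))
        keys.append(running.copy().hexdigest())
    return keys


class ToyKVStore:
    """In-memory stand-in for an SSD-backed KV block store (hypothetical)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, key: str, block: bytes) -> None:
        self._blocks[key] = block

    def get(self, key: str) -> Optional[bytes]:
        return self._blocks.get(key)


def reusable_prefix_tokens(store: ToyKVStore, tokens: List[int]) -> int:
    """Count leading tokens whose KV blocks are already in the store."""
    reused = 0
    for key in block_keys(tokens):
        if store.get(key) is None:
            break
        reused += BLOCK_TOKENS
    return reused


store = ToyKVStore()
first_request = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for key in block_keys(first_request):
    store.put(key, b"kv-block")  # placeholder for real KV tensors

follow_up = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13]
reused = reusable_prefix_tokens(store, follow_up)
to_prefill = len(follow_up) - reused
print(f"Reused {reused} tokens from the store; {to_prefill} still need prefill")
```

Hashing the whole prefix rather than each block in isolation matters because a block's KV values depend on every token before it; two requests can share a stored block only if they agree on everything up to that point.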
Personnel: Allyn Malventano