Solidigm presented on the anatomy of an AI prompt, emphasizing storage as a critical day-zero design decision for building AI inference infrastructure. Kapil Karkra demonstrated how a simple user prompt can undergo massive “token amplification,” transforming a small input into tens of thousands of tokens sent to a Large Language Model. This expansion occurs as an AI agent incorporates domain-specific rules, retrieves relevant contextual information from memory or vector databases, integrates tool definitions for planning, and maintains session history. This process highlights storage’s role in assembling the comprehensive prompt that an LLM processes.
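The amplification described above can be sketched in a few lines. This is a minimal illustration, not Solidigm's implementation: the part names, the token counts, and the whitespace-based `count_tokens` helper are all assumptions made for the example, and a real agent would use the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class PromptPart:
    name: str  # hypothetical label such as "domain_rules"; names must be unique
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokens.
    return len(text.split())


def assemble_prompt(user_input: str, context_parts: list[PromptPart]):
    """Join the context parts and the user input into one prompt.

    Returns the prompt text and a per-part token breakdown.
    """
    parts = context_parts + [PromptPart("user_input", user_input)]
    prompt = "\n\n".join(p.text for p in parts)
    breakdown = {p.name: count_tokens(p.text) for p in parts}
    return prompt, breakdown


def amplification(breakdown: dict[str, int]) -> float:
    # Ratio of tokens sent to the LLM to tokens the user actually typed.
    return sum(breakdown.values()) / max(1, breakdown["user_input"])


# Made-up sizes for the four sources of expansion named in the summary.
context = [
    PromptPart("domain_rules", "rule " * 4000),
    PromptPart("retrieved_context", "chunk " * 12000),
    PromptPart("tool_definitions", "tool " * 3000),
    PromptPart("session_history", "turn " * 5000),
]
prompt, breakdown = assemble_prompt("reset my password", context)
print(breakdown)
print(f"amplification: {amplification(breakdown):.0f}x")
```

Every part other than `user_input` has to be fetched from somewhere before the prompt can be sent, which is where storage enters the picture.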
The presentation then decomposed the “time to first token,” a key metric for LLM responsiveness, into several components. These include time for prompt assembly, network transmission, queuing delays at the LLM provider, and the GPU-intensive prefill and decode phases. Storage directly influences the assembly time by rapidly providing context and significantly impacts the prefill phase. By caching stable portions of the LLM’s context, storage can convert the computationally expensive, O(n^2) complexity of GPU recomputation into much faster O(n) reads, thereby freeing up valuable GPU resources and reducing power consumption. This makes high-performance storage essential for maintaining low latency and efficient inference.
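A rough cost model makes the decomposition and the O(n^2) versus O(n) argument concrete. The sketch below is an assumption-laden illustration: the field names, the millisecond values, and the "token pair" unit of attention work are invented for the example and do not come from the presentation.

```python
from dataclasses import dataclass, astuple


@dataclass
class TtftBreakdown:
    # Hypothetical field names mirroring the components listed in the summary.
    assembly_ms: float
    network_ms: float
    queue_ms: float
    prefill_ms: float
    decode_ms: float  # time to decode the first output token

    def total_ms(self) -> float:
        return sum(astuple(self))


def prefill_pairs(n_tokens: int) -> int:
    # Causal attention: each token attends to itself and every earlier token,
    # so the work grows as n(n+1)/2, which is O(n^2).
    return n_tokens * (n_tokens + 1) // 2


def prefill_pairs_with_cache(n_total: int, n_cached: int) -> int:
    # With the first n_cached tokens restored from a KV cache, only the new
    # tokens need attention work; the cached prefix costs O(n_cached) reads.
    return prefill_pairs(n_total) - prefill_pairs(n_cached)


ttft = TtftBreakdown(assembly_ms=5, network_ms=20, queue_ms=10,
                     prefill_ms=300, decode_ms=15)
print(f"TTFT: {ttft.total_ms()} ms")

full = prefill_pairs(1000)
partial = prefill_pairs_with_cache(1000, 900)
print(f"attention pairs without cache: {full}, with 900 tokens cached: {partial}")
```

In this toy model, restoring 900 of 1,000 tokens from cache removes roughly four fifths of the attention work, at the price of reading 900 cached entries from storage.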
Solidigm identified two critical workloads that demand specific storage characteristics for optimal AI performance. Retrieval-Augmented Generation (RAG) issues frequent, small, random reads against vector databases at low queue depths but high concurrency, necessitating SSDs with excellent P99 or P99.9 tail latency. KV cache offload, on the other hand, is bandwidth-oriented: it relies on large block reads to efficiently retrieve cached context states that spill from HBM as context lengths grow. Session data quickly scales from gigabytes to terabytes per fleet, so the presentation concluded by stressing that procurement decisions should be based on these specific workload demands rather than general high-queue-depth performance metrics, and it advocated tiered storage solutions tailored to each workload’s unique needs.
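The gigabytes-to-terabytes scaling follows from simple arithmetic on the KV cache. The sketch below uses the standard per-token sizing for a transformer (one key and one value vector per layer per KV head); the model dimensions, context length, and session count are hypothetical values chosen for illustration, not figures from the presentation.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    # Factor of 2 covers the key and the value stored for every layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_bytes_per_session(context_tokens: int, per_token_bytes: int) -> int:
    return context_tokens * per_token_bytes


GIB = 2 ** 30
TIB = 2 ** 40

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)
per_session = kv_bytes_per_session(context_tokens=32_768,
                                   per_token_bytes=per_token)
fleet = per_session * 1_000  # assumed number of concurrent sessions

print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"per session: {per_session / GIB:.1f} GiB")
print(f"per fleet:   {fleet / TIB:.2f} TiB")
```

Under these assumptions a single long session already occupies a few gibibytes, and a thousand of them reach terabyte scale, far beyond what HBM alone can hold.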
Personnel: Kapil Karkra, Scott Shadley
Solidigm emphasizes that storage is a foundational, day-zero consideration for successful AI infrastructure, likening it to the essential dough for a great pizza. The company advocates for a holistic approach to AI storage architecture, moving beyond simply adding more bytes to existing boxes. This involves understanding the unique demands of AI workloads and designing specialized solutions. Solidigm breaks down storage into tiers, from high-bandwidth memory (HBM) and DRAM at the top (G1, G2) to local performance NVMe (G2.5, G3), in-rack DPU-led NVMe (G3.5), and network-attached NVMe with hard drives for larger capacities (G4). The adoption of NVMe is crucial across these tiers to support the rapid data access required by modern AI.
The presentation highlights significant global infrastructure challenges, including power availability, sustainability directives, and physical footprint limitations. Solidigm addresses these with innovative hardware designs, such as liquid-cooled E1.S form factor SSDs that offer substantially lower power consumption and improved thermal management compared to traditional air-cooled or larger form factors. This allows for higher density compute and more efficient data centers. The company actively co-designs solutions with partners like NVIDIA to ensure optimal thermal dissipation and integration. Furthermore, Solidigm points to opportunities for reducing environmental impact through heat recapture and extended product lifespans with SSDs, which outlast hard drives.
Solidigm is redefining value beyond traditional Total Cost of Ownership (TCO) by focusing on an “efficiency estimator” that incorporates sustainability impact. This approach considers how optimized storage choices can free up finite power for more compute resources within a rack. They offer a range of SSD products tailored to different AI workload characteristics, from high-performance ephemeral caches to durable, high-capacity long-term storage. By consulting with customers across various segments, Solidigm aims to provide the right storage solutions that effectively integrate into diverse AI architectures, acknowledging that success comes from a collaborative ecosystem approach rather than just selling individual drives.
Personnel: Scott Shadley
Solidigm’s presentation at AI Field Day 8, led by Kapil Karkra, highlighted memory capacity as a critical, often overlooked, third axis for scaling AI intelligence, alongside model size and compute power. Solidigm introduced its “CRAFT” framework to define and measure AI intelligence across five dimensions: Comprehension, Recall, Adaptability, Fluency, and Tenacity. The core argument is that expanding memory capacity beyond the GPU’s high-bandwidth memory (HBM) to system DRAM and NVMe SSDs dramatically improves AI performance and quality by enabling more efficient inference and preventing costly recomputations.
Through various benchmarks and experiments, Solidigm demonstrated the impact of memory capacity on each CRAFT dimension. For Recall, offloading the Key-Value (KV) cache to SSDs prevented the GPU from recomputing previous states, significantly boosting throughput. Tenacity was illustrated with an AIME 2024 math test, where increased output token capacity allowed the model to deliberate longer and achieve a higher score, showing how more “scratch space” leads to better reasoning quality. Adaptability, measured by requests per second, and Fluency, indicated by inter-token latency, both improved substantially (up to 4x throughput and 21x better latency) when NVMe SSDs extended the KV cache, allowing the system to handle more concurrent requests without compromising responsiveness. Similarly, Comprehension, tested with a “needle in a haystack” benchmark, showed 78x faster reading when the context fit in the extended cache.
The presentation concluded that while higher bandwidth storage is beneficial when working sets fit within faster tiers, ultimately, sheer capacity becomes paramount for larger, more complex AI workloads involving multiple agents and extensive context lengths. The discussion emphasized the need for a tiered memory hierarchy, where automatic caching across HBM, DRAM, and NVMe SSDs optimizes resource utilization and avoids GPU stalls. This approach allows organizations to balance performance and cost effectively, ensuring that AI systems can sustain deeper reasoning, handle greater concurrency, and deliver higher quality, more fluent responses by leveraging expanded memory capacity.
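One way to picture automatic caching across tiers is a chain of least-recently-used stores, where an entry evicted from a faster tier is demoted to the next one and a hit in a slower tier is promoted back to the top. The class below is a toy sketch under that assumption; the `TieredKVCache` name, the entry-count capacities, and the LRU policy are illustrative choices, not a description of any Solidigm or inference-framework implementation.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy LRU cache chained across tiers, fastest tier first."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), e.g. HBM, DRAM, NVMe.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier; the state must be recomputed
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        if len(store) > cap:
            # Demote the least recently used entry to the next tier down.
            old_key, old_value = store.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_name), or (None, None) on a miss."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # promote to the fastest tier
                return value, name
        return None, None


cache = TieredKVCache([("HBM", 2), ("DRAM", 2), ("NVMe", 2)])
for i in range(5):
    cache.put(f"session-{i}", f"kv-state-{i}")

print(cache.get("session-0"))  # oldest session, served from the slowest tier
print(cache.get("session-0"))  # promoted on the previous hit
```

Adding capacity to the lowest tier is what keeps older sessions from falling off the end of the chain, which is the point at which the GPU would have to recompute their state.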
Personnel: Kapil Karkra, Scott Shadley