The Anatomy of a Prompt with Solidigm

This video is part of the appearance, "Solidigm Presents at AI Field Day 8". It was recorded as part of AI Field Day 8 at 13:30-15:00 on May 14, 2026.


Solidigm presented on the anatomy of an AI prompt, emphasizing storage as a critical day-zero design decision for building AI inference infrastructure. Kapil Karkra demonstrated how a simple user prompt can undergo massive “token amplification,” transforming a small input into tens of thousands of tokens sent to a Large Language Model. This expansion occurs as an AI agent incorporates domain-specific rules, retrieves relevant contextual information from memory or vector databases, integrates tool definitions for planning, and maintains session history. This process highlights storage’s role in assembling the comprehensive prompt that an LLM processes.
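The token amplification described above can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the `PromptParts` fields, the section ordering, and the whitespace-based `approx_tokens` counter (a real tokenizer would give different counts). It is not the agent or tooling shown in the presentation.

```python
from dataclasses import dataclass


def approx_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())


@dataclass
class PromptParts:
    """Hypothetical pieces an agent gathers before calling the LLM."""
    user_input: str
    system_rules: str
    retrieved_context: list[str]
    tool_definitions: list[str]
    session_history: list[str]


def assemble_prompt(parts: PromptParts) -> str:
    """Concatenate all pieces into the single prompt the LLM actually receives."""
    sections = [
        parts.system_rules,
        *parts.tool_definitions,
        *parts.retrieved_context,
        *parts.session_history,
        parts.user_input,
    ]
    return "\n\n".join(sections)


def amplification_factor(parts: PromptParts) -> float:
    """Ratio of assembled prompt tokens to the tokens the user typed."""
    user_tokens = max(1, approx_tokens(parts.user_input))
    return approx_tokens(assemble_prompt(parts)) / user_tokens
```

Most of the assembled prompt comes from rules, retrieved documents, tool definitions, and history, all of which have to be fetched from memory or storage before the request can be sent.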

The presentation then decomposed “time to first token,” a key metric for LLM responsiveness, into several components: prompt assembly, network transmission, queuing delays at the LLM provider, and the GPU-intensive prefill and decode phases. Storage directly influences assembly time by supplying context quickly, and it significantly affects the prefill phase. By caching stable portions of the LLM’s context, storage can replace computationally expensive O(n^2) GPU recomputation with much faster O(n) reads, freeing up valuable GPU resources and reducing power consumption. This makes high-performance storage essential for maintaining low latency and efficient inference.
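The decomposition and the O(n^2) versus O(n) argument can be sketched as a toy cost model. The function names, the component list, and the "token pair" unit of work are assumptions made for illustration. The model counts only causal attention pairs and ignores all other costs, so it shows the scaling behavior rather than real latencies.

```python
def time_to_first_token(assembly_s: float, network_s: float, queue_s: float,
                        prefill_s: float, first_decode_s: float) -> float:
    """Sum the latency components named in the text (all values in seconds)."""
    return assembly_s + network_s + queue_s + prefill_s + first_decode_s


def prefill_attention_pairs(n_tokens: int, cached_tokens: int = 0) -> int:
    """Causal attention work, counted as (query, key) token pairs.

    With no cache, token i attends to i earlier-or-equal tokens, giving
    n(n+1)/2 pairs in total, which grows as O(n^2). If the first
    `cached_tokens` already have their state stored, only the remaining
    tokens need their pairs computed.
    """
    if not 0 <= cached_tokens <= n_tokens:
        raise ValueError("cached_tokens must be between 0 and n_tokens")
    full = n_tokens * (n_tokens + 1) // 2
    already_done = cached_tokens * (cached_tokens + 1) // 2
    return full - already_done


def cache_read_units(cached_tokens: int) -> int:
    """Reading cached state back is linear in the number of cached tokens, O(n)."""
    return cached_tokens
```

Under this model, a 10,000-token prompt with a 9,000-token cached prefix needs `prefill_attention_pairs(10_000, 9_000)` pairs of computation plus 9,000 units of reads, instead of `prefill_attention_pairs(10_000)` pairs recomputed from scratch.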

Solidigm identified two critical workloads that demand specific storage characteristics for optimal AI performance. Retrieval-Augmented Generation (RAG) involves frequent, small, random reads from vector databases at low queue depths but high concurrency, necessitating SSDs with excellent P99 or P99.9 tail latency. KV cache offload, on the other hand, involves large-block reads and is bandwidth-oriented; it is crucial for efficiently retrieving cached context states that spill from HBM as context lengths grow. Session data quickly scales from gigabytes to terabytes per fleet. The presentation concluded by stressing that procurement decisions should be based on these specific workload demands rather than general high-queue-depth performance metrics, and it advocated tiered storage solutions tailored to each workload’s unique needs.
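To give a sense of how KV cache state grows from gigabytes to terabytes, here is a back-of-the-envelope sizing sketch. It uses the standard per-token KV cache size for a transformer (keys plus values, per layer, per KV head). The model dimensions in the usage note are hypothetical and are not figures from the presentation.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   bytes_per_value: int, context_tokens: int) -> int:
    """KV cache size for one session.

    Each token stores one key and one value vector (factor of 2) for
    every layer and every KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


def fleet_kv_cache_bytes(session_bytes: int, concurrent_sessions: int) -> int:
    """Total KV cache state if every session keeps its full context cached."""
    return session_bytes * concurrent_sessions


GIB = 2 ** 30
TIB = 2 ** 40
```

For a hypothetical model with 32 layers, 8 KV heads, a head dimension of 128, and 2-byte values, `kv_cache_bytes(32, 8, 128, 2, 32_768)` comes to 4 GiB for a single 32,768-token session, and 1,024 such sessions come to 4 TiB. Once that state no longer fits in HBM, fetching it back becomes a large-block, bandwidth-bound read workload.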

Personnel: Kapil Karkra, Scott Shadley

