Manzur Rahman and Ace Stryker presented for Solidigm at AI Data Infrastructure Field Day 1.
Presentation date: October 3, 2024, 10:30-12:30
Presenters: Ace Stryker, Manzur Rahman
How Data Infrastructure Improves or Impedes AI Value Creation with Solidigm
Watch on YouTube
Watch on Vimeo
Ace Stryker from Solidigm presented on the critical role of data infrastructure in AI value creation, emphasizing the importance of quality and quantity in training data. He illustrated this with an AI-generated image of a hand with an incorrect number of fingers, highlighting the limitations of AI models that lack intrinsic understanding of the objects they depict. This example underscored the necessity for high-quality training data to improve AI model outputs. Stryker explained that AI models predict desired outputs based on training data, which often lacks comprehensive information about the objects, leading to errors. He stressed that these challenges are not unique to image generation but are prevalent across various AI applications, where data variety, low error margins, and limited training data pose significant hurdles.
Stryker outlined the AI data pipeline, breaking it down into five stages: data ingestion, data preparation, model development, inference, and archiving. He detailed the specific data and performance requirements at each stage, noting that the volume of data decreases as it moves through the pipeline, while the type of I/O operations varies. For instance, data ingestion involves large sequential writes to object storage, while model training requires random reads from high-performance storage. He also discussed the importance of checkpointing during model training to prevent data loss and ensure efficient recovery. Stryker highlighted the growing trend of distributing AI workloads across core data centers, regional data centers, and edge servers, driven by the need for faster processing, data security, and reduced data transfer costs.
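Checkpointing is the most storage-bound of these patterns, so it is worth making concrete. The following is a minimal sketch assuming a PyTorch-style training loop; the directory path, interval, and function names are illustrative and not from the presentation:

```python
import os
import torch

CHECKPOINT_DIR = "/mnt/nvme/checkpoints"  # hypothetical path on high-performance storage
CHECKPOINT_EVERY = 1000                   # illustrative interval, in training steps

def save_checkpoint(model, optimizer, step):
    """Write a large, mostly sequential checkpoint so training can resume after a failure."""
    path = os.path.join(CHECKPOINT_DIR, f"step_{step:08d}.pt")
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )

def load_latest_checkpoint(model, optimizer):
    """Resume from the most recent checkpoint, or start fresh if none exists."""
    files = sorted(os.listdir(CHECKPOINT_DIR)) if os.path.isdir(CHECKPOINT_DIR) else []
    if not files:
        return 0  # no checkpoint yet; start at step 0
    state = torch.load(os.path.join(CHECKPOINT_DIR, files[-1]))
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```

The pattern explains why checkpoint frequency is a storage-sizing input: each save is a burst of sequential writes proportional to model and optimizer state, and recovery time depends on how fast the latest checkpoint can be read back.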
The presentation also addressed the challenges and opportunities of deploying AI at the edge. Stryker noted that edge environments often have lower power budgets, space constraints, and higher serviceability requirements compared to core data centers. He provided examples of edge deployments, such as medical imaging in hospitals and autonomous driving solutions, where high-density storage solutions like QLC SSDs are used to enhance data collection and processing. Stryker emphasized the need for storage vendors to adapt to these evolving requirements, ensuring that their products can meet the demands of both core and edge AI applications. The session concluded with a discussion on Solidigm’s product portfolio and how their SSDs are designed to optimize performance, energy efficiency, and cost in AI deployments.
Personnel: Ace Stryker
Optimizing Data Center TCO: An In-Depth Analysis and Sensitivity Study with Solidigm
Watch on YouTube
Watch on Vimeo
Manzur Rahman from Solidigm presented an in-depth analysis of Total Cost of Ownership (TCO) for data centers, emphasizing its growing importance in the AI era. TCO encompasses acquisition, operation, and maintenance costs, and is crucial for evaluating cost-effective, high-performance hardware like GPUs, storage, and AI chips. Rahman highlighted the need for energy-efficient solutions and the importance of right-sizing storage to avoid over- or under-provisioning. He explained that TCO includes both direct costs (materials, labor, energy) and indirect costs (overheads, cooling, carbon tax), and that Solidigm's model normalizes everything to a single metric: cost per effective terabyte per month per rack.
Rahman detailed Solidigm’s TCO model, which incorporates dynamic variables such as hardware configuration, drive replacement cycles, and workload mixes. The model also factors in the time value of money, maintenance, disposal costs, and greenhouse gas taxes. By comparing HDD and SSD racks under various scenarios, Solidigm found that SSDs can offer significant TCO benefits, especially when variables like replacement cycles, capacity utilization, and data compression are optimized. For instance, extending the SSD replacement cycle from five to seven years can improve TCO by 22%, and increasing capacity utilization can lead to a 67% improvement.
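Solidigm did not share the model's internals, but a toy version of the normalization Rahman described shows how the variables interact. All numeric inputs below are placeholders, not figures from the talk; only the direction of the sensitivity (longer replacement cycles and higher utilization lower the normalized cost) reflects the presentation:

```python
def tco_per_effective_tb_month(
    capex_per_tb,          # acquisition cost, $ per raw TB (placeholder)
    power_watts_per_tb,    # operational power draw, W per raw TB (placeholder)
    electricity_cost_kwh,  # $ per kWh, including cooling overhead (placeholder)
    utilization,           # fraction of raw capacity actually used, 0-1
    compression_ratio,     # effective data stored per raw TB
    lifetime_years,        # drive replacement cycle
):
    """Toy normalization: total lifetime cost per *effective* TB per month."""
    months = lifetime_years * 12
    energy_cost = (power_watts_per_tb / 1000) * 24 * 365 * lifetime_years * electricity_cost_kwh
    effective_tb = utilization * compression_ratio  # effective capacity per raw TB
    return (capex_per_tb + energy_cost) / (effective_tb * months)

# Placeholder comparison: extending the replacement cycle from five to seven
# years lowers the normalized cost, the direction of Rahman's 22% finding.
base = tco_per_effective_tb_month(30, 1.0, 0.12, 0.6, 1.5, 5)
longer_cycle = tco_per_effective_tb_month(30, 1.0, 0.12, 0.6, 1.5, 7)
print(f"5-year cycle: ${base:.2f}/TBe/month, 7-year cycle: ${longer_cycle:.2f}/TBe/month")
```

Because both capacity utilization and compression multiply the denominator, small gains in either compound quickly, which is consistent with utilization being the largest lever in the sensitivity study.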
The presentation concluded with a sensitivity analysis showing that high-density QLC SSDs can significantly reduce TCO compared to HDDs. Even with higher upfront costs, the overall TCO is lower due to better performance, longer replacement cycles, and higher capacity utilization. Rahman projected that high-density QLC SSDs will continue to offer TCO improvements in the coming years, making them a promising solution for data centers, particularly in AI environments. The analysis demonstrated that while CAPEX for SSDs is higher, the overall cost per terabyte effective is lower, making SSDs a cost-effective choice for future data center deployments.
Personnel: Manzur Rahman
The Energy Crunch Is Not a Future Problem with Solidigm
Watch on YouTube
Watch on Vimeo
In the presentation by Solidigm at AI Data Infrastructure Field Day 1, Manzur Rahman emphasized the critical issue of energy consumption in AI and data infrastructure. He referenced quotes from industry leaders like Sam Altman and Mark Zuckerberg, highlighting the significant challenge energy poses in scaling AI operations. Rahman discussed findings from white papers by Meta and Microsoft Azure, which revealed that a substantial portion of data center energy consumption is attributable to hard disk drives (HDDs): 35% of total operational energy in Meta's AI recommendation infrastructure and 33% in Microsoft Azure's cloud services. This underscores the need for more energy-efficient storage solutions to manage growing data demands.
Rahman then explored various use cases and the increasing need for network-attached storage (NAS) in AI applications. He noted that data is growing exponentially, with different modalities like text, audio, and video contributing to the data deluge. For instance, hyper-scale large language models (LLMs) and large video models (LVMs) require massive amounts of storage, ranging from 1.3 petabytes to 32 petabytes per GPU rack. The trend towards synthetic data and data repatriation is further driving the demand for NAS. Solidigm’s model for a 50-megawatt data center demonstrated that using QLC (Quad-Level Cell) storage instead of traditional HDDs and TLC (Triple-Level Cell) storage could significantly reduce energy consumption and increase the number of GPU racks that can be supported.
The presentation concluded with a comparison of different storage configurations, showing that QLC storage offers substantial energy savings and space efficiency. For example, a DGX H100 rack with QLC storage consumed only 6.9 kilowatts compared to 32 kilowatts for a setup with TLC and HDDs. This translates to 4x fewer storage racks, 80% less storage power, and 50% more DGX plus NAS racks in a 50-megawatt data center. Rahman also addressed concerns about heat generation and longevity, noting that while QLC may generate more heat and have fewer P/E cycles than TLC, the overall energy efficiency and performance benefits make it a viable solution for modern data centers. Solidigm's high-density drives, such as the D7-P5520 and the QLC-based D5-P5430, were highlighted as effective in reducing rack space and power consumption, further supporting the case for transitioning to more energy-efficient storage technologies.
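The quoted rack figures can be sanity-checked with simple arithmetic. In the sketch below, only the 6.9 kW and 32 kW numbers come from the presentation; the percentage is derived:

```python
# Storage power per DGX H100 rack, as quoted in the presentation.
tlc_plus_hdd_kw = 32.0
qlc_kw = 6.9

power_savings = 1 - qlc_kw / tlc_plus_hdd_kw
print(f"Storage power reduction: {power_savings:.0%}")  # ~78%, in line with the ~80% cited

# In a fixed 50 MW envelope, the watts freed up on storage can be reallocated
# to compute, which is how the "50% more DGX plus NAS racks" figure arises.
```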
Personnel: Manzur Rahman
Optimizing Storage for AI Workloads with Solidigm
Watch on YouTube
Watch on Vimeo
In this presentation, Ace Stryker from Solidigm discusses the company's unique value proposition in the AI data infrastructure market, focusing on their high-density QLC SSDs and the recently announced Gen 5 TLC SSDs. He emphasizes the importance of selecting the right storage architecture for different phases of the AI pipeline, from data ingestion to archiving. Solidigm's QLC SSDs, with their high density and power efficiency, are recommended for the beginning and end of the pipeline, where large volumes of unstructured data are handled. For the middle stages, where performance is critical, Solidigm offers the D7-PS1010 Gen 5 TLC SSD, which boasts impressive sequential and random read performance, making it ideal for keeping GPUs maximally utilized.
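A rough sketch of that stage-to-drive mapping, expressed as a lookup table; the I/O characterizations and QLC/TLC assignments paraphrase the talk, but the structure and exact model pairings are an illustration, not a Solidigm artifact:

```python
# Illustrative mapping of AI pipeline stages to the storage traits and
# Solidigm product families discussed in the presentation.
PIPELINE_STORAGE = {
    "ingestion":   {"io": "large sequential writes",            "drive": "high-density QLC (e.g., D5-P5430)"},
    "preparation": {"io": "mixed reads and writes",             "drive": "high-density QLC"},
    "training":    {"io": "random reads + checkpoint writes",   "drive": "Gen 5 TLC (D7-PS1010)"},
    "inference":   {"io": "low-latency reads",                  "drive": "Gen 5 TLC (D7-PS1010)"},
    "archiving":   {"io": "sequential writes, cold reads",      "drive": "high-density QLC"},
}

for stage, spec in PIPELINE_STORAGE.items():
    print(f"{stage:12s} {spec['io']:34s} -> {spec['drive']}")
```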
The presentation highlights the flexibility of Solidigm's product portfolio, which allows customers to optimize for various goals, whether power efficiency, GPU utilization, or overall performance. The Gen 5 TLC SSD, the D7-PS1010, is positioned as the performance leader, capable of delivering 14.5 gigabytes per second of sequential read throughput. Additionally, Solidigm offers other options, such as the D7-P5520 and D5-P5430 drives, catering to different performance and endurance needs. The discussion also touches on the efficiency of these drives, with Solidigm's products outperforming competitors in various AI workloads, as demonstrated by MLCommons MLPerf Storage benchmark results.
A notable case study presented is the collaboration with the Zoological Society of London to conserve urban hedgehogs. Solidigm’s high-density QLC SSDs are used in an edge data center at the zoo, enabling efficient processing and analysis of millions of images captured by motion-activated cameras. This setup allows the organization to assess hedgehog populations and make informed conservation decisions. The presentation concludes by emphasizing the importance of efficient data infrastructure in AI applications and Solidigm’s commitment to delivering high-density, power-efficient storage solutions that meet the evolving needs of AI workloads.
Personnel: Ace Stryker