This presentation was recorded on February 22, 2024, from 9:45 to 10:45.
Presenter: Earl Ruby
AI without GPUs: Using Intel AMX CPUs on VMware vSphere for LLMs
Watch on YouTube
Watch on Vimeo
Looking to deploy AI models using your existing data center investments? VMware and Intel have collaborated to announce VMware Private AI with Intel, which helps enterprises build and deploy private, secure AI models running on VMware Cloud Foundation and boosts AI performance by harnessing Intel’s AI software suite and 4th Generation Intel® Xeon® Scalable Processors with built-in accelerators. In this session we’ll explain the technology behind AMX CPUs and demonstrate LLMs running on AMX CPUs.
Earl Ruby, R&D Engineer at VMware by Broadcom, discusses running AI workloads without GPUs by using CPUs instead. He describes VMware Private AI with Intel, a collaboration that enables enterprises to build and deploy private AI models on-premises using VMware Cloud Foundation and Intel’s AI software suite along with 4th Generation Intel Xeon Scalable Processors with built-in accelerators.
Ruby highlights the benefits of Private AI, including data privacy, intellectual property protection, and the use of established security tools in a vSphere environment. He explains the technology behind Intel’s Advanced Matrix Extensions (AMX) and how they accelerate AI/ML workloads without separate GPU accelerators. AMX units are built into every core of Intel’s Sapphire Rapids and Emerald Rapids Xeon CPUs, allowing AI and non-AI workloads to run side by side in a virtualized environment. Ruby demonstrates Large Language Models (LLMs) running on AMX CPUs compared to older CPUs without AMX, showing a significant improvement in speed and efficiency.
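As a concrete illustration of how a framework can exercise AMX without any GPU, here is a minimal, hedged sketch using PyTorch’s CPU autocast: on Sapphire Rapids or Emerald Rapids hardware with a recent PyTorch build, the oneDNN backend can dispatch bfloat16 matrix multiplications to AMX tile instructions automatically. The shapes are illustrative, not taken from the talk.

```python
import torch

# On 4th/5th-gen Xeon CPUs, PyTorch's oneDNN backend can route
# bfloat16 matmuls to AMX tile instructions when they are available.
torch.manual_seed(0)
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# CPU autocast casts eligible ops to bfloat16, the dtype AMX accelerates
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)  # torch.bfloat16 -- computed via AMX when the CPU supports it
```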
He also discusses the operational considerations when choosing between CPU and GPU for AI workloads, emphasizing that CPUs should be used when their performance is sufficient and cost or power consumption is a concern, while GPUs should be used for high-performance needs, especially when low latency or frequent fine-tuning of large models is required.
Personnel: Earl Ruby
AI without GPUs: Using Intel AMX CPUs on VMware vSphere with Tanzu Kubernetes
Watch on YouTube
Watch on Vimeo
Looking to deploy AI models using your existing data center investments? VMware and Intel have collaborated to announce VMware Private AI with Intel, which helps enterprises build and deploy private, secure AI models running on VMware Cloud Foundation and boosts AI performance by harnessing Intel’s AI software suite and 4th Generation Intel® Xeon® Scalable Processors with built-in accelerators. In this session we’ll explain how to set up Tanzu Kubernetes to run AI/ML workloads that utilize AMX CPUs.
Earl Ruby, R&D Engineer at VMware by Broadcom, discussed deploying AI models without GPUs, focusing on the use of Intel AMX CPUs with Tanzu Kubernetes on vSphere. He covered the benefits of AMX, an AI accelerator built into Intel’s Sapphire Rapids and Emerald Rapids Xeon CPUs, which can run AI workloads without separate GPU accelerators. vSphere 8 supports AMX, and many ML frameworks are already optimized for Intel CPUs.
He demonstrated video processing with OpenVINO on vSphere 8, showing real-time processing with high frame rates on a VM with limited resources and no GPUs. This demonstration highlighted the power of AMX and OpenVINO’s model compression, which reduces memory and compute requirements.
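For readers who want to try something similar, the following is a minimal, hedged OpenVINO inference sketch targeting the CPU device. The model path and input shape are placeholders, and the bf16 precision hint (which lets the CPU plugin use AMX where available) is an assumption about configuration rather than the exact settings used in the demo.

```python
import numpy as np
from openvino.runtime import Core

core = Core()

# Placeholder model path -- substitute a real OpenVINO IR model
model = core.read_model("model.xml")

# The bf16 precision hint allows the CPU plugin to use AMX when present
compiled = core.compile_model(model, "CPU", {"INFERENCE_PRECISION_HINT": "bf16"})

# Placeholder input matching the model's expected NCHW shape
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([frame])[compiled.output(0)]
print(result.shape)
```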
For deploying AMX-powered workloads on Kubernetes, Earl explained that Tanzu is VMware’s Kubernetes distribution optimized for vSphere, with lifecycle management tools, storage, networking, and high availability features. He detailed the requirements for making AMX work on vSphere: hardware with Sapphire Rapids or Emerald Rapids CPUs, Linux kernel 5.16 or later in the guest, and virtual hardware version 20 so that AMX instructions are exposed to the VM.
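A quick way to confirm the guest-OS side of these requirements is to check the kernel version and the CPU feature flags exposed to the VM. This hedged sketch assumes a Linux guest; the amx_tile, amx_bf16, and amx_int8 flags only appear when the host CPU has AMX and the VM uses hardware version 20 or later.

```python
import platform

# Requirement: Linux kernel 5.16 or later (first kernel with AMX state support)
major, minor = (int(x) for x in platform.release().split(".")[:2])
print("kernel ok:", (major, minor) >= (5, 16))

# Requirement: AMX feature flags visible in the guest (needs Sapphire Rapids or
# Emerald Rapids hardware plus VM hardware version 20+)
with open("/proc/cpuinfo") as f:
    flags = f.read()
print("amx ok:", all(f in flags for f in ("amx_tile", "amx_bf16", "amx_int8")))
```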
Earl provided a guide for setting up Tanzu to use AMX, including adding a content library with the correct Tanzu Kubernetes releases (TKRs) and creating a new VM class. He showed how to create a cluster definition file for Tanzu Kubernetes clusters that specifies the use of the HWE kernel TKR and the AMX VM class for worker nodes.
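The sketch below suggests what such a cluster definition might look like, built as a Python dict and dumped to YAML. The API version and field layout follow Tanzu’s TanzuKubernetesCluster schema, but the TKR name, VM class name, and storage class are illustrative placeholders, not the exact file shown in the session.

```python
import yaml  # pip install pyyaml

# Placeholder names -- substitute the HWE-kernel TKR from your content library,
# the AMX VM class you created, and your storage class
HWE_TKR = "<hwe-kernel-tkr-name>"
cluster = {
    "apiVersion": "run.tanzu.vmware.com/v1alpha3",
    "kind": "TanzuKubernetesCluster",
    "metadata": {"name": "amx-cluster", "namespace": "ai-workloads"},
    "spec": {
        "topology": {
            "controlPlane": {
                "replicas": 3,
                "vmClass": "best-effort-medium",
                "storageClass": "<storage-class>",
                "tkr": {"reference": {"name": HWE_TKR}},
            },
            "nodePools": [
                {
                    # Worker nodes get the HWE-kernel TKR and the AMX VM class
                    "name": "amx-workers",
                    "replicas": 3,
                    "vmClass": "<amx-vm-class>",
                    "storageClass": "<storage-class>",
                    "tkr": {"reference": {"name": HWE_TKR}},
                }
            ],
        }
    },
}

print(yaml.safe_dump(cluster, sort_keys=False))
```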
Finally, he presented performance results for Llama 2 7B inference running on a single fourth-generation Xeon CPU, demonstrating an average latency under 100 milliseconds, which is suitable for chatbot response times.
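To put the latency figure in context, here is a hedged sketch of how one might measure average per-token generation latency with Hugging Face Transformers in bfloat16 on the CPU. The gated meta-llama/Llama-2-7b-hf checkpoint is an assumption standing in for whatever build was benchmarked in the session, and real measurements would need warm-up runs and controlled thread counts.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated checkpoint; assumes you have been granted access on the Hugging Face Hub
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 weights let oneDNN route matmuls to AMX on 4th-gen Xeon CPUs
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tokenizer("Explain Intel AMX in one sentence.", return_tensors="pt")
new_tokens = 64

start = time.perf_counter()
with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

print(f"average per-token latency: {1000 * elapsed / new_tokens:.1f} ms")
```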
Personnel: Earl Ruby