This video is part of the appearance, “VMware by Broadcom Presents Private AI with Intel at AI Field Day 4”. It was recorded as part of AI Field Day 4 from 9:45 to 10:45 on February 22, 2024.
Watch on YouTube
Watch on Vimeo
Looking to deploy AI models using your existing data center investments? VMware and Intel have collaborated to announce VMware Private AI with Intel, which helps enterprises build and deploy private, secure AI models running on VMware Cloud Foundation while boosting AI performance with Intel’s AI software suite and 4th Generation Intel® Xeon® Scalable Processors with built-in accelerators. In this session, we’ll explain how to set up Tanzu Kubernetes to run AI/ML workloads that use Intel Advanced Matrix Extensions (AMX).
Earl Ruby, R&D engineer at VMware by Broadcom, presented on deploying AI models without GPUs, focusing on using Intel AMX with Tanzu Kubernetes on vSphere. He discussed the benefits of AMX, an AI accelerator built into Intel’s Sapphire Rapids and Emerald Rapids Xeon CPUs that can run AI workloads without separate GPU accelerators. vSphere 8 supports AMX, and many ML frameworks are already optimized for Intel CPUs.
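A quick way to confirm that a guest VM actually sees AMX is to check the CPU feature flags the Linux kernel reports. This minimal Python sketch (an illustration, not from the presentation) looks for the amx_tile, amx_bf16, and amx_int8 flags in /proc/cpuinfo:

```python
# Minimal sketch (not from the presentation): check whether the Linux
# guest kernel reports Intel AMX CPU feature flags. On a properly
# configured VM (Sapphire Rapids / Emerald Rapids host, hardware
# version 20, kernel 5.16+) these flags should appear in /proc/cpuinfo.
AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break
    else:
        flags = set()

print("AMX available:", AMX_FLAGS <= flags)
if AMX_FLAGS - flags:
    print("Missing flags:", ", ".join(sorted(AMX_FLAGS - flags)))
```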
He demonstrated video processing with OpenVINO on vSphere 8, showing real-time processing with high frame rates on a VM with limited resources and no GPUs. This demonstration highlighted the power of AMX and OpenVINO’s model compression, which reduces memory and compute requirements.
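As a rough sketch of how such a workload engages AMX, the snippet below uses OpenVINO’s Python API to compile a model for the CPU with a bf16 precision hint; on AMX-capable Xeons the runtime can route that bf16 matrix math through the accelerator. The model path and dummy input are placeholders, and this is not Earl’s demo code:

```python
# Hypothetical OpenVINO inference sketch, not the demo shown in the
# session. "model.xml" is a placeholder path to an OpenVINO IR model.
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder IR file

# Ask the CPU plugin for bf16 precision; on AMX-capable Xeons this
# lets the runtime execute the bf16 matrix math on the AMX tiles.
compiled = core.compile_model(model, "CPU",
                              {"INFERENCE_PRECISION_HINT": "bf16"})

# Build a dummy input matching the model's first input
# (assumes a static input shape for simplicity).
shape = tuple(compiled.input(0).shape)
dummy = np.zeros(shape, dtype=np.float32)

result = compiled.create_infer_request().infer([dummy])
print(result[compiled.output(0)].shape)
```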
For deploying AMX-powered workloads on Kubernetes, Earl explained that Tanzu is VMware’s Kubernetes distribution optimized for vSphere, with lifecycle management tooling, storage, networking, and high-availability features. He then detailed the requirements for making AMX work on vSphere: hosts with Sapphire Rapids or Emerald Rapids Xeon CPUs, a guest OS running Linux kernel 5.16 or later, and VMs at virtual hardware version 20, which is what exposes the virtualized AMX instructions to the guest.
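VM classes surface in a vSphere with Tanzu Supervisor as VirtualMachineClass objects, so one way to picture the hardware-version requirement is as a manifest. The Python sketch below simply writes one out; the class name and sizings are invented, and the configSpec stanza is an assumption about how virtual hardware version 20 might be pinned, not a recipe confirmed in the session:

```python
# Hypothetical sketch: emit a VirtualMachineClass manifest for an
# AMX-capable worker node class. All names and sizes are placeholders;
# the configSpec fields are an ASSUMPTION about pinning virtual
# hardware version 20, not confirmed syntax from the session.
vm_class_yaml = """\
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: amx-xlarge                    # placeholder class name
spec:
  hardware:
    cpus: 8
    memory: 32Gi
  configSpec:                         # assumption: pin HW version 20
    _typeName: VirtualMachineConfigSpec
    version: vmx-20
"""

with open("amx-vmclass.yaml", "w") as f:
    f.write(vm_class_yaml)
print("Apply with: kubectl apply -f amx-vmclass.yaml")
```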
Earl provided a guide for setting up Tanzu to use AMX, including adding a content library with the correct Tanzu Kubernetes releases (TKRs) and creating a new VM class. He showed how to create a cluster definition file for Tanzu Kubernetes clusters that specifies the use of the HWE kernel TKR and the AMX VM class for worker nodes.
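Putting the pieces together, a cluster definition along these lines would select the HWE-kernel TKR and the AMX VM class for the worker node pool. This sketch writes out a v1alpha3 TanzuKubernetesCluster manifest; the cluster name, namespace, storage class, and TKR version string are placeholders to substitute with values from your own content library:

```python
# Hedged sketch of a Tanzu Kubernetes cluster definition that pins an
# HWE-kernel TKR and an AMX VM class for workers. Every name below
# (cluster, namespace, storage class, TKR version, VM class) is a
# placeholder, not a value from the presentation.
cluster_yaml = """\
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: amx-demo-cluster              # placeholder
  namespace: demo-namespace           # placeholder vSphere namespace
spec:
  topology:
    controlPlane:
      replicas: 1
      vmClass: best-effort-large      # control plane does not need AMX
      storageClass: demo-storage      # placeholder
      tkr:
        reference:
          name: v1.26.5---vmware.2-tkg.1     # placeholder HWE-kernel TKR
    nodePools:
      - name: amx-workers
        replicas: 3
        vmClass: amx-xlarge           # AMX VM class from the step above
        storageClass: demo-storage    # placeholder
        tkr:
          reference:
            name: v1.26.5---vmware.2-tkg.1   # placeholder HWE-kernel TKR
"""

with open("amx-cluster.yaml", "w") as f:
    f.write(cluster_yaml)
print("Apply with: kubectl apply -f amx-cluster.yaml")
```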
Finally, he presented performance results for inference with the Llama 2 7B (seven-billion-parameter) LLM running on a single fourth-generation Xeon CPU, showing average latency under 100 milliseconds, which is suitable for chatbot response times.
Personnel: Earl Ruby