Looking to deploy AI models using your existing data center investments? VMware and Intel have collaborated on VMware Private AI with Intel, which helps enterprises build and deploy private, secure AI models on VMware Cloud Foundation and boosts AI performance by harnessing Intel’s AI software suite and 4th Generation Intel® Xeon® Scalable Processors with built-in accelerators. In this session we explain how to set up Tanzu Kubernetes to run AI/ML workloads that use Intel AMX.
Earl Ruby, R&D engineer at VMware by Broadcom, presented on deploying AI models without GPUs, focusing on the use of Intel AMX with Tanzu Kubernetes on vSphere. He discussed the benefits of AMX (Advanced Matrix Extensions), an AI accelerator built into Intel’s Sapphire Rapids and Emerald Rapids Xeon CPUs that can run AI workloads without separate GPU accelerators. vSphere 8 supports AMX, and many ML frameworks are already optimized for Intel CPUs.
He demonstrated video processing with OpenVINO on vSphere 8, showing real-time processing with high frame rates on a VM with limited resources and no GPUs. This demonstration highlighted the power of AMX and OpenVINO’s model compression, which reduces memory and compute requirements.
For deploying AMX-powered workloads on Kubernetes, Earl explained that Tanzu is VMware’s Kubernetes distribution optimized for vSphere, with lifecycle management tools, storage, networking, and high availability features. He detailed the requirements for making AMX work on vSphere: hardware with Sapphire Rapids or Emerald Rapids CPUs, a guest running Linux kernel 5.16 or later, and virtual hardware version 20 so the AMX instructions are virtualized.
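The guest-side prerequisites can be sketched as a quick check from inside a VM. This is a minimal sketch, not from the session: the `version_ge` helper is our own, and the `amx_tile` flag check assumes a Linux guest that exposes AMX support in `/proc/cpuinfo`.

```shell
# Hypothetical helper: succeed if dotted version $1 >= $2 (e.g. kernel 5.16 minimum)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Inside the guest you might combine this with live values, e.g.:
#   version_ge "$(uname -r | cut -d- -f1)" 5.16 && echo "kernel OK"
#   grep -q amx_tile /proc/cpuinfo && echo "AMX exposed to guest"

# Demonstrate the comparison logic with fixed inputs:
version_ge 5.16 5.16 && echo "5.16 meets minimum"
version_ge 5.4 5.16 || echo "5.4 is too old"
```

Using `sort -V` keeps the comparison correct for multi-digit components (5.4 vs. 5.16), which a plain string comparison would get wrong.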
Earl provided a guide for setting up Tanzu to use AMX, including adding a content library with the correct Tanzu Kubernetes releases (TKRs) and creating a new VM class. He showed how to create a cluster definition file for Tanzu Kubernetes clusters that specifies the use of the HWE kernel TKR and the AMX VM class for worker nodes.
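A cluster definition along those lines might look like the following sketch, using the `TanzuKubernetesCluster` v1alpha3 API. The cluster name, namespace, storage class, TKR name, and the `amx-vm-class` VM class are placeholders, not values from the session; substitute the HWE-kernel TKR and the AMX VM class you created.

```yaml
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: amx-demo-cluster            # placeholder name
  namespace: demo-ns                # placeholder vSphere namespace
spec:
  topology:
    controlPlane:
      replicas: 1
      vmClass: best-effort-medium   # control plane does not need AMX
      storageClass: vsan-default-storage-policy   # placeholder
      tkr:
        reference:
          name: v1.26.5---vmware.2-tkg.1   # placeholder: a TKR with an HWE (5.16+) kernel
    nodePools:
      - name: amx-workers
        replicas: 3
        vmClass: amx-vm-class       # custom VM class using hardware version 20
        storageClass: vsan-default-storage-policy
        tkr:
          reference:
            name: v1.26.5---vmware.2-tkg.1
```

The key points are the per-node-pool `tkr.reference.name` selecting the HWE-kernel release and the worker `vmClass` pointing at the AMX-capable VM class, while the control plane can stay on a standard class.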
Finally, he presented performance results for Llama 2 7B LLM inference running on a single 4th Gen Xeon Scalable CPU, demonstrating that it could deliver inference with an average latency under 100 milliseconds, which is suitable for chatbot response times.
Personnel: Earl Ruby