Enfabrica’s presentation at AI Field Day 5, led by founder and CEO Rochan Sankar, covered the company’s approach to AI cluster scaling and reliability challenges. Sankar highlighted the benefits of Enfabrica’s Accelerated Compute Fabric SuperNIC (ACFS), which enables wide fabrics with fewer hops and thereby reduces GPU-to-GPU latency. Lower latency improves the performance of parallel workloads across GPUs, not just in training but also in other applications. According to Sankar, the ACFS provides a 32x multiplier in network ports, making it possible to connect up to 500,000 GPUs with two layers of switching rather than the traditional three. This streamlined architecture improves job performance and utilization, with a potential 50-60% savings in total cost of ownership (TCO) on the network side.
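The link between port count and switching layers can be illustrated with the standard capacity formula for a non-blocking folded Clos (fat-tree) network, in which a fabric built from switches of radix k supports at most 2(k/2)^t endpoints with t tiers. The sketch below is a generic calculation, not Enfabrica’s sizing method: the function names are hypothetical, and the radix values 128 and 1024 are chosen purely for illustration and do not come from the presentation.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoints in a non-blocking folded Clos of `tiers` layers.

    Standard fat-tree result: 2 * (radix / 2) ** tiers, i.e. radix for one
    switch, radix^2 / 2 for two tiers, radix^3 / 4 for three tiers.
    """
    if radix < 2 or radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be an even integer >= 2 and tiers >= 1")
    return 2 * (radix // 2) ** tiers


def min_tiers(endpoints: int, radix: int) -> int:
    """Smallest number of switching tiers that can attach `endpoints` ports."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


def worst_case_switch_hops(tiers: int) -> int:
    """Switches traversed on the longest path: up to the top tier and back down."""
    return 2 * tiers - 1


if __name__ == "__main__":
    target = 500_000  # cluster size quoted in the summary above
    for radix in (128, 1024):  # illustrative radices only (assumption)
        tiers = min_tiers(target, radix)
        print(
            f"radix={radix}: {tiers} tiers, "
            f"capacity={max_endpoints(radix, tiers)}, "
            f"worst-case switch hops={worst_case_switch_hops(tiers)}"
        )
```

Because capacity grows with the radix raised to the number of tiers, a large enough increase in effective port count removes a whole tier, which also shortens the longest path from five switch hops to three under these assumptions.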
Sankar also discussed the resiliency improvements brought by the multi-planar switch fabric, which ensures that every GPU or connected element can multipath out in case of failures. This hardware-based failover mechanism reroutes traffic immediately and without loss, while software optimizations keep the load balanced across the remaining paths. The presentation emphasized the importance of this resiliency, especially as AI clusters scale and the network’s reliability becomes increasingly critical. Enfabrica’s approach addresses the challenges posed by optical connections and their high failure rates, so that GPU operations remain unaffected by individual component failures and overall system performance and reliability are maintained.
In the context of AI inference and retrieval-augmented generation (RAG), Sankar explained how the ACFS can provide massive bandwidth to both accelerators and memory, creating a memory area network with microsecond access times. This architecture supports a tiered cache-driven approach, optimizing the use of expensive memory resources like HBM. By leveraging cheaper memory options and shared memory elements, Enfabrica’s solution can significantly enhance the efficiency and scalability of AI inference workloads. The presentation concluded with a summary of the ACFS’s capabilities, including high throughput, programmatic control of the fabric, and substantial power savings, positioning it as a critical component for next-generation data centers and large-scale AI deployments.
Personnel: Rochan Sankar