Monitoring and managing complex AI infrastructure requires moving beyond traditional networking tools that treat the environment as a black box. Praful Bhaidasna explains that the industry has long suffered from a “mean time to truth” problem: network operators are blamed for issues they cannot properly diagnose because they lack visibility into what is connected to the network. Arista aims to replace this “Stone Age” approach by evolving from simple monitoring to 360-degree observability. The strategy centers on CloudVision, a NetOps platform that uses a common network data lake called NetDL to aggregate high-fidelity streaming telemetry from every Arista device across the data center, campus, and WAN.
The architecture relies on the fact that Arista’s EOS provides consistent, reliable state data, ranging from MAC address tables and routing updates to microburst signals and configuration changes. This information is stored in a time-series database, allowing operators to travel back in time to compare network states before and after an incident. To manage the resulting deluge of data, Arista employs an AI/ML engine known as AVA, or Autonomous Virtual Assist. AVA identifies patterns and anomalies, filtering out the noise to show only the relevant signals. This allows human operators to focus on making informed decisions rather than spending hours manually correlating events across different silos.
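The before-and-after comparison described above can be illustrated with a small sketch. This is not CloudVision's or NetDL's API; `StateHistory`, `diff_states`, and the key names are hypothetical, and the sketch assumes each state key receives timestamped updates in time order, with `None` marking a deletion.

```python
from bisect import bisect_right


class StateHistory:
    """Toy time-series store (hypothetical): one update history per state key."""

    def __init__(self):
        # key -> (list of timestamps, list of values), both in arrival order
        self._history = {}

    def record(self, key, timestamp, value):
        """Append an update. Assumes updates for a key arrive in time order."""
        times, values = self._history.setdefault(key, ([], []))
        times.append(timestamp)
        values.append(value)

    def snapshot(self, timestamp):
        """Return the state as of `timestamp`: the latest value at or before it."""
        snap = {}
        for key, (times, values) in self._history.items():
            i = bisect_right(times, timestamp)
            # Skip keys with no update yet, or whose latest update was a deletion
            if i and values[i - 1] is not None:
                snap[key] = values[i - 1]
        return snap


def diff_states(before, after):
    """Map each changed key to (old, new); None means absent in that snapshot."""
    changes = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


history = StateHistory()
history.record("route/10.0.0.0/24", 100, "via Ethernet1")
history.record("mac/aa:bb:cc:dd:ee:ff", 100, "Ethernet5")
history.record("route/10.0.0.0/24", 200, "via Ethernet2")  # route moved
history.record("mac/aa:bb:cc:dd:ee:ff", 250, None)         # MAC entry aged out

# Compare the network state before (t=150) and after (t=300) an incident
changes = diff_states(history.snapshot(150), history.snapshot(300))
```

Here `changes` reports the route moving from Ethernet1 to Ethernet2 and the MAC entry disappearing, which is the kind of delta an operator would want to see first when investigating an incident.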
Furthermore, CloudVision has opened its ecosystem to ingest data from third-party systems and AI job orchestrators, as well as compute and storage metrics via Prometheus. This integration is critical for AI environments, where a job stall could be caused by anything from a GPU failure to a NIC issue. Arista has introduced a dedicated AI jobs dashboard that correlates specific training jobs with the underlying flows, servers, and switches. To simplify interactions with this massive dataset, a virtual assistant allows users to query their infrastructure using natural language. This integrated approach is intended to keep expensive GPU resources from sitting idle and to bring the resolution of complex performance bottlenecks down from days to minutes.
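As a rough illustration of what correlating a training job with its servers and switches involves, the sketch below narrows a list of events to those reported by a stalled job's hosts or their attached switches shortly before the stall. It is a minimal sketch under stated assumptions, not Arista's implementation: the `Event` type, the `suspects_for_job` function, the host and switch names, and the 60-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical name of the host or switch that reported it
    timestamp: float  # seconds
    kind: str         # e.g. "gpu_error", "link_flap"


def suspects_for_job(job_hosts, host_to_switch, events, stall_time, window=60.0):
    """Return events from the job's hosts or their switches within
    `window` seconds before `stall_time`, oldest first."""
    hosts = set(job_hosts)
    # Switches attached to the job's hosts (hosts with no known switch are skipped)
    switches = {host_to_switch[h] for h in hosts if h in host_to_switch}
    relevant = hosts | switches
    return sorted(
        (
            e for e in events
            if e.source in relevant
            and stall_time - window <= e.timestamp <= stall_time
        ),
        key=lambda e: e.timestamp,
    )


events = [
    Event("leaf-1", 990.0, "link_flap"),
    Event("gpu-node-7", 975.0, "nic_error"),
    Event("leaf-9", 995.0, "link_flap"),       # switch not used by this job
    Event("gpu-node-3", 500.0, "gpu_error"),   # too old to explain the stall
]
host_to_switch = {"gpu-node-3": "leaf-1", "gpu-node-7": "leaf-2"}

suspects = suspects_for_job(
    ["gpu-node-3", "gpu-node-7"], host_to_switch, events, stall_time=1000.0
)
```

With this data, `suspects` contains only the NIC error on `gpu-node-7` and the link flap on `leaf-1`, in that order. The point of joining job placement with topology is that unrelated events elsewhere in the fabric drop out before a human has to look at them.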
Personnel: Praful Bhaidasna