Gemini Cloud Assist in Google Cloud

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: Bobby Allen

Gemini Cloud Assist, a feature of Google Cloud, serves as an extensible cloud intelligence tool designed to enhance user efficiency by providing actionable insights and recommendations. Bobby Allen emphasizes that Gemini Cloud Assist integrates the intelligence from Google’s Gemini model to supercharge workloads on Google Cloud. This feature is particularly beneficial for users who are not necessarily building AI but are looking to optimize their existing cloud infrastructure. It offers insights on various aspects such as cost-saving opportunities, operational efficiencies, and application design improvements, all contextualized within the user’s specific cloud environment.

One of the primary advantages of Gemini Cloud Assist is its ability to save time and reduce technical debt. As organizations face increasing demands without corresponding increases in budget or personnel, tools like Gemini Cloud Assist become essential. The feature provides actionable insights directly within the Google Cloud console, allowing users to address inefficiencies such as underutilized resources or potential upgrade issues. For example, it can identify idle clusters that may be candidates for cost-saving measures like Autopilot mode and even offer commands and best practices to implement these changes. This functionality ensures that users can maintain optimal performance and cost-efficiency without needing to be experts in every aspect of their cloud environment.

Gemini Cloud Assist also addresses the challenge of keeping up with rapidly evolving technology. As training can quickly become outdated, the feature acts as a knowledgeable assistant, providing real-time, resource-aware insights and recommendations. It helps users navigate complex cloud environments by surfacing relevant information and best practices, thereby reducing the cognitive load on IT professionals. Additionally, the tool supports user queries through a chat interface, offering contextual answers based on the user’s specific resources. This makes it easier for users to implement best practices and optimize their cloud infrastructure effectively, ensuring they stay ahead of potential issues and maintain a high level of operational efficiency.


Google Cloud Network Infrastructure for AI/ML

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: Victor Moreno

Victor Moreno, a product manager at Google Cloud, presented on the network infrastructure Google Cloud has developed to support AI and machine learning (AI/ML) workloads. The exponential growth of AI/ML models necessitates moving vast amounts of data across networks, making it impossible to rely on a single TPU or host. Instead, thousands of nodes must communicate efficiently, which Google Cloud achieves through a robust software-defined network (SDN) that includes hardware acceleration. This infrastructure ensures that GPUs and TPUs can communicate at line rates, dealing with challenges like load balancing and data center topology restructuring to match traffic patterns.

Google Cloud’s AI/ML network infrastructure involves two main networks: one for GPU-to-GPU communication and another for connecting to external storage and data sources. The GPU network is designed to handle high bandwidth and low latency, essential for training large models distributed across many nodes. This network uses a combination of electrical and optical switching to create flexible topologies that can be reconfigured without physical changes. The second network connects the GPU clusters to storage, ensuring periodic snapshots of the training process are stored efficiently. This dual-network approach allows for high-performance data processing and storage communication within the same data center region.

In addition to the physical network infrastructure, Google Cloud leverages advanced load balancing techniques to optimize AI/ML workloads. By using custom metrics like queue depth, Google Cloud can significantly improve response times for AI models. This optimization is facilitated by tools such as the Open Request Cost Aggregation (ORCA) framework, which allows for more intelligent distribution of requests across model instances. These capabilities are integrated into Google Cloud’s Vertex AI service, providing users with scalable, efficient AI/ML infrastructure that can automatically adjust to workload demands, ensuring high performance and reliability.
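
To make the idea concrete, here is a minimal sketch of queue-depth-aware balancing in the spirit of ORCA: each backend reports a custom load metric alongside its responses, and the balancer favors the less loaded of two sampled instances. The class, names, and numbers are hypothetical illustrations, not the ORCA or Vertex AI API.

```python
import random
from dataclasses import dataclass

@dataclass
class Backend:
    """A model-serving instance with an ORCA-style load report."""
    name: str
    queue_depth: int  # custom metric reported by the backend

def pick_backend(backends: list[Backend]) -> Backend:
    """Power-of-two-choices using reported queue depth instead of
    round robin: sample two instances, send to the less loaded one."""
    a, b = random.sample(backends, 2)
    return a if a.queue_depth <= b.queue_depth else b

backends = [Backend("model-0", 3), Backend("model-1", 11), Backend("model-2", 1)]
print(pick_backend(backends).name)
```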


AI Workloads and Hardware Accelerators – Introducing the Google Cloud AI Hypercomputer

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: Ishan Sharma

Ishan Sharma, a Senior Product Manager for Google Kubernetes Engine (GKE), presented advancements in enhancing AI workloads on Google Cloud during Cloud Field Day 20. He emphasized the rapid evolution of AI research and its practical applications across various sectors, such as content generation, pharmaceutical research, and robotics. Google Cloud’s infrastructure, including its AI hypercomputer, is designed to support these complex AI models by providing robust and scalable solutions. Google’s extensive experience in AI, backed by over a decade of research, numerous publications, and technologies like the Transformer model and Tensor Processing Units (TPUs), positions it uniquely to meet the needs of customers looking to integrate AI into their workflows.

Sharma highlighted why customers prefer Google Cloud for AI workloads, citing the platform’s performance, flexibility, and reliability. Google Cloud offers a comprehensive portfolio of AI supercomputers that cater to different workloads, from training to serving. The infrastructure is built on a truly open and comprehensive stack, supporting both Google-developed models and those from third-party partners. Additionally, Google Cloud ensures high reliability and security, with metrics focused on actual work done rather than just capacity. The global scale of Google Cloud, with 37 regions and cutting-edge infrastructure, combined with a commitment to 100% renewable energy, makes it an attractive option for AI-driven enterprises.

The presentation also covered the specifics of Google Cloud’s AI Hypercomputer, a state-of-the-art platform designed for high performance and efficiency across the entire stack, from hardware to software. This includes various AI accelerators like GPUs and TPUs, and features like the Dynamic Workload Scheduler (DWS) for optimized resource management. Sharma explained how GKE supports AI workloads with tools like Kueue for job queuing and DWS for dynamic scheduling, enabling better utilization of resources. Additionally, GKE’s flexibility allows it to handle both training and inference workloads efficiently, offering features like rapid node startup and GPU sharing to drive down costs and improve performance.
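
As a rough illustration of this queuing pattern, the sketch below uses the Kubernetes Python client to submit a suspended training Job labeled for a Kueue queue; Kueue admits the Job when capacity is available. The queue name, image, and GPU request are hypothetical, and this shows the generic Kueue convention rather than a GKE-specific API.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

# A suspended batch Job labeled for Kueue admission. "training-queue" is a
# hypothetical LocalQueue a cluster admin would have created beforehand.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="train-demo",
        labels={"kueue.x-k8s.io/queue-name": "training-queue"},
    ),
    spec=client.V1JobSpec(
        suspend=True,  # Kueue unsuspends the Job once quota is granted
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="trainer",
                    image="us-docker.pkg.dev/my-project/train:latest",  # hypothetical
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )],
            ),
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```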


Security in Google Cloud

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: Glen Messenger

In his presentation at Cloud Field Day 20, Glen Messenger, Product Manager for Google’s GKE security team, discussed the complexities and challenges of securing Kubernetes environments. He emphasized that while Kubernetes offers significant power and flexibility, these attributes also introduce substantial complexity, making security a primary concern for users. Many Kubernetes users have experienced security incidents, either in production or during deployment, highlighting the need for robust security measures. Google’s approach to GKE security focuses on reducing risk, enhancing compliance, and improving operational efficiency. Messenger introduced the concept of Kubernetes Security Posture Management (KSPM), which is designed to automate security and compliance specifically for Kubernetes environments.

Messenger detailed several key areas of focus within KSPM, including vulnerability management, threat detection, and compliance and governance. For vulnerability management, Google has developed GKE Security Posture, a tool that performs runtime-based vulnerability detection on clusters, providing detailed insights into container OS and language-package vulnerabilities. The tool is designed to be user-friendly, allowing customers to filter vulnerabilities by severity, region, cluster, and other parameters. In terms of threat detection, Messenger highlighted the capabilities of GKE Threat Detection, which utilizes both log detection and behavior-based detection methods to identify and mitigate potential threats. This service is integrated with Google’s Security Command Center, providing a comprehensive view of threats across the entire GCP environment.

Regarding compliance and governance, Messenger explained that GKE compliance tools help customers adhere to industry standards and set governance guardrails. These tools provide dashboards that show compliance status and detailed remediation steps for identified issues. Additionally, Google’s policy controller, which utilizes OPA Gatekeeper, allows for the customization of policies to meet specific compliance requirements. Messenger concluded the presentation by addressing questions about automated remediation, the ability to filter and mute known vulnerabilities, and protections against data encryption attacks. Overall, Google’s GKE security efforts aim to simplify the management of security and compliance in Kubernetes environments, enabling customers to innovate while minimizing risk.
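
As an illustration of the kind of guardrail such policies express, the sketch below creates a widely used OPA Gatekeeper constraint that requires a label on every namespace. It assumes the stock K8sRequiredLabels ConstraintTemplate is already installed in the cluster; the constraint name and label are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()

# Illustrative Gatekeeper constraint: every Namespace must carry a "team"
# label. Assumes the common K8sRequiredLabels ConstraintTemplate exists.
constraint = {
    "apiVersion": "constraints.gatekeeper.sh/v1beta1",
    "kind": "K8sRequiredLabels",
    "metadata": {"name": "namespaces-must-have-team"},  # hypothetical name
    "spec": {
        "match": {"kinds": [{"apiGroups": [""], "kinds": ["Namespace"]}]},
        "parameters": {"labels": ["team"]},
    },
}
client.CustomObjectsApi().create_cluster_custom_object(
    group="constraints.gatekeeper.sh",
    version="v1beta1",
    plural="k8srequiredlabels",
    body=constraint,
)
```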


AI/ML Storage Workloads in Google Cloud

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: Sean Derrington

Sean Derrington from Google Cloud’s storage group presents advancements in cloud storage, particularly for AI and ML workloads. Google Cloud has focused on optimizing storage solutions to support the unique requirements of AI and ML applications, such as the need for high throughput and low latency. Key innovations include the Anywhere Cache, which allows data to be cached close to GPU and TPU resources to accelerate training processes, and the parallel file system, which is based on Intel DAOS and is designed to handle ultra-low latency and high throughput. These advancements aim to provide flexible and scalable storage options that can adapt to various workloads and performance needs.

Derrington also highlights the introduction of HyperDisk ML, a block storage offering that enables volumes of data to be accessible as read-only across thousands of hosts, further speeding up data loading for training. Furthermore, Google Cloud has introduced Cloud Storage FUSE with caching, which allows customers to mount a bucket as if it were a file system, reducing storage costs and improving training efficiency by eliminating the need for multiple data copies. These solutions are designed to decrease the time required for training epochs, thereby enhancing the overall efficiency of AI and ML workloads.
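
The practical effect of Cloud Storage FUSE is that training code can treat a bucket like a local directory. The sketch below assumes a bucket has already been mounted (for example with the gcsfuse CLI) at a hypothetical path; the shard layout is likewise illustrative.

```python
import os

# With Cloud Storage FUSE, a bucket mounted at a path such as
# /mnt/training-data (hypothetical; e.g. "gcsfuse my-bucket /mnt/training-data")
# behaves like a local directory, so the data loader needs no GCS-specific
# client code and no second copy of the dataset.
DATA_DIR = "/mnt/training-data/shards"

def iter_shards(data_dir: str = DATA_DIR):
    """Yield raw bytes for each shard file under the FUSE mount."""
    for name in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, name), "rb") as f:
            yield name, f.read()

for name, blob in iter_shards():
    print(f"{name}: {len(blob)} bytes")
```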

In addition to AI and ML optimizations, Google Cloud has focused on providing robust storage solutions for other workloads, such as GKE and enterprise applications. Filestore offers various instance types—Basic, Zonal, and Regional—each catering to different performance, capacity, and availability needs. Filestore Multi-Share allows for the provisioning of small persistent volumes, scaling automatically as needed. HyperDisk also introduces storage pools, enabling the pooling of IOPS and capacity across multiple volumes, thus optimizing resource usage and cost. These storage solutions are designed to support both stateless and stateful workloads, ensuring high availability and seamless failover capabilities.


Running Modern Workloads in Google Cloud

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: William Denniss

William Denniss, product manager at Google, introduced GKE Autopilot during his presentation at Cloud Field Day 20. He explained that GKE Autopilot is a simplified way of using Google Kubernetes Engine (GKE), focusing on Kubernetes as the primary interface and eliminating the need for users to manage the underlying infrastructure. Denniss emphasized that Kubernetes sits between traditional virtual machines (VMs) and fully managed services like Cloud Run, offering a balanced approach that provides flexibility without the complexity of managing low-level resources. He highlighted that Kubernetes is particularly beneficial for complex workloads, such as high-availability databases and AI training jobs, which require robust orchestration capabilities.

Denniss discussed the traditional challenges of managing Kubernetes, such as configuring node pools and handling security concerns. He explained that GKE Autopilot addresses these issues by collapsing the complex layers of infrastructure management into a more streamlined process. With Autopilot, users only need to interact with the Kubernetes API, while Google manages the underlying VMs and other infrastructure components. This approach reduces the administrative burden on users and allows them to focus on their workloads rather than the intricacies of infrastructure management. Denniss also mentioned that this model shifts the responsibility for infrastructure issues to Google, providing users with a more reliable and hands-off experience.
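
As a small sketch of that interaction model: on an Autopilot cluster the user describes only the workload and its resource requests through the standard Kubernetes API, and node provisioning happens behind the scenes. The deployment below is a hypothetical example, not specific to any session demo.

```python
from kubernetes import client, config

config.load_kube_config()

# On Autopilot, this spec is the whole interaction: no node pools to size.
# Nodes appear (and are billed) to fit the requested pod resources.
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="web",
                image="nginx:1.27",  # hypothetical workload
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "500m", "memory": "512Mi"},
                ),
            )]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```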

Discussing this solution with the delegates, Denniss concluded by emphasizing the importance of understanding the trade-offs between control and convenience, suggesting that while Autopilot may not be suitable for every use case, it offers significant benefits for those looking to simplify their Kubernetes management.


Running Enterprise Workloads in Google Cloud

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: Jeff Welsch

Jeff Welsch, product manager at Google Cloud, discusses the opportunity of running enterprise workloads in the cloud, emphasizing that enterprise use cases are substantial for many customers. He outlines Google’s compute organization, which includes offerings such as virtual machines, TPUs, GPUs, block storage, and enterprise solutions like VMware, SAP, and Microsoft. Welsch explains that Google Cloud is focused on optimizing infrastructure to meet customer requirements, especially in light of challenges like increasing compute demands from AI and the plateauing of Moore’s Law. Google Cloud’s approach involves leveraging AI capabilities and modern infrastructure to improve performance, reliability, security, and cost efficiency, while also prioritizing sustainability.

Welsch introduces Google’s Titanium technology, which aims to optimize infrastructure by breaking out of traditional server limitations and disaggregating performance capabilities. Titanium allows for tiered offloading, improving CPU responsiveness and storage performance, as exemplified by the HyperDisk service. He highlights that Titanium enables better optimization and efficiency, providing benefits like reduced latency and improved price performance without requiring customers to consume more resources. Additionally, Titanium supports dynamic resource management, allowing for live migration and non-disruptive maintenance, which enhances the overall reliability and performance of enterprise workloads.

The presentation also covers specific enterprise workloads like Microsoft, VMware, and SAP. Google Cloud offers robust support for Microsoft workloads, with features like cost optimization, live migration, and integration with AI-based modernization tools. For VMware, Google Cloud provides a seamless, integrated experience with the Google Cloud VMware Engine, facilitating easy migration and access to Google Cloud services. SAP workloads benefit from Google Cloud’s memory-optimized instances and tight integration with AI and machine learning capabilities. Welsch concludes by emphasizing Google Cloud’s commitment to optimizing infrastructure to meet the diverse needs of enterprise applications, ensuring performance, reliability, and cost-effectiveness.


Google Cloud Overview and Cloud Field Day Introduction

Event: Cloud Field Day 20

Appearance: Google Cloud Presents at Cloud Field Day 20

Company: Google Cloud

Video Links:

Personnel: Bobby Allen

In this presentation, Bobby Allen from Google Cloud provides an overview of the themes and topics to be discussed during their full-day session at Cloud Field Day 20. He begins by acknowledging the vast scope of Google Cloud, noting that this presentation focuses on foundational topics like storage, networking, and security, as well as the importance of AI in today’s tech landscape. Specifically, he discusses AI’s integration into the platform and its role in the software development lifecycle (SDLC). Throughout the presentation, Allen introduces his “Bobby-isms” and frames the discussion with key considerations to ponder throughout the day.

Allen underscores that Google Cloud is not just another cloud provider but a platform that supports billions of users globally, requiring robust, planet-scale infrastructure. He introduces the concept of Google Distributed Cloud, which offers various solutions for those who can’t always use the public cloud due to regulatory or operational constraints. These solutions include software-only options, Google-connected hardware, and air-gapped solutions for environments with limited connectivity. He also mentions modernization tools like Migration Center and Migrate to Containers, which help transition legacy workloads to more modern architectures like containers and serverless computing.

Throughout the presentation, Allen emphasizes the importance of balancing new technologies with existing, proven solutions. He introduces the idea that AI is not an end in itself but a means to enhance other applications and use cases. Using the analogy of AI as a “sauce” that improves the “dish” (the core application), he stresses the need for practical, customer-focused solutions. Allen also differentiates between incremental improvements (Neos) and groundbreaking innovations (Kainos), urging a balanced approach to technology adoption.


Introducing MVM: An Embedded KVM-Based Hypervisor Solution from Morpheus

Event: Cloud Field Day 20

Appearance: Morpheus Data Presents at Cloud Field Day 20

Company: Morpheus Data

Video Links:

Personnel: Brad Parks, David Estes

In addition to enabling self-service provisioning of VMs, Containers, and Application Services, the Morpheus orchestration platform has been extended to enable simple provisioning of Kubernetes, Docker, and KVM clusters.

As part of our ongoing effort to simplify IT, we’re introducing an embedded KVM-based hypervisor solution that customers can use to host workloads alongside the dozens of third-party hypervisors and clouds that Morpheus already supports natively or via plugin.

The session will showcase the Beta availability of MVM, including the ability to orchestrate cluster deployment, configure hyperconverged or external storage, manage networking, enable VM live migration, handle snapshots, and more.

Details, demo, and a downloadable community edition at www.morpheusdata.com.


Morpheus Platform Architecture: A Force Multiplier for Enterprise IT

Event: Cloud Field Day 20

Appearance: Morpheus Data Presents at Cloud Field Day 20

Company: Morpheus Data

Video Links:

Personnel: Martez Reed

For years, the Morpheus platform has enabled rapid integration of third-party technologies with dozens of out-of-the-box codeless integrations into clouds, ITSM, IPAM, Backups, and more. With the general availability of the Morpheus plugin framework in 2023 and updates to the Morpheus distributed worker, we are ushering in another level of extensibility.

This session showcases how the platform framework has evolved and highlights how third-party vendors, partners, and customers are using plugins to abstract vendor dependencies, mitigate the impact of skills gaps, and adapt to new use cases across Edge, Hybrid Cloud, and AI.

Details, demo, and a downloadable community edition at www.morpheusdata.com.


Morpheus Tech Foundation: Developer Self-Service and Platform Operations

Event: Cloud Field Day 20

Appearance: Morpheus Data Presents at Cloud Field Day 20

Company: Morpheus Data

Video Links:

Personnel: David Estes

This CFD live demo will show how enterprises and MSPs can enable speed and agility while also improving control, efficiency, and flexibility. The session will cover IaaS, DBaaS, CaaS, PaaS, AIaaS, and other core Morpheus use cases including runbook automation.

We’ll show how easy it is to integrate technologies like VMware, Nutanix, AWS, Azure, GCP, ServiceNow, Terraform, Ansible, and more while providing a governance framework and policy engine to bring Developers, Security, and Finance Operations teams closer together.

Details, demo, and a downloadable community edition at www.morpheusdata.com.


Morpheus in 2024: The Criticality of Platform Thinking

Event: Cloud Field Day 20

Appearance: Morpheus Data Presents at Cloud Field Day 20

Company: Morpheus Data

Video Links:

Personnel: Brad Parks

With increasing diversity in application formats, workload requirements, hypervisor options, cloud providers, automation approaches, and data locality, there is simply not enough IT funding or skilled practitioners to keep up.

The only sustainable path is to embrace unified and extensible platform frameworks that accept this divergence while at the same time shrinking the distance between people, processes, tools, and technologies.

In this CFD intro session we’ll set a foundation for developer enablement, IT efficiency, technology independence, and unified orchestration across edge, datacenter, co-location, and hyperscale providers.

Details, demo, and a downloadable community edition at www.morpheusdata.com.


Automated Congestion Management in the AI Data Center with Juniper Networks

Event: Cloud Field Day 20

Appearance: Juniper Networks Presents at Cloud Field Day 20

Company: Juniper Networks

Video Links:

Personnel: Rajagopalan Subrahmanian, Vikram Singh

To maximize throughput and minimize packet loss, Ethernet-based AI fabrics use the DCQCN congestion management protocol, but DCQCN introduces significant operational complexity for human operators. Learn how Juniper Apstra handles this new challenge in stride, automatically optimizing throughput and the “right amount” of packet loss.

Juniper Networks’ presentation at Cloud Field Day 20, led by Rajagopalan Subrahmanian and Vikram Singh, focused on automated congestion management in AI/ML data center fabrics. They began by explaining the challenges faced by network administrators in managing congestion, drawing an analogy to metering lights on freeways that regulate traffic flow. In AI/ML environments, the complexity increases due to the large number of entities that need monitoring and the manual, error-prone process of tuning congestion parameters. Juniper’s solution integrates with their Apstra platform to automate this process, leveraging continuous monitoring and closed-loop automation to optimize network performance dynamically.

The core of Juniper’s approach involves a DCQCN AutoTune application that utilizes Apstra’s capabilities to monitor key performance indicators and adjust network configurations in real-time. By simulating high-traffic scenarios in their lab, they demonstrated how the system detects congestion and uses Terraform to tweak configurations across the network fabric. This automated process helps maintain optimal throughput and the right amount of packet loss, adjusting parameters based on real-time data rather than static, manual settings. The system can apply changes selectively to affected switches or more broadly across similar network segments to preempt potential issues.

Juniper’s method combines two Ethernet congestion control mechanisms: Priority Flow Control (PFC) and Explicit Congestion Notification (ECN). PFC acts as a brute-force method to stop traffic when buffers are nearly full, while ECN offers a more granular approach by marking packets to signal congestion and prompt sender devices to reduce their transmission rates. The DCQCN protocol judiciously uses both techniques to manage congestion effectively. Juniper’s automation adjusts these settings dynamically, ensuring that the network remains stable and efficient under varying loads. The presentation highlighted the flexibility and potential for further customization, including integration with application-level metrics and additional congestion indicators from SmartNICs.
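
For intuition, here is a heavily simplified sketch of the sender-side reaction loop DCQCN defines, following the published algorithm: each congestion notification cuts the sending rate by a factor tied to a running congestion estimate, while quiet periods decay that estimate and recover toward the last good rate. The constants are illustrative, and this is not Juniper’s AutoTune, which tunes DCQCN’s parameters rather than reimplementing the protocol.

```python
class DcqcnSender:
    """Simplified DCQCN sender-side rate control (after Zhu et al., 2015).

    ECN-marked packets cause the receiver to send CNPs (congestion
    notification packets); each CNP cuts the rate, and the sender
    probes back toward a target rate once CNPs stop arriving.
    """

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.rc = line_rate_gbps   # current rate
        self.rt = line_rate_gbps   # target rate
        self.alpha = 1.0           # running congestion estimate
        self.g = g                 # estimator gain, illustrative default

    def on_cnp(self) -> None:
        """CNP received: remember the old rate, cut by alpha/2."""
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNPs for a timer period: decay alpha, recover toward rt."""
        self.alpha = (1 - self.g) * self.alpha
        self.rc = (self.rc + self.rt) / 2  # "fast recovery" step

s = DcqcnSender(line_rate_gbps=400.0)
s.on_cnp()
print(round(s.rc, 1))  # rate after one congestion notification
```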


Design, Deploy, and Operate AI Clusters like a Pro with Juniper Networks

Event: Cloud Field Day 20

Appearance: Juniper Networks Presents at Cloud Field Day 20

Company: Juniper Networks

Video Links:

Personnel: Jay Wilson

Struggling with where to start with your on-prem AI training cluster? Juniper Validated Designs (JVDs) are rigorously pre-tested to make sure your deployments are relatively pain-free, and we now offer JVDs to meet the specific needs of AI data centers. See how Apstra does the Day 0/1/2 heavy lifting for you with intent-based automation.

Jay Wilson, an architect at Juniper Networks, presented at Cloud Field Day 20, focusing on the deployment and management of AI clusters using Juniper’s Apstra software. Wilson emphasized that Apstra is designed to manage data center fabrics rather than entire data centers, highlighting its ability to handle multiple fabrics and even multiple data centers from a single instance. The presentation aimed to demonstrate how Apstra’s intent-based automation can simplify the complex processes involved in setting up and maintaining AI training clusters. Wilson, with his extensive background in high-performance computing (HPC), underscored the importance of intent as the foundation of Apstra, ensuring that any changes made are validated against predefined goals, thus preventing misconfigurations.

Wilson provided a detailed walkthrough of how Apstra works, particularly focusing on its telemetry and configuration management capabilities. He explained that Apstra collects custom telemetry data, such as explicit congestion notifications (ECNs) and priority flow control (PFC) counters, to monitor the health and performance of AI clusters. This data is crucial for maintaining a lossless environment, which is vital for AI workloads. Wilson also discussed the use of configlets—small pieces of code that allow for fine-tuned adjustments to the network configuration. These configlets are essential for tailoring the environment to meet specific needs without disrupting the overall fabric management that Apstra provides.
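
As a sketch of what such counter-based monitoring amounts to, the snippet below aggregates per-interface ECN-mark deltas over a sliding window and flags sustained marking as an anomaly. This is illustrative logic only, not Apstra’s probe definition language.

```python
from collections import deque

WINDOW = 12           # samples, e.g. one per collection interval (illustrative)
ECN_THRESHOLD = 1000  # marked packets per interval, illustrative

class InterfaceProbe:
    """Hypothetical probe logic over ECN/PFC-style telemetry counters."""

    def __init__(self) -> None:
        self.samples: deque[int] = deque(maxlen=WINDOW)

    def record(self, ecn_marked_delta: int) -> None:
        self.samples.append(ecn_marked_delta)

    def anomalous(self) -> bool:
        """Raise an anomaly only on sustained marking, not one-off bursts."""
        hot = [s for s in self.samples if s > ECN_THRESHOLD]
        return len(self.samples) == WINDOW and len(hot) > WINDOW // 2

probe = InterfaceProbe()
for delta in [0, 50, 2400, 3100, 2800] + [2600] * 7:
    probe.record(delta)
print(probe.anomalous())
```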

The presentation also covered the operational aspects of using Apstra, including its anomaly detection and rollback features. Wilson demonstrated how Apstra’s single source of truth model ensures that all changes are validated and committed atomically, thus maintaining the integrity of the network. He showed how Apstra can identify and troubleshoot issues in real-time, using a combination of service and probe anomalies to pinpoint problems. Additionally, Wilson highlighted the importance of Apstra’s ability to roll back configurations to a previous state, a feature that is particularly useful in dynamic environments where multiple teams are making frequent changes. This capability ensures that any unintended disruptions can be quickly mitigated, thereby maintaining the stability and performance of the AI clusters.


Your Private AI Data Center, as Easy as Cloud with Juniper Networks

Event: Cloud Field Day 20

Appearance: Juniper Networks Presents at Cloud Field Day 20

Company: Juniper Networks

Video Links:

Personnel: Nick Davey

Achieve public cloud-like service consumption in your on-prem AI data center with Apstra and Terraform. Apstra and the Terraform Provider for Apstra fit traditionally complex network services, like EVPN, neatly into predefined application automation. This session shows how network teams can self-serve network services in a familiar way, providing seamless deployments across any infrastructure for new AI/ML workloads.

Nick Davey from Juniper Networks discussed how the increasing complexity and scale of modern data centers, particularly for AI and ML workloads, necessitate advanced automation solutions. He introduced Juniper Validated Designs (JVDs) and highlighted the AI JVD, which includes not just architectural diagrams and configurations but also the required automation to bring these designs to life using Apstra. The AI JVD allows for flexible, scalable, and automated deployment of AI data centers, making it possible to adapt to specific network requirements and efficiently manage the intricate configurations needed for AI workloads.

Central to this automation is Apstra, which provides a cloud-like API to manage physical data center resources. Apstra’s design cycle allows network engineers to move from traditional whiteboard planning to orchestrated, automated design and deployment. This process involves designing the network in Apstra, assigning physical resources, and deploying the network using Zero Touch Provisioning (ZTP). Apstra also supports continuous monitoring and optimization, ensuring the network remains in the desired state, which is crucial for handling the demanding and complex nature of AI workloads.

The integration with Terraform further enhances automation capabilities, allowing for bulk operations and dynamic infrastructure management. Terraform’s declarative approach complements Apstra’s intent-based networking, enabling users to manage their data centers as code. This integration facilitates seamless deployment and management of AI data centers, ensuring that network changes and configurations can be version-controlled, tested, and automated. Additionally, the use of ServiceNow as a front-end interface allows non-networking personnel, such as data scientists, to request and provision infrastructure without needing to understand the underlying complexities, thus democratizing access to AI resources and streamlining operations.


Network Myths to Solutions – Juniper’s Approach to AI Data Centers

Event: Cloud Field Day 20

Appearance: Juniper Networks Presents at Cloud Field Day 20

Company: Juniper Networks

Video Links:

Personnel: Praful Lalchandani

In the rapidly evolving landscape of AI-driven data centers, misconceptions abound, often clouding the decision-making process for IT professionals. In this session, we embark on a myth-busting journey to unveil the realities of AI data centers. This session will debunk common misconceptions and showcase how Juniper Networks’ cutting-edge networking solutions can transform and optimize your data center infrastructure.

Juniper Networks’ presentation at Cloud Field Day 20, led by Praful Lalchandani, focused on debunking myths surrounding AI-driven data centers and demonstrating how Juniper’s networking solutions can optimize these environments. Lalchandani emphasized the significant role of networking in maximizing the return on investment (ROI) for expensive GPU assets, which are central to AI projects. He outlined Juniper’s mission to deliver high network performance (400 gig and 800 gig), ease of operations through Apstra, and a lower total cost of ownership compared to proprietary technologies. The presentation included a detailed explanation of the AI application lifecycle, from data gathering and preprocessing to training and inference, highlighting Juniper’s solutions for both training and inference clusters.

The discussion delved into the specific network requirements for AI training and inference clusters. Lalchandani explained the critical role of the network in job completion time for training models, which involves complex traffic patterns due to data parallelism and model parallelism. He described the different types of networks involved: the backend GPU training network, the backend dedicated storage network, and the frontend network for external connectivity and orchestration. Lalchandani also addressed the importance of eliminating network bottlenecks to prevent GPUs from idling, which would otherwise lead to inefficiencies and increased costs. The presentation highlighted Juniper’s approach to achieving these goals through advanced load balancing techniques and congestion control mechanisms.

In addition to performance optimization, Lalchandani discussed the economic advantages of using Ethernet over InfiniBand for AI data centers. He presented evidence from Juniper’s AI innovation lab and customer use cases, showing that Ethernet can match InfiniBand’s performance while being more cost-effective and operationally simpler. The presentation also tackled the myth that packet spraying is necessary for maximizing performance, demonstrating that Juniper’s dynamic load balancing and Flowlet techniques can achieve near-optimal results without requiring expensive SmartNICs. Finally, Lalchandani touched on the importance of lossless networking, showing that while 100% lossless networking is not always required, Juniper’s solutions can adapt to different model sensitivities, providing flexibility and efficiency in AI data center operations.
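
To see why flowlet-based balancing can avoid reordering without SmartNIC-driven packet spraying, consider the simplified sketch below: a flow stays pinned to its current path while packets arrive close together and may be re-steered only after an idle gap long enough for in-flight packets to drain. The gap value and random path choice are illustrative stand-ins for Juniper’s load-aware implementation.

```python
import random

FLOWLET_GAP_US = 100  # illustrative idle gap; real values track path-delay skew
PATHS = ["spine-1", "spine-2", "spine-3", "spine-4"]

# flow id -> (chosen path, timestamp of last packet in microseconds)
state: dict[str, tuple[str, float]] = {}

def pick_path(flow_id: str, now_us: float) -> str:
    """Keep a flow on its path, but re-pick when an idle gap means
    reordering is no longer a risk (the start of a new flowlet)."""
    entry = state.get(flow_id)
    if entry is None or now_us - entry[1] > FLOWLET_GAP_US:
        path = random.choice(PATHS)  # stand-in for a load-aware choice
    else:
        path = entry[0]
    state[flow_id] = (path, now_us)
    return path

print(pick_path("gpu7->gpu42", now_us=0.0))
print(pick_path("gpu7->gpu42", now_us=50.0))     # same flowlet, same path
print(pick_path("gpu7->gpu42", now_us=1_000.0))  # new flowlet, may move
```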


Seize the AI Moment with Juniper Networks

Event: Cloud Field Day 20

Appearance: Juniper Networks Presents at Cloud Field Day 20

Company: Juniper Networks

Video Links:

Personnel: Mansour Karam

AI transformation is here. Is your data center up to the task? Juniper’s AI-Native Networking Platform is optimized for the connectivity, data, volume, and speed requirements of mission-critical AI workloads. Our AI data center solution offers the fastest and most flexible way to deploy reliable high-performing AI training, inference, and storage clusters, and the simplest to operate with limited IT resources.

Mansour Karam, Global VP for Data Center at Juniper Networks, highlighted the rapid advancements and transformative potential of AI in his presentation at Cloud Field Day 20. He likened the current excitement surrounding AI to the early days of significant technological breakthroughs, such as the advent of x86 processors and the introduction of the iPhone. Karam emphasized that Juniper’s AI-Native Networking Platform is designed to meet the demanding connectivity, data volume, and speed requirements of AI workloads. The platform aims to simplify the deployment and operation of AI training, inference, and storage clusters, even with limited IT resources, ensuring high performance and reliability.

Karam detailed the evolution of networking technologies and their pivotal role in supporting various technological revolutions, from the internet and cloud to mobile and digital transformations. He underscored the importance of Ethernet in AI clusters, noting its widespread deployment, extensive ecosystem, and cost-effectiveness compared to specialized networks like InfiniBand. Juniper has been at the forefront of these advancements, offering both Broadcom-based and custom solutions to deliver high-performance Ethernet capabilities. The company has also focused on AI for networking, leveraging AI to transform network operations and enhance reliability, speed, and efficiency through automation and intent-based networking.

Furthermore, Karam introduced Apstra, a tool acquired by Juniper, which automates the entire lifecycle of data center networks using intent-based networking. Apstra ensures reliability and speed in network operations by pre-validating configurations and maintaining a closed-loop system for continuous telemetry collection and real-time interaction with the network. Juniper’s AI-driven approach also includes probabilistic insights and application awareness, providing comprehensive visibility and control over network performance. By combining these elements, Juniper aims to offer a robust, flexible, and vendor-agnostic solution for managing AI workloads and optimizing network performance, thus positioning itself as a leader in the AI networking wave.


Opengear Continues to Innovate in OOB Management – Hybrid Cloud Deployment Strategy

Event: Tech Field Day Extra at Cisco Live US 2024

Appearance: Opengear Presents at Tech Field Day Extra at Cisco Live US 2024

Company: Opengear

Video Links:

Personnel: Jeff Blyther

In this session, learn how Opengear specializes in out-of-band management. Their failover-to-cellular feature ensures your network can be managed under the toughest of conditions. Opengear offers first day, every day, and worst day solutions for deployment, management, and remediation. You can access devices via cellular connectivity and manage them through an intuitive GUI.

Lighthouse, central to Opengear’s management story, is another important feature. It supports multi-tenancy and enables quick deployment and integration of devices through the Lighthouse Service Portal.

New this year is Smart Management Fabric (SMF), which allows traffic to be routed through a WireGuard VPN. Connected Resource Gateway (CRG) gives you direct access to the GUI on network devices. Opengear continues to innovate, with new features and compliance considerations shaping future enhancements.


Digital Experience Assurance with Cisco ThousandEyes

Event: Tech Field Day Extra at Cisco Live US 2024

Appearance: Cisco ThousandEyes Presents at Tech Field Day Extra at Cisco Live US 2024

Company: Cisco

Video Links:

Personnel: Marko Tisler

In this video, you’ll discover how to harness the combined power of Cisco Networking Cloud, ThousandEyes Digital Experience Assurance, and Cisco’s unmatched dataset. Learn how these cutting-edge innovations provide proactive insights and automate operations across your entire digital ecosystem, encompassing both your own and third-party environments.


Cisco Secure Connect and Client ZTNA with New SSE Engine powered by Cisco Secure Access

Event: Tech Field Day Extra at Cisco Live US 2024

Appearance: Cisco Networking Presents at Tech Field Day Extra at Cisco Live US 2024

Company: Cisco

Video Links:

Personnel: Mark Townsley, Vinny Parla

In this session, learn how Cisco Zero Trust Network Access integrates with the larger Cisco SASE platform. Cisco Secure Client provides network access built on zero-trust principles, including posture checking and user authentication, with clientless options also available. Secure Client uses a streaming ecosystem without IP packet forwarding, enhancing security through obfuscation of internal network details. The client also has built-in microsegmentation for application-level policy enforcement.

Secure Client also uses the new MASQUE protocol for proxying QUIC connections, enhancing setup speed and resiliency and providing granular application controls. It is supported on Apple operating systems, Samsung devices, and Windows, with device-specific enrollment and traffic-interception mechanisms. Certificates are stored in the device’s secure enclave or TPM to bind them to that device and protect them from theft.

Existing VPN and clientless access methods remain available alongside the Cisco Secure Client. The ZTNA proxy runs in the cloud without an on-premises requirement.