AI and GPU data centres: Navigating the networking challenge

By Alan Stewart-Brown, VP EMEA at Opengear.

The rise of artificial intelligence (AI) and its integration into industries has increasingly become a focal point worldwide. AI for IT operations (AI Ops), a practice that leverages AI to optimise and automate network management tasks, is widely expected to revolutionise network operations. However, to be effective, it requires a flexible, software-defined network control plane paired with secure remote access for provisioning, orchestration, management, and remediation. 

To deliver to its full potential, AI relies on immense computational power, much of which is delivered through modern data centres. These data centres, equipped with advanced graphics processing units (GPUs), have become the backbone of AI innovation. Powered by Moore's Law, GPUs have been critical in supporting the growing demands of AI workloads. According to MarketsandMarkets, the global data centre GPU market was valued at US $14.3 billion in 2023 and is estimated to reach US $63 billion by 2028, growing at a compound annual growth rate of 34.6% over the forecast period from 2023 to 2028.

The elephant in the data centre: Networking bottlenecks

GPUs have revolutionised AI development due to their ability to process vast amounts of data simultaneously. This parallel processing is ideal for the complex computations required by deep learning and large language models like GPT. 

Yet as these models grow in complexity and size, they generate "elephant flows" – large, sustained data transfers that strain traditional ethernet networks. The result is congestion and increased latency, creating bottlenecks that hamper performance. Ethernet, while ubiquitous and cost-effective, was not originally designed to handle such voluminous, high-speed data transfers.

This networking bottleneck has ignited a debate within the data centre community: Should the industry continue to rely on traditional ethernet networks, or explore alternative solutions better suited for AI workloads? Some argue that enhanced ethernet technologies, such as remote direct memory access (RDMA) over converged ethernet (RoCE), offer low-latency data transfer capabilities that can mitigate these issues. Others believe that entirely new networking paradigms may be necessary to meet the demands of AI-driven data centres.

Amid this technological tug-of-war, network management within GPU data centres faces its own challenges. Traditional network switches typically include console management ports for straightforward configuration, but many newer, high-speed switches lack these ports, relying instead on ethernet management interfaces. This discrepancy necessitates a re-evaluation of management strategies to ensure seamless operation regardless of the underlying networking technology.

Adapting network management for AI’s future

Independent overlay management networks have emerged as a viable solution, providing a unified management layer that interfaces with both ethernet and serial connections. This approach ensures data centre operators maintain robust control over their networks, enabling secure remote access for provisioning, orchestration, management, and remediation tasks. By decoupling the management plane from the data plane, these overlay networks offer the flexibility and resilience required in the evolving landscape of GPU data centres.
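As a loose illustration of that unified management layer – the device names, addresses, and the `send_command` helper below are all hypothetical, and the transport calls are placeholders rather than real SSH or serial I/O – a single entry point can hide whether a device is reached over its ethernet management interface or a serial console connection:

```python
from dataclasses import dataclass

@dataclass
class ManagedDevice:
    name: str
    transport: str  # "ethernet" (management IP) or "serial" (console-server port)
    address: str

def send_command(device: ManagedDevice, command: str) -> str:
    # One entry point hides which transport reaches the device, so the
    # overlay can manage old and new switches through the same interface.
    if device.transport == "ethernet":
        return f"[ssh {device.address}] {command}"      # stand-in for a real SSH call
    return f"[console {device.address}] {command}"      # stand-in for a serial write

# Newer high-speed switches may expose only an ethernet management port,
# while older gear is reached via its serial console; the overlay treats both alike.
devices = [
    ManagedDevice("spine-1", "ethernet", "10.0.100.11"),
    ManagedDevice("leaf-7", "serial", "console-2:7007"),
]
for d in devices:
    print(send_command(d, "show interfaces"))
```

The point of the sketch is the decoupling: management logic addresses devices uniformly, and the transport detail stays behind one function.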

However, as networks grow in complexity, relying solely on in-band management is risky. This is where out-of-band management becomes crucial, providing a dedicated pathway that operates independently of the primary network infrastructure. In the event of network failures or disruptions caused by heavy AI data loads, out-of-band access allows administrators to remotely manage and troubleshoot devices without relying on the main network. This minimises downtime and maintains operational continuity – critical for AI workloads, where any interruption can lead to significant productivity losses.
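The fallback logic can be sketched in a few lines. This is a minimal illustration, not a product feature: the hostnames are hypothetical, and a real out-of-band deployment would use a dedicated console server rather than a simple TCP probe.

```python
import socket

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    # Probe TCP reachability of the in-band management path.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def management_path(inband_host: str, oob_host: str, port: int = 22) -> str:
    # Prefer the in-band path; fall back to the out-of-band console
    # server when the primary network is congested, down, or unreachable.
    if reachable(inband_host, port):
        return f"in-band via {inband_host}"
    return f"out-of-band via {oob_host}"

# Hypothetical addresses: if the switch's management IP stops answering,
# administration continues through the out-of-band gateway.
print(management_path("10.0.100.11", "oob-gateway.example.net"))
```

The design point is that the out-of-band path is checked and chosen automatically, so operators keep access even while the data plane is saturated by AI traffic.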

Integrating out-of-band management solutions enhances resilience, ensuring continuous operations even under strain. This dedicated channel allows swift issue resolution, safeguarding AI application performance and reliability.

The broader challenge lies not just in selecting specific networking technologies but in designing infrastructure capable of meeting the ever-increasing demands of AI workloads. Data centres must prioritise flexibility, scalability, and security in their network designs. Embracing software-defined networking (SDN) creates a flexible control plane that dynamically adjusts to shifting workloads and network conditions. This adaptability is crucial for handling the unpredictable nature of AI data flows.

As edge computing and IoT devices generate more data at the network's periphery, data centres must extend capabilities beyond centralised locations. This expansion highlights the need for resilient infrastructure and cost-effective edge deployments. Implementing robust network management solutions across distributed environments ensures data integrity and availability, regardless of where data is generated or consumed. Without these measures, including both in-band and out-of-band management, the vast volumes of data risk being underutilised, limiting their potential to produce actionable insights and meaningful change.

Navigating the networking challenge in AI-powered data centres requires a commitment to innovation and agility. Organisations must remain open to adopting new technologies that enhance network performance and management. Integrating AI into operations can improve efficiency and reduce the likelihood of human error. This proactive approach enables businesses to pre-emptively address issues before they escalate into significant problems.

Yet there is no one-size-fits-all solution. The choice of networking infrastructure may vary based on specific use cases, budget constraints, and scalability requirements. What remains constant is the need for a robust, flexible network management strategy that accommodates current demands and future growth.

In this rapidly evolving landscape, robust network management remains essential for ensuring performance, scalability, and security in AI-powered data centres. Strategic use of out-of-band management, combined with innovative technologies, enables data centres to handle the growing demands of AI workloads while maintaining operational continuity. By adopting flexible and resilient infrastructure, organisations can unlock AI's full potential, driving innovation and thriving in an increasingly data-driven world.
