Data Centre Cooling for AI Workloads

By David Watkins, Solutions Director at VIRTUS Data Centres.

In the rapidly evolving landscape of artificial intelligence (AI), the demand for computational power is soaring, placing unprecedented strain on data centres around the globe. The burgeoning scale and intensity of AI workloads bring with them a critical challenge: managing heat. As AI models grow more sophisticated and data processing becomes more intensive, the thermal management of data centres has become a pivotal concern. Effective cooling strategies are not merely about maintaining optimal operating temperatures; they are integral to ensuring system reliability, maximising performance, and minimising operational costs.

Data centre cooling is no longer a peripheral consideration but a central component of data centre design and operation. Traditional cooling methods, such as air-based systems, increasingly struggle to keep pace with the thermal demands of modern AI hardware. These systems rely on cold air being circulated through the data centre, where it is heated by the high-density compute equipment. As AI workloads generate an ever-increasing amount of heat, this approach can lead to inefficiencies and higher energy consumption. To address these challenges, data centres are exploring more advanced cooling methods.

The Liquid Alternative

Liquid cooling has emerged as a transformative addition to traditional air-based cooling systems, offering significant advantages in managing the heat produced by high-performance computing environments. Because liquids carry far more heat per unit volume than air, cooling components directly with liquid achieves markedly more efficient heat dissipation, enhancing thermal management across data centres. This not only boosts cooling efficiency but also helps lower the overall energy footprint, making it an appealing solution for handling intensive AI workloads.
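To put that efficiency claim in perspective, the back-of-the-envelope sketch below compares how much heat the same volumetric flow of water and air can carry away for the same temperature rise. The property values are rough room-temperature figures, not vendor data, and the flow rate is an arbitrary illustration.

```python
# Rough comparison of heat carried away by equal volumetric flows of
# water and air: Q = flow * density * specific_heat * delta_T.
# Property values are approximate room-temperature figures.

AIR_DENSITY = 1.2        # kg/m^3
AIR_CP = 1005            # J/(kg*K), specific heat of air
WATER_DENSITY = 997      # kg/m^3
WATER_CP = 4186          # J/(kg*K), specific heat of water

def heat_removed_kw(flow_m3_s: float, density: float, cp: float,
                    delta_t_k: float) -> float:
    """Heat (kW) carried away by a coolant stream for a given temperature rise."""
    return flow_m3_s * density * cp * delta_t_k / 1000.0

# Same 10 litres/second flow and the same 10 K temperature rise for both media.
flow, delta_t = 0.01, 10.0
q_air = heat_removed_kw(flow, AIR_DENSITY, AIR_CP, delta_t)
q_water = heat_removed_kw(flow, WATER_DENSITY, WATER_CP, delta_t)

print(f"Air:   {q_air:.2f} kW")          # ~0.12 kW
print(f"Water: {q_water:.0f} kW")        # ~417 kW
print(f"Ratio: {q_water / q_air:.0f}x")  # water moves roughly 3,500x more heat
```

The three-orders-of-magnitude gap is why even partial adoption of liquid cooling changes the economics of a high-density hall.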

Two primary approaches to liquid cooling are especially noteworthy: immersion cooling and direct-to-chip cooling.

Immersion cooling involves submerging IT hardware - such as servers and graphics processing units (GPUs) - in a specially designed dielectric fluid, such as mineral oil or a synthetic coolant. The fluid absorbs and dissipates heat directly from the components, eliminating the need for traditional air-cooled systems. By cooling the hardware directly, immersion cooling greatly enhances energy efficiency and reduces operating costs, which is particularly advantageous for AI workloads that generate substantial amounts of heat.

In contrast, direct-to-chip cooling, also known as cold plate cooling, delivers coolant directly to the heat-generating components of servers, such as central processing units (CPUs) and GPUs. This approach maximises heat transfer by removing heat at its source, improving overall performance and reliability. By addressing the cooling needs of critical components precisely, direct-to-chip cooling minimises the risk of thermal throttling and hardware failures. That makes it especially valuable for data centres managing high-density AI workloads, where maintaining peak operational efficiency and system stability is essential.
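As a rough illustration of why coolant supply temperature matters in a direct-to-chip loop, the sketch below uses a simplified steady-state model in which chip temperature equals coolant inlet temperature plus power multiplied by a lumped thermal resistance. The 700 W device and the 0.05 K/W resistance are assumed figures for illustration, not datasheet values.

```python
# Simplified steady-state model of a direct-to-chip cold plate:
# chip temperature = coolant inlet temperature + power * thermal resistance.
# Both the device power and the resistance below are illustrative assumptions.

def chip_temp_c(coolant_in_c: float, chip_power_w: float,
                r_thermal_k_per_w: float) -> float:
    """Estimate chip temperature for a given lumped cold-plate thermal resistance."""
    return coolant_in_c + chip_power_w * r_thermal_k_per_w

# Hypothetical 700 W accelerator with an assumed 0.05 K/W cold-plate path.
for coolant_in in (25.0, 35.0, 45.0):
    t = chip_temp_c(coolant_in, 700.0, 0.05)
    print(f"coolant {coolant_in:.0f} C -> chip ~{t:.0f} C")
# Warmer coolant raises chip temperature one-for-one, which is why supply
# temperature and flow rate are sized against the silicon's throttling limits.
```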

Overall, both immersion and direct-to-chip cooling technologies provide effective solutions for the heat management challenges posed by advanced AI workloads, offering data centres the ability to enhance cooling performance while reducing energy consumption and operational costs.

A Flexible Approach

The versatility of liquid cooling technologies offers data centre operators a strategic advantage, enabling them to customise cooling solutions to meet specific infrastructure and AI workload needs. By employing a mix-and-match approach, data centres can enhance cooling efficiency and adapt to diverse cooling requirements. Integrating different cooling technologies - such as immersion cooling, direct-to-chip cooling, and air cooling - allows for the optimisation of heat management across various components and workload types. This flexibility ensures that each technology's strengths are maximised while mitigating its limitations.

This approach addresses the varied heat dissipation characteristics of diverse AI hardware configurations. By tailoring cooling solutions to specific workload demands, data centres can maintain system stability and performance. As AI workloads and data centre requirements evolve, a scalable and adaptable cooling infrastructure becomes crucial. Combining multiple cooling methods supports future upgrades and expansion without compromising efficiency. For instance, air cooling remains essential for certain high-performance computing and networking components, while the primary heat rejection systems, such as chillers, must be sized appropriately and, ideally, be capable of heat reuse so that waste heat is managed effectively. This integrated strategy not only enhances cooling performance but also prepares data centres for ongoing advancements and growing demands.
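To make the sizing point concrete, the sketch below splits a hypothetical hall's IT load between liquid and air loops, adds design headroom to the chiller capacity, and estimates how much heat might be available for reuse. Every fraction and margin here is an assumption for illustration, not design guidance.

```python
# Illustrative sizing of primary heat rejection for a hall that mixes
# direct liquid cooling with residual air cooling. All figures are
# assumptions for the sketch, not engineering recommendations.

IT_LOAD_KW = 2000.0        # total IT load of the hypothetical hall
LIQUID_FRACTION = 0.8      # assumed share of heat captured by liquid loops
DESIGN_MARGIN = 1.2        # assumed 20% headroom for growth and redundancy
REUSE_FRACTION = 0.6       # assumed share of liquid-loop heat exported for reuse

liquid_heat = IT_LOAD_KW * LIQUID_FRACTION   # rejected via the liquid plant
air_heat = IT_LOAD_KW - liquid_heat          # still handled by air systems

chiller_capacity = liquid_heat * DESIGN_MARGIN
reusable_heat = liquid_heat * REUSE_FRACTION # e.g. exported to district heating

print(f"Liquid loop heat:  {liquid_heat:.0f} kW")      # 1600 kW
print(f"Air-side heat:     {air_heat:.0f} kW")         # 400 kW
print(f"Chiller capacity:  {chiller_capacity:.0f} kW") # 1920 kW with margin
print(f"Reusable heat:     {reusable_heat:.0f} kW")    # 960 kW potentially exported
```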

Whilst liquid cooling has emerged as the preeminent solution for addressing the thermal management challenges posed by AI workloads, it’s also important to understand that air cooling systems will continue to be part of the data centre infrastructure for the foreseeable future.

Navigating Challenges

Because AI workloads involve diverse hardware with varying cooling needs, it is challenging to design a one-size-fits-all cooling solution. Advanced systems must be tailored to specific components, requiring detailed engineering and testing. This need for customisation can lead to compatibility issues and increase the complexity of the cooling infrastructure.

Despite the significant potential benefits, the adoption of advanced cooling technologies in data centres - particularly for AI workloads - presents additional challenges that must be overcome. Perhaps most pressingly, the initial expense of advanced cooling systems, such as liquid and immersion cooling, can be substantial. These technologies require significant investment in new equipment and infrastructure modifications. Integrating these systems into existing data centres is complex and can disrupt operations, particularly in older facilities not designed for such upgrades.

Advanced cooling solutions also demand specialised maintenance and skilled personnel. Systems like immersion cooling need careful monitoring to prevent issues such as leaks, and finding or training staff with the necessary expertise can be difficult. Ongoing maintenance is essential to prevent downtime and maintain efficiency, adding another layer of operational complexity.
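The monitoring involved can be as simple as watching coolant level and loop pressure for drift that may indicate a leak. The sketch below is a hypothetical example of such a check; the sensor names, bands, and thresholds are all invented for illustration.

```python
# Hypothetical leak check for an immersion tank or secondary liquid loop:
# alert when coolant level or loop pressure drifts outside a safe band.
# Sensor names and thresholds are invented for this sketch.

from dataclasses import dataclass

@dataclass
class LoopReading:
    tank_level_pct: float     # coolant level in the immersion tank
    loop_pressure_kpa: float  # secondary loop pressure

LEVEL_MIN_PCT = 95.0                  # assumed minimum acceptable fill level
PRESSURE_BAND_KPA = (180.0, 220.0)    # assumed normal operating band

def leak_alerts(reading: LoopReading) -> list:
    """Return human-readable alerts for readings outside the safe band."""
    alerts = []
    if reading.tank_level_pct < LEVEL_MIN_PCT:
        alerts.append(f"coolant level low: {reading.tank_level_pct:.1f}% (possible leak)")
    low, high = PRESSURE_BAND_KPA
    if not low <= reading.loop_pressure_kpa <= high:
        alerts.append(f"loop pressure out of band: {reading.loop_pressure_kpa:.0f} kPa")
    return alerts

print(leak_alerts(LoopReading(tank_level_pct=93.4, loop_pressure_kpa=168.0)))
```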

As AI technologies evolve, cooling systems must continue to be scalable and adaptable. Ensuring that these systems can be upgraded or expanded without significant additional costs or disruptions is crucial for maintaining long-term operational efficiency.

What lies ahead?

As data centres grapple with the soaring thermal demands of AI workloads, the adoption of advanced cooling technologies represents both a crucial necessity and a promising opportunity. Liquid and hybrid cooling solutions offer significant advantages in managing the intense heat generated by modern AI applications, facilitating greater efficiency and performance. Realising these benefits, however, means navigating a complex landscape of high costs, intricate integration processes, and specialised maintenance requirements.

The shift towards these innovative cooling methods necessitates careful planning and a strategic approach. Investing in advanced cooling infrastructure not only addresses immediate thermal management issues but also positions data centres to handle future technological advancements. As AI continues to drive innovation and expand its applications, data centres must remain agile and forward-thinking, integrating cooling solutions that can evolve with emerging demands.

The effective adoption of these technologies contributes to a broader goal of sustainability. By optimising cooling performance and reducing energy consumption, data centres can minimise their environmental footprint while supporting the growing needs of AI and other high-performance computing tasks. As the industry moves forward, overcoming the challenges associated with advanced cooling will be key to achieving operational excellence and fostering a more sustainable, technologically advanced future.
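One common way to quantify that footprint is Power Usage Effectiveness (PUE), the ratio of total facility power to IT power. The figures below are illustrative assumptions, but they show how cutting cooling overhead moves the headline metric.

```python
# Power Usage Effectiveness: PUE = total facility power / IT power.
# The overhead figures below are illustrative assumptions only.

def pue(it_kw: float, cooling_kw: float, other_overhead_kw: float) -> float:
    """Standard PUE ratio; 1.0 would mean zero facility overhead."""
    return (it_kw + cooling_kw + other_overhead_kw) / it_kw

# Same 1 MW IT load, with assumed cooling overheads for each design.
print(f"Air-cooled hall:   PUE ~{pue(1000, 450, 100):.2f}")  # ~1.55
print(f"Liquid-led design: PUE ~{pue(1000, 150, 100):.2f}")  # ~1.25
```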
