With the rapid development of cloud computing, big data, computing power, and AI technology, traditional air-cooling techniques in data centers are increasingly unable to cope with high heat density issues. Adopting more efficient liquid cooling technologies has become an inevitable outcome of AI advancement.
The AI server industry is now converging on liquid cooling. NVIDIA CEO Jensen Huang has stated that immersion liquid cooling is set to become the mainstream direction, bringing comprehensive innovation to server and data center cooling. NVIDIA’s choice is widely seen as a trendsetter for the industry and is expected to accelerate the growth of the global data center liquid cooling market.
For now, however, even with the advances in AI chip cooling that NVIDIA has made with the Blackwell architecture, cooling demand per chip keeps rising, and any move to liquid cooling has to be weighed against practical factors: alternative technical solutions, the condition of existing facilities, the overall cost of retrofitting, and each operator’s specific needs. As a result, the data center industry still relies predominantly on air cooling, and the Blackwell architecture will be offered in both air-cooled DGX and liquid-cooled MGX server solutions.
1. Blackwell Adoption of Air Cooling: A Departure from Convention?
After the release of NVIDIA’s Blackwell series, there were rumors in the market about servers using liquid cooling for heat dissipation.
From a technical perspective, liquid cooling falls into two main categories: cold plate liquid cooling and immersion liquid cooling. Cold plate liquid cooling attaches metal cold plates directly to CPUs and GPUs and carries heat away in the liquid flowing through them. Immersion cooling submerges server motherboards in a coolant, often a fluorinated liquid. Because retrofitting traditional data centers is costly and complex, and because cold plate solutions are more mature, cold plate cooling is currently the more commercially viable of the two.
According to OpenSecurities research, several Intel CPUs have reached a TDP (thermal design power) of 350W, NVIDIA’s H100 SXM has a TDP of 700W, and the B100 could reach around 1000W, close to the limit of what air cooling can remove from a single chip. Per-cabinet power is also rising toward the limit of air cooling, which is increasing demand for in-row air conditioning. In high-density scenarios, liquid cooling offers significant cost and performance advantages.
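A rough energy balance helps show why air runs out of headroom at these power levels: the coolant flow needed to carry away a heat load P with a coolant temperature rise ΔT is P / (ρ · c_p · ΔT). The sketch below is an illustration only, not a thermal design calculation. The 10 K temperature rise and the approximate room-temperature fluid properties are assumptions that do not come from the article, and `required_flow_m3_per_s` is a hypothetical helper name.

```python
# Rough energy-balance sketch: volumetric coolant flow needed to remove a heat load.
# Assumptions (not from the article): a 10 K coolant temperature rise and
# approximate room-temperature properties for air and water.

AIR = {"density": 1.2, "specific_heat": 1005.0}      # kg/m^3, J/(kg*K)
WATER = {"density": 998.0, "specific_heat": 4186.0}  # kg/m^3, J/(kg*K)


def required_flow_m3_per_s(power_w, fluid, delta_t_k=10.0):
    """Return the flow in m^3/s that absorbs power_w with a delta_t_k rise."""
    return power_w / (fluid["density"] * fluid["specific_heat"] * delta_t_k)


for tdp_w in (350, 700, 1000):  # TDP figures quoted in the text
    air = required_flow_m3_per_s(tdp_w, AIR)
    water = required_flow_m3_per_s(tdp_w, WATER)
    print(
        f"{tdp_w} W: air {air * 1000:.1f} L/s, "
        f"water {water * 60000:.2f} L/min, ratio {air / water:.0f}x"
    )
```

Under these assumptions, water needs a few thousand times less volume flow than air for the same heat load and temperature rise, because of its much higher density and specific heat. Real systems differ in allowable temperature rise, so the ratio is indicative only.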
However, despite the upcoming release of the Blackwell series, air cooling remains dominant. According to sources in the server supply chain, the B100 and B200 chips from the Blackwell series are expected to ship soon, with HGX systems featuring these chips slated for mass production in late Q3 this year, initially in small quantities. The GB200 series is expected to begin production in 2025.
Interestingly, server supply chain sources note that current AI server thermal designs for the B100 and B200 still primarily use 3D VC (vapor chamber) air cooling. With TDPs of 800W and 1000W for the B100 and B200 respectively, why do server manufacturers still prefer air cooling?
Industry experts say that for a heat dissipation module on its own, TDPs above 500W are indeed close to the limit of air cooling. Once chips are integrated onto motherboards, however, system-level design can improve airflow and heat dissipation beyond what the module alone can achieve. Liquid cooling will always have a higher heat dissipation capacity than air cooling, and chip cooling requirements will only increase despite NVIDIA’s thermal advances, but practical considerations still lead data centers to favor air cooling.
Current data shows that traditional air cooling technology still holds over 90% market share in data centers, with liquid cooling accounting for less than 10%. With the rapid construction of AI servers, industry experts predict that by 2025, the market share of liquid cooling could reach 30%, with a market size exceeding 80 billion yuan, and a compound annual growth rate of 55% over five years.
Additionally, the Blackwell architecture will simultaneously launch the air-cooled DGX and liquid-cooled MGX server models.
Although NVIDIA’s new generation products do not mandate liquid cooling, it is nearly essential for maximizing the potential of its flagship chips. For B100, B200, and GB200, the main differences lie in power and performance.
According to NVIDIA, these chips operate at between 700W and 1200W, depending on the specific model and cooling method. An air-cooled HGX B100 system can deliver 14 petaflops per GPU at the same power consumption as the H100. In the air-cooled HGX or DGX architecture, each B200 GPU can provide 18 petaflops of computing power at up to 1000 watts.
In AI data centers, however, shifting to liquid cooling is almost mandatory to unleash the full potential of Blackwell. In liquid-cooled configurations, chips can run at up to 1200W at full load while delivering 20 petaflops of performance. This poses significant challenges for existing facilities.
Industry observers note that unless a data center is newly built with water cooling designed in to handle the heat of future chip upgrades, including adjusted floor heights and pipe supports, operators of existing data centers will continue to prefer air cooling. Moreover, only the highest-performance AI chips, driven by the intensive computational demands of AI training, require such solutions.
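The power and performance figures quoted above can be put side by side as throughput per watt. The following is a minimal sketch using only the numbers in the text. The 700 W value for the HGX B100 is inferred from the statement that it matches the H100, whose TDP is given earlier as 700W, so treat it as an assumption. `teraflops_per_watt` is a hypothetical helper name.

```python
# Throughput per watt for the configurations quoted in the text.
# Assumption: the air-cooled HGX B100 draws 700 W, inferred from the statement
# that it matches the H100 (700 W TDP) in power consumption.

configs = [
    ("HGX B100, air-cooled", 14, 700),         # (label, petaflops, watts)
    ("B200, air-cooled HGX/DGX", 18, 1000),
    ("Liquid-cooled configuration", 20, 1200),
]


def teraflops_per_watt(petaflops, watts):
    """Convert petaflops at a given power draw into teraflops per watt."""
    return petaflops * 1000 / watts


for label, petaflops, watts in configs:
    ratio = teraflops_per_watt(petaflops, watts)
    print(f"{label}: {petaflops} PFLOPS at {watts} W -> {ratio:.1f} TFLOPS/W")
```

On these figures, peak throughput rises with the power budget while throughput per watt falls slightly, which suggests that liquid cooling here buys peak performance headroom rather than efficiency per chip.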
Nevertheless, the trend towards liquid cooling in the AI server industry has become a consensus. Yao Yong, Vice President of Sugon Digital, pointed out that as AI and cloud computing technologies advance, data center power densities will continue to increase, making liquid cooling an effective solution to high-density problems. It is expected that within three years, liquid cooling and air cooling will share the market equally.
2. Challenges Facing the Expansion of Liquid Cooling Applications
Whether from NVIDIA’s GTC and Computex 2024 conferences or the overall industry’s technological developments and trends, various signs indicate the critical importance of cooling technologies for high-performance chips and data centers.
Chen Zhenxian, founder of NeoGene Tech in Guangzhou, remarked, “When single high-performance chip power reaches 1000 watts, existing cooling technologies will undergo a revolution. In the future, the chip wars will turn into cooling wars.”
Major players in the industry are currently developing 3D VC air-cooled heat dissipation modules that can dissipate 600-700 watts with fans. Their drawback is bulk, which makes them transitional products for data centers and high-end computing. As AI and successive data center technology generations push requirements higher, heat dissipation module development is shifting towards more effective liquid cooling solutions.
As AI continues to expand and drive up computational demand, and as chip and server power gradually exceed the capabilities of air cooling, liquid cooling has emerged as one of the best solutions. Liquid cooling market sales are projected to reach approximately $4.269 billion in 2024 and $6.215 billion in 2025, a year-on-year increase of roughly 46%. Many industry experts anticipate 2024 as the year of liquid cooling, with 2025 poised for its full-scale adoption.
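The growth implied by the two sales projections can be checked directly. This is a small arithmetic sketch using only the 2024 and 2025 figures quoted above. The variable and function names are illustrative.

```python
# Year-on-year growth implied by the projected liquid cooling market sales.
sales_2024_busd = 4.269  # billions of US dollars, from the text
sales_2025_busd = 6.215  # billions of US dollars, from the text


def yoy_growth(previous, current):
    """Return fractional growth from previous to current."""
    return (current - previous) / previous


growth = yoy_growth(sales_2024_busd, sales_2025_busd)
print(f"Implied 2024 -> 2025 growth: {growth:.1%}")
```

The result is a little under 50%, in line with the growth rate described in the projection.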
However, while liquid cooling represents the future direction, several critical issues still hinder its widespread adoption and application.
A recent “Telecom Operator Liquid Cooling Technology White Paper,” released jointly by China’s three major operators, highlighted the challenges facing the liquid cooling industry. There are no unified interface standards for servers and cabinets, and server equipment, cooling fluids, refrigeration pipelines, power supply, and other components come in diverse forms, which prevents interoperability among products. Moreover, compared to traditional air cooling, liquid cooling requires higher initial investment and carries higher overall lifecycle costs, which limits large-scale deployment.
According to the China Data Center Research Institute (CDCC), surveyed liquid cooling technology developers, manufacturers, and users identified the most pressing needs for improvement as technical safety and reliability, cost control, and operation and management of liquid cooling systems. 76% of respondents agreed that continuous innovation and cost control are critical for industry development.
Immersion liquid cooling with fluorinated liquids, for example, submerges entire servers. Lu Xiaobao, Managing Director of Zhongke Chuangxing, noted that while this approach significantly enhances heat dissipation, the fluorinated liquid, nicknamed “Maotai” after the high-end Chinese liquor brand, costs more than the servers themselves, turning the coolant from a supporting role into a central one in data center construction.
Furthermore, despite its effectiveness, liquid cooling carries risks such as leakage. Industry insiders pointed out that quick-connect fittings for liquid-cooled servers are currently in short supply: leaks are unacceptable, and these fittings are particularly prone to them. The shortage could bottleneck the deployment of liquid-cooled AI servers even as NVIDIA’s Blackwell shipments approach.
Nevertheless, with the whole industry moving in the same direction, liquid cooling is gradually shifting from optional to essential. Cai Tong Securities argues that traditional air cooling cannot meet the heat dissipation demands of AI computing and expects the global market for liquid cooling-related products to reach billions by 2027. With the demand for large AI models and NVIDIA’s leadership in AI hardware, the global data center liquid cooling market is set to accelerate its growth.