Data Center GPUs Have Short Lifespans: 1-3 Years

The issue of shortened GPU lifespan is having a severe economic impact on the AI industry.

Tech Fund, citing a senior expert from Alphabet, reports that the lifespan of data center GPUs may be only one to three years, depending on utilization. Since GPUs do all the heavy lifting for AI training and inference, they are under constant heavy load and therefore degrade faster than other components.

In data centers operated by cloud service providers (CSPs), GPU utilization for AI workloads ranges from 60% to 70%. According to @techfund, under such utilization, GPUs typically last one to two years, up to a maximum of three years. This claim is attributed to a chief generative AI architect at Alphabet.

Since the identity of this self-proclaimed “GenAI chief architect from Alphabet” cannot be verified, the claim should be treated with caution. Nevertheless, many find it plausible: modern data center GPUs for AI and HPC workloads consume and dissipate 700W or more, which puts substantial thermal and electrical stress on a small piece of silicon.

One way to extend GPU lifespan is to run the hardware at lower utilization. However, a less heavily used GPU also generates returns more slowly, so the capital tied up in it takes longer to pay back, which is not ideal for business. As a result, most cloud service providers choose to run their GPUs at high utilization.

Earlier this year, Meta released a study on training the Llama 3 405B model on a cluster of 16,384 Nvidia H100 80GB GPUs. The cluster achieved a model FLOPs utilization (MFU) of about 38% (in BF16), but of the 419 unexpected interruptions during a 54-day pre-training snapshot, 148 (30.1%) were attributed to various GPU failures (including NVLink failures) and 72 (17.2%) to HBM3 memory errors.

Meta’s results actually look relatively favorable for the H100. If the GPUs and their memory keep failing at the rate Meta observed, these processors would have an annualized failure rate of around 9%, or roughly 27% cumulatively over three years, although failure rates may well climb after the first year of use.
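
For readers who want to check the arithmetic, here is a rough back-of-the-envelope sketch of how a ~9% annualized rate falls out of Meta's snapshot numbers. It assumes failures accrue linearly over time and that the 54-day window is representative; these are simplifying assumptions, not part of Meta's report:

```python
# Back-of-the-envelope estimate. Assumes failures scale linearly with time
# and that the 54-day snapshot is representative (simplifying assumptions).
gpus          = 16_384   # H100 GPUs in the cluster
snapshot_days = 54
gpu_failures  = 148      # interruptions attributed to GPU faults (incl. NVLink)
hbm3_failures = 72       # interruptions attributed to HBM3 memory errors

failures   = gpu_failures + hbm3_failures       # 220 failures in 54 days
annualized = failures * 365 / snapshot_days     # ~1,487 failures per year
annual_rate = annualized / gpus                 # ~0.09 -> roughly 9% per year
three_year  = annual_rate * 3                   # ~0.27 -> roughly 27% over three years

print(f"Annualized failure rate: {annual_rate:.1%}")
print(f"Three-year cumulative (linear extrapolation): {three_year:.1%}")
```

Multiplying the annual rate by three is only an approximation; real failure curves are rarely flat, and aging hardware tends to fail more often, not less.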

Impact on the AI Industry and Countermeasures

The problem of shortened GPU lifespan is significantly impacting the economics of the AI industry. A typical example is OpenAI, a leader in the field, which is reportedly on track to lose about $5 billion in 2024 despite strong backing from Microsoft. One major driver of this loss is the cost of the computing resources needed to train and operate large language models.

Additionally, Google continues to invest heavily in expanding its AI processing capacity, spending $13.2 billion on AI processing hardware in the second quarter of 2024 alone. However, these outlays are losing the long-term character that capital investment was previously assumed to have: if the hardware must be replaced on a cycle as short as three years, the outlook for recouping the investment inevitably changes.

To address this issue, some data center operators are trying to extend GPU lifespan by deliberately lowering utilization. This approach carries its own costs, however: less work done per GPU, longer equipment depreciation periods, and lower investment efficiency. The dilemma is becoming a structural challenge for the entire AI industry.

The AI industry is at a critical turning point as the reality of shortened GPU lifespans becomes impossible to ignore. This is not merely a technical issue; it has the potential to reshape the structure of the entire industry.

First, the very concept of capital investment needs to be reconsidered. Investment plans built around the traditional three-year depreciation period are no longer realistic, and AI companies will face growing pressure to monetize quickly, shifting toward short-term investment recovery strategies.
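
As a rough illustration of why the depreciation assumption matters (a toy straight-line sketch with an assumed purchase price, not a figure from any vendor or from the sources cited here), shortening an accelerator's useful life sharply raises its annualized capital cost:

```python
# Toy straight-line depreciation sketch. The $30,000 purchase price is an
# assumed, illustrative figure, not a quoted GPU price.
purchase_price = 30_000  # USD per accelerator (assumption)

for useful_life_years in (5, 3, 1.5):
    annual_cost = purchase_price / useful_life_years
    print(f"{useful_life_years}-year useful life -> ${annual_cost:,.0f} per GPU per year")

# Output:
# 5-year useful life -> $6,000 per GPU per year
# 3-year useful life -> $10,000 per GPU per year
# 1.5-year useful life -> $20,000 per GPU per year
```

In this toy model, cutting the useful life from three years to one and a half doubles the effective hardware cost per year, which is the kind of shift that forces the short-term recovery strategies described above.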

Ironically, this situation is further strengthening the market power of NVIDIA, which holds a dominant share of the GPU market. In June 2024 the company’s market capitalization reached $3 trillion, and steady GPU demand continues to support its growth.

However, the more fundamental challenge lies in the sustainability of the current AI business model. Given the reality of shorter hardware lifespans, the current approach to developing and operating large language models, which requires immense computing resources, may need to be fundamentally rethought.

Recently, Raghib Hussain, CTO of Marvell, made a similar point: the cost of AI computing is prohibitively high even for most leading chip manufacturers, which makes keeping pace with AI technology a major challenge.

Going forward, the AI industry will be forced to focus on more efficient model architectures and training methods, and competition to develop dedicated AI accelerators as alternatives to GPUs is likely to intensify. The problem of shortened GPU lifespan could thus influence the direction of AI technology development, and how companies address these technical and economic challenges will be crucial to their future competitiveness.

End-of-Yunze-blog

Disclaimer:

  1. This channel does not make any representations or warranties regarding the availability, accuracy, timeliness, effectiveness, or completeness of any information posted. It hereby disclaims any liability or consequences arising from the use of the information.
  2. This channel is non-commercial and non-profit. The re-posted content does not signify endorsement of its views or responsibility for its authenticity. It does not intend to constitute any other guidance. This channel is not liable for any inaccuracies or errors in the re-posted or published information, directly or indirectly.
  3. Some data, materials, text, images, etc., used in this channel are sourced from the internet, and all reposts are duly credited to their sources. If you discover any work that infringes on your intellectual property rights or personal legal interests, please contact us, and we will promptly modify or remove it.
